---
title: Ingesting Data Efficiently
slug: /developer-guide/ingesting-data-efficiently
---
import Tabs from "@theme/Tabs";
import TabItem from "@theme/TabItem";
import Image from '@theme/IdealImage';
import imgThread from '../assets/ingesting-data-efficiently-01.png';
This section describes how to write data to TDengine efficiently.
## Principles of Efficient Writing
### From the Client Application's Perspective
From the perspective of the client application, efficient data writing should consider the following factors:
- The amount of data written at once. Generally, the larger the batch of data written at once, the more efficient it is (but the advantage disappears beyond a certain threshold). When writing to TDengine using SQL, try to concatenate more data in one SQL statement. Currently, the maximum length of a single SQL statement supported by TDengine is 1,048,576 (1MB) characters.
- Number of concurrent connections. Generally, the more concurrent connections writing data at the same time, the more efficient it is (but efficiency may decrease beyond a certain threshold, depending on the server's processing capacity).
- Distribution of data across different tables (or subtables), i.e., the adjacency of the data being written. Generally, writing data to the same table (or subtable) in each batch is more efficient than writing to multiple tables (or subtables).
- Method of writing. Generally:
  - Parameter binding is more efficient than writing SQL, because it avoids SQL parsing (though it increases the number of calls to the C interface, which also carries a performance cost).
  - Writing SQL without automatic table creation is more efficient than with automatic table creation, because the latter must repeatedly check whether the table exists.
  - Writing SQL is more efficient than schema-less writing, because schema-less writing automatically creates tables and supports dynamic changes to the table structure.
Client applications should make full and appropriate use of these factors. In a single write operation, try to write data only to the same table (or subtable), and determine both the batch size and the number of concurrent connections through testing and tuning, so as to achieve the best writing speed for the current system.
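As a minimal sketch of the batching point above (assuming a database `test` with a subtable `d1001` of schema `(ts TIMESTAMP, current FLOAT)` already created, and the TDengine JDBC driver on the classpath; names and values are illustrative, not part of the sample program):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BatchInsertSketch {
    public static void main(String[] args) throws Exception {
        String url = System.getenv("TDENGINE_JDBC_URL");
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // One INSERT carrying many rows for the SAME subtable amortizes
            // network round-trips and SQL parsing across the whole batch.
            StringBuilder sql = new StringBuilder("INSERT INTO test.d1001 VALUES");
            long ts = System.currentTimeMillis();
            for (int i = 0; i < 3000; i++) {
                sql.append(" (").append(ts + i).append(", ").append(10.0f + i % 5).append(")");
            }
            // Keep the statement under the 1,048,576-character limit noted above.
            stmt.executeUpdate(sql.toString());
        }
    }
}
```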
### From the Data Source's Perspective
Client applications usually need to read data from a data source before writing it to TDengine. From the data source's perspective, the following situations require adding a queue between the reading and writing threads:
- There are multiple data sources, and the data generation speed of a single data source is much lower than the writing speed of a single thread, but the overall data volume is relatively large. In this case, the role of the queue is to aggregate data from multiple sources to increase the amount of data written at once.
- The data generation speed of a single data source is much greater than the writing speed of a single thread. In this case, the role of the queue is to increase the concurrency of writing.
- Data for a single table is scattered across multiple data sources. In this case, the role of the queue is to aggregate the data for the same table in advance, improving the adjacency of the data during writing.
If the data source for the writing application is Kafka, and the writing application itself is a Kafka consumer, then Kafka's features can be utilized for efficient writing. For example:
- Write data from the same table to the same Topic and the same Partition to increase data adjacency (see the producer sketch after this list).
- Aggregate data by subscribing to multiple Topics.
- Increase the concurrency of writing by increasing the number of Consumer threads.
- Increase the maximum amount of data fetched each time to increase the maximum amount of data written at once.
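A producer can achieve the first point by using the table name as the message key, so that all rows of one table land in the same partition. This is a hedged sketch assuming the Apache Kafka Java client; the broker address, topic name `meters`, and row format are illustrative:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String tableName = "d1001";                  // key: target subtable of this row
            String row = "1690000000000,10.3,219,0.31";  // value: ts,current,voltage,phase
            // Records with the same key always go to the same partition, so one
            // consumer sees all rows of a table together, improving adjacency.
            producer.send(new ProducerRecord<>("meters", tableName, row));
        }
    }
}
```

On the consumer side, the last point corresponds to raising settings such as `max.poll.records` so that each poll returns a larger batch.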
### From the Server Configuration's Perspective
From the server configuration's perspective, the number of vgroups should be set appropriately when creating the database based on the number of disks in the system, the I/O capability of the disks, and the processor's capacity to fully utilize system performance. If there are too few vgroups, the system's performance cannot be maximized; if there are too many vgroups, it will cause unnecessary resource competition. The recommended number of vgroups is typically twice the number of CPU cores, but this should still be adjusted based on the specific system resource configuration.
For more tuning parameters, please refer to Database Management and Server Configuration.
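As a hedged example, on a machine with 16 CPU cores a starting point might be 32 vgroups; the database name `test` matches the example program later in this section, and the value should be verified by testing:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateDatabaseSketch {
    public static void main(String[] args) throws Exception {
        String url = System.getenv("TDENGINE_JDBC_URL");
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // 16 CPU cores -> 32 vgroups as a starting point; verify by testing
            // against your own disks and workload before settling on a value.
            stmt.executeUpdate("CREATE DATABASE IF NOT EXISTS test VGROUPS 32");
        }
    }
}
```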
## Efficient Writing Example
### Scenario Design
The following example program demonstrates how to write data efficiently, with the scenario designed as follows:
- The TDengine client application continuously reads data from other data sources. In the example program, simulated data generation is used to mimic reading from data sources.
- The speed of a single connection writing to TDengine cannot match the speed of reading data, so the client application starts multiple threads, each establishing a connection with TDengine, and each thread has a dedicated fixed-size message queue.
- The client application hashes received data by table name (or subtable name) to different threads, i.e., writes it to the message queue corresponding to that thread, ensuring that data belonging to a given table (or subtable) is always processed by a fixed thread (see the sketch after this list).
- Each sub-thread writes the accumulated batch to TDengine after emptying its associated message queue or after reaching a predetermined data volume threshold, and then continues to process the data received afterwards.
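<Image img={imgThread} alt="Thread model of efficient writing example"/>

The hashing step can be as simple as the following sketch (the method name is hypothetical; the sample program below implements the same idea):

```java
public class DispatchSketch {
    // Non-negative hash of the table name modulo the queue count: rows of
    // the same table always map to the same queue, hence the same thread.
    static int queueIndexOf(String tableName, int queueCount) {
        return (tableName.hashCode() & Integer.MAX_VALUE) % queueCount;
    }

    public static void main(String[] args) {
        // d1001 maps to the same index on every call, so its data is never
        // split across write threads.
        System.out.println(queueIndexOf("d1001", 3));
    }
}
```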
### Sample Code
This section provides sample code for the above scenario. The principle of efficient writing is the same for other scenarios, but the code needs to be modified accordingly.
This sample code assumes that the source data belongs to different subtables of the same supertable (meters). The program creates this supertable in the test database before starting to write data. Subtables are created automatically by the application according to the received data. If the actual scenario involves multiple supertables, only the code for automatic table creation in the write task needs to be modified.
<Tabs defaultValue="java" groupId="lang">
<TabItem label="Java" value="java">

#### Program Listing
| Class Name       | Function Description |
|------------------|----------------------|
| FastWriteExample | Main program |
| ReadTask         | Reads data from a simulated source, hashes the table name to get the Queue Index, writes to the corresponding Queue |
| WriteTask        | Retrieves data from the Queue, forms a Batch, writes to TDengine |
| MockDataSource   | Simulates generating data for a certain number of meters subtables |
| SQLWriter        | WriteTask relies on this class to complete SQL stitching, automatic table creation, SQL writing, and SQL length checking |
| StmtWriter       | Implements parameter binding for batch writing (not yet completed) |
| DataBaseMonitor  | Counts the writing speed and prints the current writing speed to the console every 10 seconds |
Below are the complete codes and more detailed function descriptions for each class.
##### FastWriteExample
The main program is responsible for:

- Creating message queues
- Starting write threads
- Starting read threads
- Counting the writing speed every 10 seconds
The main program exposes 4 parameters by default, which can be adjusted each time the program is started, for testing and tuning:
- Number of read threads. Default is 1.
- Number of write threads. Default is 3.
- Total number of simulated tables. Default is 1,000. This will be evenly divided among the read threads. If the total number of tables is large, table creation takes longer, so the writing speed shown by the first few statistics may be correspondingly lower.
- Maximum number of records written per batch. Default is 3,000.
Queue capacity (taskQueueCapacity) is also a performance-related parameter, which can be adjusted by modifying the program. Generally, the larger the queue capacity, the lower the chance of being blocked when enqueuing and the greater the queue's throughput, but also the greater the memory usage. The default value in the sample program is already set sufficiently large.
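As a minimal sketch of this queue setup (the capacity value is illustrative; the complete FastWriteExample listing follows below):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueSetupSketch {
    public static void main(String[] args) {
        int writeTaskCount = 3;
        int taskQueueCapacity = 1_000_000; // larger = fewer blocked puts, more memory
        List<BlockingQueue<String>> queues = new ArrayList<>();
        for (int i = 0; i < writeTaskCount; i++) {
            queues.add(new ArrayBlockingQueue<>(taskQueueCapacity));
        }
        // Read tasks call queues.get(i).put(row), blocking when the queue is
        // full; write tasks drain their queue and flush at the batch threshold.
    }
}
```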
{{#include docs/examples/java/src/main/java/com/taos/example/highvolume/FastWriteExample.java}}
##### ReadTask
The read task is responsible for reading data from the data source. Each read task is associated with a simulated data source. Each simulated data source can generate data for a certain number of tables. Different simulated data sources generate data for different tables.
The read task writes to the message queue in a blocking manner. That is, once the queue is full, the write operation will be blocked.
{{#include docs/examples/java/src/main/java/com/taos/example/highvolume/ReadTask.java}}
##### WriteTask
{{#include docs/examples/java/src/main/java/com/taos/example/highvolume/WriteTask.java}}
##### MockDataSource
{{#include docs/examples/java/src/main/java/com/taos/example/highvolume/MockDataSource.java}}
##### SQLWriter
The SQLWriter class encapsulates the logic of SQL stitching and data writing. Note that none of the tables are created in advance; instead, they are created in batches using the supertable as a template when a table not found exception is caught, and then the INSERT statement is re-executed. For other exceptions, this simply logs the SQL statement being executed at the time; you can also log more clues to facilitate error troubleshooting and fault recovery.
{{#include docs/examples/java/src/main/java/com/taos/example/highvolume/SQLWriter.java}}
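The pattern described above can be condensed into the following hedged sketch (the method and SQL strings are illustrative, not the exact code above):

```java
import java.sql.SQLException;
import java.sql.Statement;

public class AutoCreateSketch {
    // Condensed form of the create-on-demand pattern: if the INSERT fails
    // because the subtable is missing, create it from the supertable
    // template, then re-execute the same statement.
    static void writeWithAutoCreate(Statement stmt, String insertSql, String createSql) throws SQLException {
        try {
            stmt.executeUpdate(insertSql);
        } catch (SQLException e) {
            // A real implementation should check the driver's "table does not
            // exist" error code here, and log and rethrow anything else.
            stmt.executeUpdate(createSql); // e.g. CREATE TABLE ... USING meters TAGS (...)
            stmt.executeUpdate(insertSql); // retry the original batch
        }
    }
}
```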
##### DataBaseMonitor
{{#include docs/examples/java/src/main/java/com/taos/example/highvolume/DataBaseMonitor.java}}
#### Execution Steps
##### Execute the Java Example Program
Before running the program, configure the environment variable `TDENGINE_JDBC_URL`. If the TDengine Server is deployed on the local machine, and the username, password, and port are all default values, then you can configure:

```shell
TDENGINE_JDBC_URL="jdbc:TAOS://localhost:6030?user=root&password=taosdata"
```
**Execute the example program in a local integrated development environment**
1. Clone the TDengine repository:

   ```shell
   git clone git@github.com:taosdata/TDengine.git --depth 1
   ```

2. Open the `docs/examples/java` directory with the integrated development environment.

3. Configure the environment variable `TDENGINE_JDBC_URL` in the development environment. If the global environment variable `TDENGINE_JDBC_URL` has already been configured, you can skip this step.

4. Run the class `com.taos.example.highvolume.FastWriteExample`.
**Execute the example program on a remote server**
To execute the example program on a server, follow these steps:
1. Package the example code. Execute in the directory TDengine/docs/examples/java:

   ```shell
   mvn package
   ```

2. Create an examples directory on the remote server:

   ```shell
   mkdir -p examples/java
   ```

3. Copy dependencies to the specified directory on the server:

   - Copy the dependency packages (only once):

     ```shell
     scp -r .\target\lib <user>@<host>:~/examples/java
     ```

   - Copy the jar package of this program (copy it every time the code is updated):

     ```shell
     scp -r .\target\javaexample-1.0.jar <user>@<host>:~/examples/java
     ```

4. Configure the environment variable. Edit `~/.bash_profile` or `~/.bashrc` and add, for example, the following content:

   ```shell
   export TDENGINE_JDBC_URL="jdbc:TAOS://localhost:6030?user=root&password=taosdata"
   ```

   The above uses the default JDBC URL for a TDengine Server deployed locally. Modify it according to your actual situation.

5. Start the example program with the java command, command template:

   ```shell
   java -classpath lib/*:javaexample-1.0.jar com.taos.example.highvolume.FastWriteExample <read_thread_count> <write_thread_count> <total_table_count> <max_batch_size>
   ```

6. End the test program. The test program will not end automatically; after obtaining a stable writing speed under the current configuration, press CTRL + C to end it. Below is a log output from an actual run, on a machine configured with 16 cores + 64 GB RAM + SSD.
```text
root@vm85$ java -classpath lib/*:javaexample-1.0.jar com.taos.example.highvolume.FastWriteExample 2 12
18:56:35.896 [main] INFO c.t.e.highvolume.FastWriteExample - readTaskCount=2, writeTaskCount=12 tableCount=1000 maxBatchSize=3000
18:56:36.011 [WriteThread-0] INFO c.taos.example.highvolume.WriteTask - started
18:56:36.015 [WriteThread-0] INFO c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:36.021 [WriteThread-1] INFO c.taos.example.highvolume.WriteTask - started
18:56:36.022 [WriteThread-1] INFO c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:36.031 [WriteThread-2] INFO c.taos.example.highvolume.WriteTask - started
18:56:36.032 [WriteThread-2] INFO c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:36.041 [WriteThread-3] INFO c.taos.example.highvolume.WriteTask - started
18:56:36.042 [WriteThread-3] INFO c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:36.093 [WriteThread-4] INFO c.taos.example.highvolume.WriteTask - started
18:56:36.094 [WriteThread-4] INFO c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:36.099 [WriteThread-5] INFO c.taos.example.highvolume.WriteTask - started
18:56:36.100 [WriteThread-5] INFO c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:36.100 [WriteThread-6] INFO c.taos.example.highvolume.WriteTask - started
18:56:36.101 [WriteThread-6] INFO c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:36.103 [WriteThread-7] INFO c.taos.example.highvolume.WriteTask - started
18:56:36.104 [WriteThread-7] INFO c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:36.105 [WriteThread-8] INFO c.taos.example.highvolume.WriteTask - started
18:56:36.107 [WriteThread-8] INFO c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:36.108 [WriteThread-9] INFO c.taos.example.highvolume.WriteTask - started
18:56:36.109 [WriteThread-9] INFO c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:36.156 [WriteThread-10] INFO c.taos.example.highvolume.WriteTask - started
18:56:36.157 [WriteThread-11] INFO c.taos.example.highvolume.WriteTask - started
18:56:36.158 [WriteThread-10] INFO c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:36.158 [ReadThread-0] INFO com.taos.example.highvolume.ReadTask - started
18:56:36.158 [ReadThread-1] INFO com.taos.example.highvolume.ReadTask - started
18:56:36.158 [WriteThread-11] INFO c.taos.example.highvolume.SQLWriter - maxSQLLength=1048576
18:56:46.369 [main] INFO c.t.e.highvolume.FastWriteExample - count=18554448 speed=1855444
18:56:56.946 [main] INFO c.t.e.highvolume.FastWriteExample - count=39059660 speed=2050521
18:57:07.322 [main] INFO c.t.e.highvolume.FastWriteExample - count=59403604 speed=2034394
18:57:18.032 [main] INFO c.t.e.highvolume.FastWriteExample - count=80262938 speed=2085933
18:57:28.432 [main] INFO c.t.e.highvolume.FastWriteExample - count=101139906 speed=2087696
18:57:38.921 [main] INFO c.t.e.highvolume.FastWriteExample - count=121807202 speed=2066729
18:57:49.375 [main] INFO c.t.e.highvolume.FastWriteExample - count=142952417 speed=2114521
18:58:00.689 [main] INFO c.t.e.highvolume.FastWriteExample - count=163650306 speed=2069788
18:58:11.646 [main] INFO c.t.e.highvolume.FastWriteExample - count=185019808 speed=2136950
```
</TabItem>
<TabItem label="Python" value="python">

#### Program Listing
The Python example program uses a multi-process architecture and employs a cross-process message queue.
| Function or Class            | Description |
|------------------------------|-------------|
| main function                | Entry point of the program; creates various subprocesses and message queues |
| run_monitor_process function | Creates the database and supertable, tracks write speed, and periodically prints it to the console |
| run_read_task function       | Main logic of the read processes; reads data from other data systems and distributes it to the assigned queues |
| MockDataSource class         | Simulates a data source; implements the iterator interface and returns the next 1,000 records for each table in batches |
| run_write_task function      | Main logic of the write processes; retrieves as much data as possible from the queue and writes it in batches |
| SQLWriter class              | Handles SQL writing and automatic table creation |
| StmtWriter class             | Implements batch writing with parameter binding (not yet completed) |
##### main function
The main function is responsible for creating message queues and launching subprocesses, which are of 3 types:
- 1 monitoring process, responsible for database initialization and tracking write speed
- n read processes, responsible for reading data from other data systems
- m write processes, responsible for writing to the database
The main function can accept 5 startup parameters, in order:
- Number of read tasks (processes), default is 1
- Number of write tasks (processes), default is 1
- Total number of simulated tables, default is 1,000
- Queue size (in bytes), default is 1,000,000
- Maximum number of records written per batch, default is 3,000
{{#include docs/examples/python/fast_write_example.py:main}}
##### run_monitor_process
The monitoring process is responsible for initializing the database and monitoring the current write speed.
{{#include docs/examples/python/fast_write_example.py:monitor}}
##### run_read_task function
The read process is responsible for reading data from other data systems and distributing it to its assigned queues.
{{#include docs/examples/python/fast_write_example.py:read}}
##### MockDataSource
Below is the implementation of the mock data source. We assume that each piece of data generated by the data source includes the target table name information. In practice, you might need certain rules to determine the target table name.
{{#include docs/examples/python/mockdatasource.py}}
##### run_write_task function
The write process retrieves as much data as possible from the queue and writes in batches.
{{#include docs/examples/python/fast_write_example.py:write}}
##### SQLWriter

The SQLWriter class encapsulates the logic of SQL stitching and data writing. None of the tables are pre-created; instead, they are batch-created using the supertable as a template when a "table does not exist" error occurs, and the INSERT statement is then re-executed. For other errors, the SQL being executed at the time is logged, to facilitate error troubleshooting and fault recovery. This class also checks whether the SQL exceeds the maximum length limit; under the TDengine 3.0 limit, the maximum supported SQL length of 1,048,576 characters is passed in via the input parameter maxSQLLength.

{{#include docs/examples/python/sql_writer.py}}
#### Execution Steps
##### Execute the Python Example Program
1. Prerequisites

   - TDengine client driver installed
   - Python3 installed, recommended version >= 3.8
   - taospy installed

2. Install faster-fifo to replace Python's built-in multiprocessing.Queue:

   ```shell
   pip3 install faster-fifo
   ```

3. Click the "View Source" link above to copy the `fast_write_example.py`, `sql_writer.py`, and `mockdatasource.py` files.

4. Execute the example program:

   ```shell
   python3 fast_write_example.py <READ_TASK_COUNT> <WRITE_TASK_COUNT> <TABLE_COUNT> <QUEUE_SIZE> <MAX_BATCH_SIZE>
   ```
Below is an actual output from a run, on a machine configured with 16 cores + 64 GB RAM + SSD.

```text
root@vm85$ python3 fast_write_example.py 8 8
2022-07-14 19:13:45,869 [root] - READ_TASK_COUNT=8, WRITE_TASK_COUNT=8, TABLE_COUNT=1000, QUEUE_SIZE=1000000, MAX_BATCH_SIZE=3000
2022-07-14 19:13:48,882 [root] - WriteTask-0 started with pid 718347
2022-07-14 19:13:48,883 [root] - WriteTask-1 started with pid 718348
2022-07-14 19:13:48,884 [root] - WriteTask-2 started with pid 718349
2022-07-14 19:13:48,884 [root] - WriteTask-3 started with pid 718350
2022-07-14 19:13:48,885 [root] - WriteTask-4 started with pid 718351
2022-07-14 19:13:48,885 [root] - WriteTask-5 started with pid 718352
2022-07-14 19:13:48,886 [root] - WriteTask-6 started with pid 718353
2022-07-14 19:13:48,886 [root] - WriteTask-7 started with pid 718354
2022-07-14 19:13:48,887 [root] - ReadTask-0 started with pid 718355
2022-07-14 19:13:48,888 [root] - ReadTask-1 started with pid 718356
2022-07-14 19:13:48,889 [root] - ReadTask-2 started with pid 718357
2022-07-14 19:13:48,889 [root] - ReadTask-3 started with pid 718358
2022-07-14 19:13:48,890 [root] - ReadTask-4 started with pid 718359
2022-07-14 19:13:48,891 [root] - ReadTask-5 started with pid 718361
2022-07-14 19:13:48,892 [root] - ReadTask-6 started with pid 718364
2022-07-14 19:13:48,893 [root] - ReadTask-7 started with pid 718365
2022-07-14 19:13:56,042 [DataBaseMonitor] - count=6676310 speed=667631.0
2022-07-14 19:14:06,196 [DataBaseMonitor] - count=20004310 speed=1332800.0
2022-07-14 19:14:16,366 [DataBaseMonitor] - count=32290310 speed=1228600.0
2022-07-14 19:14:26,527 [DataBaseMonitor] - count=44438310 speed=1214800.0
2022-07-14 19:14:36,673 [DataBaseMonitor] - count=56608310 speed=1217000.0
2022-07-14 19:14:46,834 [DataBaseMonitor] - count=68757310 speed=1214900.0
2022-07-14 19:14:57,280 [DataBaseMonitor] - count=80992310 speed=1223500.0
2022-07-14 19:15:07,689 [DataBaseMonitor] - count=93805310 speed=1281300.0
2022-07-14 19:15:18,020 [DataBaseMonitor] - count=106111310 speed=1230600.0
2022-07-14 19:15:28,356 [DataBaseMonitor] - count=118394310 speed=1228300.0
2022-07-14 19:15:38,690 [DataBaseMonitor] - count=130742310 speed=1234800.0
2022-07-14 19:15:49,000 [DataBaseMonitor] - count=143051310 speed=1230900.0
2022-07-14 19:15:59,323 [DataBaseMonitor] - count=155276310 speed=1222500.0
2022-07-14 19:16:09,649 [DataBaseMonitor] - count=167603310 speed=1232700.0
2022-07-14 19:16:19,995 [DataBaseMonitor] - count=179976310 speed=1237300.0
```
:::note When using the Python connector to connect to TDengine with multiple processes, there is a limitation: connections cannot be established in the parent process; all connections must be created in the child processes. If a connection is created in the parent process, any connection attempts in the child processes will be perpetually blocked. This is a known issue.
:::

</TabItem>
</Tabs>