Overview of DataInLong

Last updated: 2024-08-23 12:39:34

DataInLong quickly connects and integrates self-built data sources on and off the cloud. It addresses data integration and synchronization problems in scenarios such as data platform construction, database migration and backup, business upgrade and integration, data access acceleration, and full-text search. The Tencent Cloud product DataInLong provides data integration capabilities for WeData, supporting offline synchronization, real-time database monitoring, data reporting, and other synchronization features.

Use Limits

1. Data Synchronization: DataInLong only transfers data objects that can be abstracted as logical two-dimensional tables. This covers structured, semi-structured, and unstructured data (such as COS, provided that the specific data being synchronized can be abstracted as structured data). The FTP method can synchronize completely unstructured files (such as meteorological files) to HDFS, but this transfer method does not support data content extraction.
2. Network Connectivity: DataInLong supports synchronization between data stores within a single region and some cross-region synchronization scenarios. In some regions the classic network can be used for transmission, but connectivity is not guaranteed. If the classic network connectivity test fails, use the public network instead.
3. Task Operation: Running DataInLong tasks requires DataInLong resource groups. Create the integration resource groups before using DataInLong features. Integration resource groups are available as offline packages, real-time packages, and so on, and can be purchased according to the task type.
4. Data Consistency: DataInLong synchronization supports at-least-once delivery but does not fully guarantee exactly-once delivery (that is, completely duplicate-free data). Avoiding duplicates relies on primary keys and on the capabilities of the target end.
5. Data Types and Precision: In offline or real-time synchronization, pay attention to type matching and precision conversion for source and target fields. If the source and target types are incompatible, or if the target field type's maximum value is smaller than the source field's maximum value (or the minimum value is larger than the source end's minimum value, or the precision is lower than the source end's precision), there is a risk of write failure or precision truncation.
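The Data Consistency limit above can be illustrated with a minimal sketch. It assumes a hypothetical row shape of (primary key, payload) and models the target end as an in-memory structure; it does not use any DataInLong API. It shows why a target keyed on the primary key absorbs a redelivered row, while an append-only target keeps the duplicate.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[int, str]  # (primary_key, payload); hypothetical row shape


def append_write(target: List[Row], rows: Iterable[Row]) -> None:
    """Target without a primary key: every delivery is stored, duplicates included."""
    for row in rows:
        target.append(row)


def upsert_write(target: Dict[int, str], rows: Iterable[Row]) -> None:
    """Target keyed by primary key: a redelivered row overwrites itself."""
    for key, payload in rows:
        target[key] = payload


# At-least-once delivery: row 2 arrives twice, for example after a retry.
delivered: List[Row] = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]

appended: List[Row] = []
append_write(appended, delivered)

upserted: Dict[int, str] = {}
upsert_write(upserted, delivered)

print(len(appended))  # duplicate is kept
print(len(upserted))  # duplicate is absorbed by the key
```

Whether a real target behaves like the upsert case depends on its own write mode and key support, which is the "target-end capabilities" condition stated in the limit.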

Offline Synchronization

DataInLong provides offline data synchronization capabilities, which read data in bulk from source tables and periodically synchronize it to the target end. For details, refer to Offline Synchronization.

Real-Time Synchronization

DataInLong offers real-time data synchronization capabilities, supporting streaming data transmission. Real-time synchronization supports real-time data consumption at single-table, sharded table, and multi-database multi-table granularity. Task types include single table synchronization, whole database synchronization, and log collection.
Single Table Synchronization: The source end is a single table or sharded table, and the target end supports only one table. Single table synchronization uses a fixed schema pairing method, requiring you to specify the field mapping relationship between the source and target tables in the task. During task operation, only the specified source field content is written to the target field. For details, refer to Single Table Synchronization Task Configuration.
Whole Database Synchronization: Whole database synchronization copies all data from an entire source instance, or from multiple specified databases and tables, to multiple tables on the target end. This task type does not require you to specify the field mapping between the source and target ends. By default, all source table fields are read and matched to target fields by name. For details, refer to Whole Database Synchronization Task Configuration.
Log Collection: Log collection actively reports log file data from CVM cloud instances, self-built servers, or TKE to external target ends using Agent and SDK methods. For details, refer to Log Collection Task Configuration.
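The difference between the fixed field mapping of single table synchronization and the name matching of whole database synchronization can be sketched as follows. The function names and sample fields are hypothetical, and dropping source fields that have no same-named target field is an assumption of this sketch, not documented behavior.

```python
from typing import Any, Dict, Set


def apply_field_mapping(source_row: Dict[str, Any], mapping: Dict[str, str]) -> Dict[str, Any]:
    """Single table style: only the mapped source fields are written,
    each under the target field name given in the mapping."""
    return {target: source_row[source] for source, target in mapping.items()}


def match_by_name(source_row: Dict[str, Any], target_fields: Set[str]) -> Dict[str, Any]:
    """Whole database style: all source fields are read and paired with
    target fields of the same name (unmatched fields are dropped here)."""
    return {name: value for name, value in source_row.items() if name in target_fields}


source_row = {"id": 1, "name": "Ann", "note": "x"}

# Explicit mapping: "note" is not mapped, so it is not written.
mapped = apply_field_mapping(source_row, {"id": "user_id", "name": "user_name"})

# Name matching: the target table has no "note" column in this example.
matched = match_by_name(source_row, {"id", "name", "created_at"})

print(mapped)
print(matched)
```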

Concepts

Data Source
DataInLong treats data sources as the objects it reads from and writes to during synchronization. A data source can be a database or a data warehouse (for example, an EMR engine instance). Before configuring a DataInLong synchronization task, configure the source and target databases or data warehouses on the data source management page. After configuration, you select a data source by name in the synchronization task to control which database or data warehouse is read from and written to.
Network Connectivity
Before using DataInLong synchronization tasks, ensure network connectivity between the data sources (both the read end and the write end) and the DataInLong resource group. Make sure the resources are not blocked by allowlist restrictions; otherwise, data synchronization cannot be completed. For details, refer to Integration Connectivity and Usage Planning.
If the data source has public internet access: Purchase and create a NAT Gateway to allow integration resources to connect to the data source's VPC through the gateway. For detailed instructions, see the related NAT Gateway documentation.
If the data source is within a VPC:
If the data source is in the same VPC as the integration resource: You can use it directly.
If the data source is in a different VPC from the integration resource: Purchase a Peering Connection to interconnect the VPCs of the integration resource and the data source.
If the data source is in an IDC or another classic network environment: Purchase a VPN or Direct Connect Gateway to interconnect the VPCs of the integration resource and the data source.
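The cases above amount to a lookup from where the data source sits to the connectivity option to set up. A minimal sketch of that lookup, using hypothetical enum and function names that are not part of any DataInLong API:

```python
from enum import Enum


class SourceLocation(Enum):
    """Where the data source sits relative to the integration resource (hypothetical names)."""
    PUBLIC_INTERNET = "public internet"
    SAME_VPC = "same VPC"
    DIFFERENT_VPC = "different VPC"
    IDC_OR_CLASSIC = "IDC or classic network"


# Connectivity option for each case, as listed in the text above.
CONNECTIVITY_OPTIONS = {
    SourceLocation.PUBLIC_INTERNET: "NAT Gateway",
    SourceLocation.SAME_VPC: "Direct use, nothing to purchase",
    SourceLocation.DIFFERENT_VPC: "Peering Connection",
    SourceLocation.IDC_OR_CLASSIC: "VPN or Direct Connect Gateway",
}


def connectivity_option(location: SourceLocation) -> str:
    """Return the connectivity option to prepare for a data source location."""
    return CONNECTIVITY_OPTIONS[location]


print(connectivity_option(SourceLocation.DIFFERENT_VPC))
```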
Rate Limit
Rate limit is the maximum transmission speed allowed for DataInLong synchronization tasks.
Concurrency
Concurrency is the maximum number of parallel read or write operations allowed in a data synchronization task. The concurrency setting affects synchronization efficiency, and a higher setting uses more resources. Because of resource limits or the characteristics of the task itself, the actual concurrency during execution may be lower than the configured value.
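The relationship between the configured value and the actual value can be sketched as taking the smallest of several limits. The parameters `available_slots` (what the resource group can currently run) and `splits` (how many independent pieces the task can be divided into) are assumptions used for illustration; the source only states that resources and task characteristics can lower the actual value.

```python
def effective_concurrency(configured: int, available_slots: int, splits: int) -> int:
    """Return the parallelism a task would actually get in this sketch.

    configured      -- the maximum set on the task
    available_slots -- hypothetical number of free parallel slots in the resource group
    splits          -- hypothetical number of independent read/write splits in the task
    """
    if configured < 1:
        raise ValueError("configured concurrency must be at least 1")
    # The configured value is an upper bound; the other two can only lower it.
    return max(0, min(configured, available_slots, splits))


print(effective_concurrency(configured=8, available_slots=4, splits=10))
print(effective_concurrency(configured=8, available_slots=16, splits=3))
```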
Dirty Data
Dirty data is data that fails to be written during synchronization, whether because of a field type mismatch or because of an error when writing to the target data source. All failed write operations are classified as dirty data. For example, if a string value from the source is written to an INT target field and the write fails because the value cannot be converted, that record is dirty data.
In offline synchronization, you can set a dirty data threshold in the task to control the maximum number of dirty data entries during synchronization. If this threshold is exceeded, the task will be interrupted.
In real-time synchronization, you can configure how dirty data is archived, ensuring that failed write operations are archived into storage to maintain an uninterrupted real-time data flow.
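The two ways of handling dirty data can be sketched as follows. This is a minimal illustration under stated assumptions: the target field is INT, conversion is modeled with Python's `int()`, and the function and exception names are hypothetical rather than part of DataInLong.

```python
from typing import List, Tuple


class DirtyDataLimitExceeded(RuntimeError):
    """Raised when an offline run produces more dirty rows than the threshold allows."""


def sync_offline(rows: List[str], threshold: int) -> Tuple[List[int], List[str]]:
    """Offline style: count dirty rows and interrupt once the threshold is exceeded."""
    written: List[int] = []
    dirty: List[str] = []
    for value in rows:
        try:
            written.append(int(value))  # write to an INT target field
        except ValueError:
            dirty.append(value)
            if len(dirty) > threshold:
                raise DirtyDataLimitExceeded(
                    f"{len(dirty)} dirty rows exceed threshold {threshold}"
                )
    return written, dirty


def sync_realtime(rows: List[str], archive: List[str]) -> List[int]:
    """Real-time style: archive failed rows and keep the stream flowing."""
    written: List[int] = []
    for value in rows:
        try:
            written.append(int(value))
        except ValueError:
            archive.append(value)
    return written


rows = ["12", "abc", "7", "x1"]

written, dirty = sync_offline(rows, threshold=2)
print(written, dirty)

archive: List[str] = []
print(sync_realtime(rows, archive), archive)
```

With `threshold=1`, the same input raises `DirtyDataLimitExceeded` at the second dirty row, which corresponds to the task being interrupted.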