What is new in version 2.7, the latest version of the DataPipeline data fusion product?


Updated 2025-07-19. From: SLTechnology News&Howtos, Shulou (Shulou.com), 06/03 report.

Version 2.7 further optimizes the product's underlying data processing logic and focuses on improving the features users rely on for the daily management of data fusion tasks, operation monitoring and resource allocation. The goal is to help you manage data fusion tasks more intuitively, conveniently and stably, and to improve the ease of use and stability of the system.

I. New functions

1. View or configure important tasks, failed tasks, incomplete tasks, and tasks with performance concerns in the to-do list

Functional background:

Most data engineers have hundreds of tasks to configure, manage and monitor every day. The importance, timeliness and performance requirements of these tasks vary greatly, ranging from tasks that provide real-time computing data for online products to lower-priority tasks such as data backup. At the same time, new data fusion tasks keep emerging in response to changing market and business requirements, so data engineers need to add new data tasks while keeping existing tasks running stably.

Previously, a large number of tasks of different types and states were laid out side by side on the home page of the client. Important tasks did not get priority attention, incomplete tasks could be overlooked, and poorly performing tasks were hard to spot. Finding tasks, managing tasks and dealing with problems took up a large share of working time.

In the new version, users can mark important tasks, and the platform also evaluates and manages tasks according to their importance, configuration completeness, running status and running efficiency. Through the to-do list, users can directly see the important tasks they care about, the failed tasks that have problems, the tasks whose configuration still needs to be completed, and the tasks with low performance. This helps data engineers work more efficiently on daily task monitoring and on new requirements, and gives them an intuitive view of operational efficiency to ensure business continuity.

Feature details:

(1) Important tasks

Work items usually carry a priority, and so do data synchronization tasks. For important tasks, DataPipeline provides a star marker, and starred tasks are displayed first on the home page. Users can follow the status of important tasks in real time to keep them running stably.

(2) Failed tasks

This section lists the tasks that have failed, so that no problem is overlooked and task faults are handled comprehensively and in order.

(3) Inactive tasks

This section lists the tasks that are inactive, making clear which tasks still need to be completed or modified, so that the data engineer's task configuration work is comprehensive and orderly.

(4) Performance concerns

Based on the system's evaluation of task efficiency, the performance section shows the 10 batch tasks and real-time tasks with low transfer rates. By reviewing it, you can spot poorly running tasks in time and deal with them early, before performance problems cause more serious failures.
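The four to-do sections described above can be sketched as a simple classification over task records. This is only an illustration, not DataPipeline's implementation: the `Task` fields, the status values and the single combined "lowest 10" ranking are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Task:
    name: str
    starred: bool   # user marked the task as important
    status: str     # hypothetical values: "running", "failed", "inactive"
    rate: float     # transfer rate in a hypothetical unit, e.g. rows per second


def build_todo_list(tasks: List[Task], slowest: int = 10) -> Dict[str, List[str]]:
    """Group task names into the four to-do sections."""
    running = [t for t in tasks if t.status == "running"]
    # Lowest transfer rate first; keep at most `slowest` entries.
    low_rate = sorted(running, key=lambda t: t.rate)[:slowest]
    return {
        "important": [t.name for t in tasks if t.starred],
        "failed": [t.name for t in tasks if t.status == "failed"],
        "inactive": [t.name for t in tasks if t.status == "inactive"],
        "performance": [t.name for t in low_rate],
    }


tasks = [
    Task("orders_realtime", starred=True, status="running", rate=1200.0),
    Task("nightly_backup", starred=False, status="failed", rate=0.0),
    Task("crm_sync_draft", starred=False, status="inactive", rate=0.0),
    Task("logs_batch", starred=False, status="running", rate=35.0),
]
todo = build_todo_list(tasks, slowest=1)
print(todo)
```

A task can appear in more than one section, for example a starred task that is also among the slowest, which matches the idea that the sections are views over the same task list rather than exclusive categories.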

2. Tasks can be managed in groups by project

Functional background:

In previous versions, DataPipeline helped users synchronize data from multiple sources with different structures. However, as the product came into deeper use and the number of system users and data tasks kept growing, the data fusion tasks of multiple projects became mixed together, which made task configuration, monitoring and management inconvenient.

We have learned that a data engineer may need to manage multiple projects at the same time, and each project may contain dozens or hundreds of data fusion tasks. When data fusion tasks cannot be grouped by project, finding them means searching by name, data node and other details recalled from memory, which is time-consuming and laborious.

Therefore, DataPipeline adds the function of grouping tasks according to the project, and users can manage hundreds of tasks according to the project to which the task belongs, which greatly improves the efficiency.

Feature details:

(1) Support creating custom projects and grouping tasks under them

(2) Support changing the group of multiple tasks at once by selecting them with checkboxes
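As a rough sketch of the two capabilities above, project grouping can be modeled as a mapping from task to project, with one bulk operation for reassigning the selected tasks. The function names, the `UNGROUPED` label and the sample task names are hypothetical and not part of the product.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional

UNGROUPED = "(ungrouped)"  # hypothetical label for tasks without a project


def move_tasks(assignments: Dict[str, Optional[str]],
               selected: Iterable[str],
               project: str) -> None:
    """Reassign every selected (checked) task to `project` in one operation."""
    selected = list(selected)
    unknown = [name for name in selected if name not in assignments]
    if unknown:
        raise KeyError(f"unknown tasks: {unknown}")
    for name in selected:
        assignments[name] = project


def tasks_by_project(assignments: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Build the project -> tasks view used for browsing."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, project in assignments.items():
        groups[project or UNGROUPED].append(name)
    return dict(groups)


assignments = {"orders_sync": "retail", "stock_sync": None, "ledger_sync": None}
move_tasks(assignments, ["stock_sync", "ledger_sync"], "finance")
groups = tasks_by_project(assignments)
print(groups)
```

Validating the whole selection before changing anything keeps a bulk move all-or-nothing, so a typo in one task name does not leave the selection half moved.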

3. You can configure specific resource groups for tasks

Functional background:

Although the DataPipeline data fusion product is built on a parallel computing framework and supports task-level high availability at the infrastructure level, resource group management was not exposed to users. In previous versions of DataPipeline, all data tasks ran in a single default resource group, so it was impossible to allocate running resources according to the importance of a task.

As a result, the only way to guarantee stable and efficient operation of important tasks was to configure separate clusters for them. In practice this approach runs into many constraints, such as the difficulty of applying for system resources and cost budget control, which caused considerable trouble for data engineers and users.

Therefore, we decided to open the configuration and allocation of system resource groups in the new version, and we plan to open the dynamic resource provisioning feature in future releases.

For example, if the system runs on a single server with 16 cores and 64 GB of memory (16C64G) and resource groups cannot be assigned, the tasks run as follows:

After the resource group configuration is open, users can configure an important task resource group and a general task resource group. The task running status is as follows:

Even when an important task starts later than the ordinary tasks, it is allocated to an independent resource group, so enough resources remain to keep it running smoothly.
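The effect described above can be illustrated with a toy allocation model. It is a sketch under stated assumptions: the task names, the per-task core requests and the 8/8 split between groups are invented for the example, memory is ignored, and the real scheduler is not described in this article.

```python
from typing import Dict, List, Tuple

TOTAL_CORES = 16  # the 16C64G server from the example; memory is ignored here

# (task name, resource group, cores requested), listed in start order.
# The important task starts last, as in the scenario above.
TASKS: List[Tuple[str, str, int]] = [
    ("backup_a", "general", 6),
    ("backup_b", "general", 6),
    ("report_c", "general", 4),
    ("online_feed", "important", 6),
]


def allocate(tasks: List[Tuple[str, str, int]],
             group_cores: Dict[str, int]) -> Dict[str, int]:
    """Grant cores in start order; a task gets only what is left in its group."""
    free = dict(group_cores)
    granted: Dict[str, int] = {}
    for name, group, wanted in tasks:
        got = min(wanted, free[group])
        free[group] -= got
        granted[name] = got
    return granted


# Single default group: every task draws from the same 16 cores.
shared = allocate([(n, "default", c) for n, _, c in TASKS],
                  {"default": TOTAL_CORES})

# Two groups (assumed split): 8 cores for important tasks, 8 for general tasks.
split = allocate(TASKS, {"important": 8, "general": 8})

print("shared pool:", shared)
print("resource groups:", split)
```

In the shared pool the late-starting `online_feed` task finds no cores left, while with a reserved group it receives its full request. The cost is that general tasks are capped at their own group's share even while the important group is idle, which is the gap that dynamic resource provisioning would address.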

Feature details:

(1) Resource group configuration

When deploying DataPipeline, you can modify the configuration file to divide the server resources on the data source / destination side into multiple resource groups, so that the resources used by different businesses are decoupled.

Resource group configuration file path:

/data/datapipeline/dpconfig/resource_group_config.json

There are two resource group configuration files, one on the source side and one on the destination side. An example of a resource group configuration file is as follows:

The configuration details are as follows:

Note: after modifying the configuration file, you need to restart the service for the resource group configuration to take effect.

(2) Assign resource groups to read and write tasks

During task setup, users can select the resource group in which each task runs, separately for its data reading and its data writing.

II. Optimized functions

1. Finer-grained splitting of message queues for data transmission

Functional background:

In order to better support efficient data fusion tasks, DataPipeline further splits and optimizes the granularity of data transmission message queues.

Feature details:

First, let's take a look at how data flows in DataPipeline:

Consider the following user scenario. The source data node is DB1, which contains three data tables: T1, T2 and T3. The destination data node is DB2, which contains three data tables: T4, T5 and T6. The data fusion requirement is that the data in T1, T2 and T3 is merged and written into T4, the data in T2 is synchronized to T5, and the data in T3 is synchronized to T6.

In the previous processing logic (figure 1), message queues were established at the granularity of the destination write requirement. The data of T1, T2 and T3 was therefore written to a single message queue for caching, shown as message queue 1 in figure 1.

Figure 1

This caching mechanism supports the data synchronization of T4 well. However, because all the data enters a single message queue, the T1, T2 and T3 data in the cache has to be split apart when synchronizing T5 and T6, and processing efficiency is low.

DataPipeline therefore splits and optimizes the message queue cache granularity in data transmission (figure 2). Message queues are now split at the granularity of the source data table, so the data of source tables T1, T2 and T3 is written to three separate message queues for caching.

Figure 2

The task that synchronizes to T4 reads the message queues corresponding to T1, T2 and T3, merges the data, writes it to a merged message queue, and the result is then consumed by the consumption unit corresponding to T4. The tasks that synchronize to T5 and T6 read the message queues corresponding to T2 and T3 respectively and write the data directly.

In this way, we can support both the merged synchronization of multiple source tables and the separate synchronization of any one of them. Because the split queues are read concurrently, the data synchronization rates of T2 to T5 and T3 to T6 improve significantly. For merging and synchronizing T1, T2 and T3 into T4, an extra merge step between message queues is added, but its impact on the rate is small, so the scenario above is still well supported.
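The per-table queue layout of figure 2 can be sketched in a few lines. This is a simplified in-memory model, not the product's message queue: the function names are hypothetical, and reads are assumed to be non-destructive (as in a log-style queue) so that the T2 queue can feed both the T4 merge and the T5 synchronization.

```python
from typing import Dict, List

# One append-only queue per source table (the split layout of figure 2).
queues: Dict[str, List[dict]] = {"T1": [], "T2": [], "T3": []}


def publish(table: str, row: dict) -> None:
    """Cache a row in the queue of the source table it came from."""
    queues[table].append({"source": table, **row})


def read_all(table: str) -> List[dict]:
    """Return a copy of everything cached for one source table."""
    return list(queues[table])


def merged_queue(tables: List[str]) -> List[dict]:
    """Read several per-table queues and combine them into one merged queue."""
    merged: List[dict] = []
    for table in tables:
        merged.extend(read_all(table))
    return merged


publish("T1", {"id": 1})
publish("T2", {"id": 2})
publish("T3", {"id": 3})
publish("T2", {"id": 4})

t4_rows = merged_queue(["T1", "T2", "T3"])  # merged write to T4
t5_rows = read_all("T2")                    # T2 -> T5, no splitting needed
t6_rows = read_all("T3")                    # T3 -> T6, no splitting needed

print(len(t4_rows), [r["id"] for r in t5_rows], [r["id"] for r in t6_rows])
```

With one shared queue, the T5 and T6 consumers would have to filter every cached row by its source table. With per-table queues that filtering disappears, and only the T4 path pays for the extra merge step.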

2. Support for flexible modification of data source / destination configuration information in any data synchronization task

By supporting the flexible modification of data source / destination configuration information in any data synchronization task, the data node configuration can take effect globally and improve the efficiency of task configuration.

All configuration items except the data source / destination type can be modified. Modification is not allowed while other tasks are running on the data source, and the modified data source / destination node configuration takes effect globally.
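The three rules above (the type is fixed, no modification while other tasks are running, changes apply globally) can be sketched as a guard around a shared node object. The `DataNode` class, its fields and the sample values such as `"mysql"` are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataNode:
    name: str
    node_type: str                      # fixed once the node is created
    config: Dict[str, str]
    running_tasks: List[str] = field(default_factory=list)


def update_node(node: DataNode, editing_task: str, changes: Dict[str, str]) -> None:
    """Apply configuration changes made from within `editing_task`."""
    if changes.get("node_type", node.node_type) != node.node_type:
        raise ValueError("the data source/destination type cannot be changed")
    others = [t for t in node.running_tasks if t != editing_task]
    if others:
        raise RuntimeError(f"other tasks are running on {node.name}: {others}")
    node.config.update({k: v for k, v in changes.items() if k != "node_type"})


# Both tasks reference the same node object, so an edit is visible to both.
node = DataNode("DB1", "mysql", {"host": "10.0.0.1", "port": "3306"})
task_a_node = node
task_b_node = node

update_node(node, "task_a", {"host": "10.0.0.2"})
host_seen_by_b = task_b_node.config["host"]

node.running_tasks.append("task_b")
try:
    update_node(node, "task_a", {"port": "3307"})
    blocked = False
except RuntimeError:
    blocked = True

print(host_seen_by_b, blocked)
```

Keeping a single node record that every task references is what makes the change global, and it is also why the edit has to be refused while another task is actively using that node.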

III. Other functional enhancements and bug fixes

In addition to the above functions, DataPipeline also enhances and fixes the product from the following aspects:

1. Support changing the email address in user registration information

2. Add text labels to the copy, edit, delete and other buttons on the data task page

3. Optimize the thread heartbeat of real-time tasks to support operation and maintenance monitoring

4. Optimize the metadata query SQL and related logic, and fix index queries

5. Hive data source reconstruction and optimization.

6. Verification and optimization of Hive Kerberos

7. Fix task stalls caused by JDBC connections

Each iteration of DataPipeline embodies the team's in-depth thinking about, and active exploration of, the needs of enterprise data management. We hope that in this special period the new version can help you integrate, use and analyze data with greater agility and efficiency.

