How to use Hive to merge small files

2025-07-06 Update. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article explains how to use Hive to merge small files. It analyzes the problem and walks through a solution in detail, in the hope of giving anyone who faces the same issue a simple way to fix it.

Problem background

The cluster is currently in a very unhealthy state, mainly because it holds too many small files. The threshold for the number of blocks on a single DataNode is 500,000, while a single DataNode now holds roughly 2.63 million blocks, about five times the threshold. All DataNodes are currently in a yellow (unhealthy) state.

Small files put heavy pressure directly on the NameNode, which undermines HDFS stability and degrades the performance of everyday reads and writes. Checkpoint alarms already occur frequently on the cluster's NameNode.

We collected detailed statistics on the cluster: the number of directories, file sizes, file counts, and the number of Hive tables, databases, and partitions. The main finding is that the HDFS directories hold too many small files, including a large number of 1 KB files and even files smaller than 1 KB. Specifically, regardless of how much data a table or partition holds, each partition contains 200 files, and each unpartitioned table contains 200 files. Because many of the tables are small, this produces a severe small-file problem.

There are two main ways to address the problem. The first is to fix it at the source by optimizing the data import jobs so that they produce fewer small files; this has to be done by the business side. The second is to merge the small files that already exist on the platform. This article describes the second approach.

Original table condition

Analysis of file counts and sizes across the cluster shows that the small files are found almost entirely in Hive tables. Further analysis shows 200 small files in each partition; merging them reduces the file count and so alleviates the problem.

The sample table test_part has 20 rows of data and is partitioned by the date_str field.

There are five partitions

Each partition has four files.
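The partition and file counts above can be checked from Hive itself. The following is a minimal sketch, assuming the sample table test_part described here; it relies on the Hive virtual column INPUT__FILE__NAME, so files that contain no rows are not counted.

```sql
-- List the partitions of the sample table.
SHOW PARTITIONS test_part;

-- Count the data files and rows behind each partition.
-- INPUT__FILE__NAME is a Hive virtual column that holds the file
-- each row was read from, so empty files do not appear here.
SELECT date_str,
       COUNT(DISTINCT INPUT__FILE__NAME) AS file_cnt,
       COUNT(*)                          AS row_cnt
FROM test_part
GROUP BY date_str;
```

Running the same query again after the merge gives a direct before-and-after comparison.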

Execution process

The overall execution process is as follows:

1. Use CREATE TABLE ... LIKE ... to create a standby table with the same structure as the original table.

2. Set the parameters that enable file merging, then use an INSERT OVERWRITE statement to read the data from the original table and write it into the standby table.

3. After confirming that the data in the two tables is consistent, delete the original table and use an ALTER TABLE statement to rename the standby table to the original table's name.

Scheme description

Create a new standby table and keep the table structure consistent with the original table

CREATE TABLE test_part_bak LIKE test_part;

Set the following parameters to support merging

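-- Merge small files at the end of map-only and map-reduce jobs
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- Target size of each merged file, in bytes (about 256 MB)
SET hive.merge.size.per.task=256000000;
-- Run the extra merge step when the average output file is smaller than this (128 MiB)
SET hive.merge.smallfiles.avgsize=134217728;
-- Compress the output, using Snappy for Parquet files
SET hive.exec.compress.output=true;
SET parquet.compression=snappy;
-- Allow every partition to be assigned dynamically
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;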
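The two hive.merge.map* switches apply to MapReduce jobs. The article does not say which execution engine the cluster uses, so the sketch below is an assumption-laden check rather than part of the original procedure: it prints the current engine and lists the engine-specific merge switches that Hive provides for Tez and Spark.

```sql
-- Print the execution engine currently in use (mr, tez, or spark).
SET hive.execution.engine;

-- Assumption: the cluster may not be running on MapReduce.
-- If the engine is tez or spark, uncomment the matching line,
-- because the map* switches above do not cover those engines.
-- SET hive.merge.tezfiles=true;
-- SET hive.merge.sparkfiles=true;
```

If the engine is mr, nothing beyond the settings above is needed.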

Use an INSERT OVERWRITE statement to read the data from the original table and write it into the standby table, overwriting whatever the standby table holds

INSERT OVERWRITE TABLE test_part_bak PARTITION (date_str) SELECT * FROM test_part;

The data of the standby table is consistent with the original table.
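One way to confirm this before touching the original table is to compare row counts partition by partition. This is only a sketch: matching counts are a necessary check, not proof that every value is identical, and the column aliases are illustrative.

```sql
-- Return only the partitions whose row counts differ between the
-- original table and the standby table. An empty result means the
-- counts match in every partition.
SELECT COALESCE(o.date_str, b.date_str) AS date_str,
       o.cnt AS original_rows,
       b.cnt AS standby_rows
FROM (SELECT date_str, COUNT(*) AS cnt FROM test_part GROUP BY date_str) o
FULL OUTER JOIN
     (SELECT date_str, COUNT(*) AS cnt FROM test_part_bak GROUP BY date_str) b
  ON o.date_str = b.date_str
WHERE o.cnt IS NULL
   OR b.cnt IS NULL
   OR o.cnt <> b.cnt;
```

The full outer join also surfaces partitions that exist in only one of the two tables.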

Delete the original table and rename the standby table to the original table's name

DROP TABLE test_part;

ALTER TABLE test_part_bak RENAME TO test_part;
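Dropping the original table is irreversible, so a more cautious variant of the swap keeps the original under a temporary name until the merged table has been checked. This is a sketch of that alternative, not the procedure the article ran; test_part_old is a hypothetical name.

```sql
-- Move the original table aside instead of dropping it first.
-- test_part_old is a hypothetical name chosen for illustration.
ALTER TABLE test_part RENAME TO test_part_old;

-- Promote the merged standby table to the original name.
ALTER TABLE test_part_bak RENAME TO test_part;

-- Drop the old copy only after the merged table has been verified.
DROP TABLE test_part_old;
```

The cost of this variant is that both copies occupy HDFS space until the final DROP.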

The table data does not change after merging.

The table structure is consistent.

As the HDFS file system shows, the number of partitions is unchanged, and the small files in each partition have been merged into a single file.
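The same result can also be read from the table statistics without browsing HDFS. The sketch below assumes the merged table now carries the name test_part; the partition value is a hypothetical placeholder, since the article does not list the actual values.

```sql
-- Refresh file-level statistics for every partition.
-- NOSCAN gathers only the number of files and their physical size,
-- so no data is read.
ANALYZE TABLE test_part PARTITION (date_str) COMPUTE STATISTICS NOSCAN;

-- Inspect one partition; numFiles and totalSize are listed under
-- "Partition Parameters". The value '2020-01-01' is hypothetical.
DESCRIBE FORMATTED test_part PARTITION (date_str='2020-01-01');
```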
