2025-01-15 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article explains how to use Hive to merge small files. It walks through the analysis of the problem and a solution in detail, for readers who are facing the same issue and want a simple way to handle it.
Problem background
The cluster is currently in a very unhealthy state, mainly because it holds too many small files. The threshold for the number of blocks on a single DataNode is 500,000, while each DataNode now holds about 2.63 million blocks, roughly five times the threshold. All DataNodes are in a yellow, unhealthy state.
Small files put heavy pressure on the NameNode, which threatens the stability of HDFS and degrades the performance of day-to-day HDFS reads and writes. The NameNode of the cluster is already raising checkpoint alarms frequently.
Detailed data was collected on the number of directories, file sizes, file counts, and the number of Hive tables, Hive databases, and Hive partitions in the cluster. The HDFS directories turned out to contain a large number of small files, many of them around 1 KB or even smaller. Specifically, regardless of the amount of data, every partition holds 200 files, and every unpartitioned table holds 200 files. Because many of the tables are small, the result is a serious small-file problem.
There are two ways to address this. The first is to fix the problem at the source by optimizing the data loading jobs so that they produce fewer small files; this has to be done by the business side. The second is to merge the small files that already exist on the platform. This article describes the second approach.
Original table condition
Analysis of file counts and sizes across the cluster shows that the small files are concentrated in Hive tables. Further analysis shows that each partition holds 200 small files, which can be merged to reduce the file count and ease the problem.
The sample table test_part has 20 rows of data, partitioned by field date_str
There are five partitions
Each partition has four files.
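One way to check the file count per partition from inside Hive is the INPUT__FILE__NAME virtual column, which holds the path of the file each row was read from. The query below is a minimal sketch against the sample table; note that it only counts files containing at least one row, so empty files would not show up.

```sql
-- Count the distinct data files behind each partition of test_part.
-- INPUT__FILE__NAME is a Hive virtual column: the file a row was read from.
SELECT date_str,
       COUNT(DISTINCT INPUT__FILE__NAME) AS file_count,
       COUNT(*)                          AS row_count
FROM test_part
GROUP BY date_str;
```

Running the same query again after the merge gives a quick before-and-after comparison.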
Execution process
The overall execution process is as follows:
1. Use create table ... like ... to create a standby table with the same structure as the original table.
2. Set the parameters that enable merging, then use an insert overwrite statement to read the data from the original table and write it into the standby table.
3. After confirming that the data in the two tables is consistent, drop the original table and use an alter statement to rename the standby table to the original table name.
Scheme description
Create a new standby table and keep the table structure consistent with the original table
CREATE TABLE test_part_bak LIKE test_part;
Set the following parameters to support merging
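-- Merge small output files at the end of map-only and map-reduce jobs
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- Target size of merged files, and the average output file size below which a merge job is started (bytes)
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=134217728;
-- Compress the query output, using Snappy for Parquet files
SET hive.exec.compress.output=true;
SET parquet.compression=snappy;
-- Allow inserts where every partition value is determined dynamically
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;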
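Before running the insert, it can help to confirm that the settings took effect in the current session. In Hive, SET followed by a property name and no value prints the current value. The sketch below also prints the execution engine, because hive.merge.mapfiles and hive.merge.mapredfiles apply to MapReduce jobs; the original article does not say which engine the cluster uses, and on Tez the corresponding switch is hive.merge.tezfiles.

```sql
-- Print the current value of each merge-related setting.
SET hive.merge.mapfiles;
SET hive.merge.mapredfiles;
SET hive.merge.size.per.task;
SET hive.merge.smallfiles.avgsize;
-- Print the execution engine (for example mr or tez).
SET hive.execution.engine;
```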
Use an insert overwrite statement to read the data from the original table and overwrite the standby table
INSERT OVERWRITE TABLE test_part_bak PARTITION (date_str) SELECT * FROM test_part;
The data of the standby table is consistent with the original table.
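One simple consistency check is to compare row counts per partition between the two tables. The query below is a sketch of that idea: it returns only the partitions whose counts differ or that exist in just one table, so an empty result means the counts match. It does not prove that the row contents are identical.

```sql
-- List partitions whose row counts differ between the original and standby tables.
SELECT COALESCE(o.date_str, b.date_str) AS date_str,
       o.row_count AS original_rows,
       b.row_count AS standby_rows
FROM (SELECT date_str, COUNT(*) AS row_count
      FROM test_part
      GROUP BY date_str) o
FULL OUTER JOIN
     (SELECT date_str, COUNT(*) AS row_count
      FROM test_part_bak
      GROUP BY date_str) b
  ON o.date_str = b.date_str
WHERE o.row_count IS NULL
   OR b.row_count IS NULL
   OR o.row_count <> b.row_count;
```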
Drop the original table and rename the standby table to the original table name
DROP TABLE test_part;
ALTER TABLE test_part_bak RENAME TO test_part;
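For a managed table, dropping it also deletes the underlying data, so a more cautious variant is to keep the original table under another name until the merged table has been checked. The following is a minimal sketch of that variant, not the procedure from the original article; test_part_old is a hypothetical name chosen for illustration.

```sql
-- Keep the original table under a temporary name instead of dropping it right away.
-- test_part_old is a hypothetical name, not part of the original article.
ALTER TABLE test_part RENAME TO test_part_old;
-- Move the merged standby table into place.
ALTER TABLE test_part_bak RENAME TO test_part;
-- Once the merged table has been verified, remove the old copy.
DROP TABLE test_part_old;
```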
The table data does not change after merging.
The table structure is consistent.
As the HDFS file system shows, the number of partitions is unchanged, and the small files in each partition have been merged into a single file.
That is how to use Hive to merge small files. I hope the content above is of some help to you.