2025-09-24 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 05/31 Report --
This article explains how a MapReduce program in Hadoop can process GBK-encoded data and produce GBK-encoded output.
Hadoop Chinese-encoding issues: making a MapReduce program read and write GBK-encoded data
When Hadoop processes GBK text, the output comes out garbled. The cause is that Hadoop hard-codes UTF-8 wherever encoding is involved, so input in any other encoding (such as GBK) is mis-decoded into mojibake.

The fix is to transcode whenever a Text value is read in the mapper or reducer, e.g. transformTextToUTF8(text, "GBK"), so that the job runs internally in UTF-8:
public static Text transformTextToUTF8(Text text, String encoding) {
    // Decode the raw bytes using the source encoding (e.g. "GBK").
    // The resulting String is Unicode; wrapping it in new Text(value)
    // re-encodes it as UTF-8, which is what Hadoop expects internally.
    String value = null;
    try {
        value = new String(text.getBytes(), 0, text.getLength(), encoding);
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
    return new Text(value);
}
The core line is String line = new String(text.getBytes(), 0, text.getLength(), "GBK"); where text is of type Text.

In contrast, String line = value.toString(); produces garbled characters, and the cause lies in the Text type itself. It is tempting to assume that Text wraps String the way LongWritable wraps long, but there is a crucial difference: Text is a Writable for UTF-8 byte sequences, while a Java String holds Unicode characters. value.toString() therefore always decodes the underlying bytes as UTF-8, so data that was originally GBK-encoded turns into mojibake when read that way.
The correct approach is to take the raw bytes of the input Text value (value.getBytes()) and decode them with String's constructor String(byte[] bytes, int offset, int length, String charsetName), which builds a new String by decoding the specified byte subarray with the given charset.
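To see the difference concretely, here is a small, self-contained sketch in plain Java (no Hadoop dependency; GbkDecodeDemo and decodeGbk are illustrative names) contrasting a GBK decode of the raw bytes with the UTF-8 decode that Text.toString() effectively performs:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GbkDecodeDemo {

    // Decode a byte array holding GBK-encoded text into a Java String,
    // mirroring new String(value.getBytes(), 0, value.getLength(), "GBK").
    // Assumes the JVM ships the GBK charset (standard JDKs do).
    static String decodeGbk(byte[] bytes, int length) {
        return new String(bytes, 0, length, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        // "中文" ("Chinese") encoded as GBK: D6 D0 CE C4 (4 bytes)
        byte[] gbkBytes = "中文".getBytes(Charset.forName("GBK"));

        // Correct: decode with the charset the bytes were written in.
        String ok = decodeGbk(gbkBytes, gbkBytes.length);

        // Wrong: decoding GBK bytes as UTF-8 (what Text.toString() does)
        // yields replacement characters, i.e. mojibake.
        String bad = new String(gbkBytes, StandardCharsets.UTF_8);

        System.out.println(ok.equals("中文"));   // true
        System.out.println(bad.equals("中文"));  // false
    }
}
```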
If the map/reduce output must be in another encoding, you need to implement your own OutputFormat and specify the encoding there, instead of using the default TextOutputFormat, whose line writer hard-codes UTF-8.
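Stripped of the Hadoop API, the essential change in such a custom OutputFormat is that each key/value record is encoded with GBK before the bytes are written to the stream. A minimal plain-Java sketch of that write step (GbkLineWriterDemo and writeGbkLine are illustrative names, not part of the Hadoop API):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;

public class GbkLineWriterDemo {

    static final Charset GBK = Charset.forName("GBK");

    // Mirrors what a custom RecordWriter would do in place of
    // TextOutputFormat's hard-coded UTF-8 write: encode the key,
    // separator, value, and newline with GBK before writing.
    static void writeGbkLine(DataOutputStream out, String key, String value)
            throws IOException {
        out.write(key.getBytes(GBK));
        out.write("\t".getBytes(GBK));
        out.write(value.getBytes(GBK));
        out.write("\n".getBytes(GBK));
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            writeGbkLine(out, "词频", "42");
        }
        // The bytes on "disk" are GBK, so decoding them with GBK round-trips.
        System.out.println(new String(buf.toByteArray(), GBK).equals("词频\t42\n"));  // true
    }
}
```

In a real job, the same encoding call would live inside the write() method of a RecordWriter returned by your OutputFormat implementation.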
That is how a MapReduce program in Hadoop can process GBK-encoded data and output GBK-encoded data: decode the input bytes with the source charset when reading, and use a custom OutputFormat when writing.