In this article, the editor explains how PySpark reads Parquet data. The content is detailed and written from a practical point of view; I hope you find it useful.
Parquet is a columnar storage format developed jointly by Twitter and Cloudera. Compared with row-based storage, its characteristics are: it can skip data that does not match the filter conditions and read only the data that is needed, reducing the amount of I/O; compression encoding reduces disk usage, and more efficient codecs save additional storage space; and reading only the required columns supports vectorized operations and gives better scan performance.
So how do we read and use Parquet data in PySpark? The walkthrough below uses PyCharm in local mode on Linux.
First, import the libraries and configure the environment:
import os
from pyspark import SparkContext, SparkConf
from pyspark.sql.session import SparkSession

os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"  # needs to be set explicitly when multiple Python versions are installed
conf = SparkConf().setAppName('test_parquet')
sc = SparkContext('local', 'test', conf=conf)
spark = SparkSession(sc)
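As a side note, on recent PySpark versions the session is usually created through SparkSession.builder instead of wrapping a SparkContext by hand. A minimal sketch (the master and app name here are only examples, not from the original setup):

from pyspark.sql import SparkSession

# Build (or reuse) a local SparkSession; 'test_parquet' is just an illustrative app name
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("test_parquet") \
    .getOrCreate()
sc = spark.sparkContext  # the underlying SparkContext, if it is still needed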
Then use Spark to read the file and obtain the data as a DataFrame (host:port are the HDFS host and port):
parquetFile = r"hdfs://host:port/Felix_test/test_data.parquet"
df = spark.read.parquet(parquetFile)
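For local testing without HDFS, a plain file path works the same way, and selecting columns and filtering early lets Parquet's column pruning and predicate pushdown cut down the data actually read from disk, as described above. A small sketch (the local path and column names below are made up for illustration):

# Illustrative local path and column names, not from the original article
localFile = r"/tmp/test_data.parquet"
df_local = spark.read.parquet(localFile)

# Select only the needed columns and filter early, so Parquet reads less data from disk
subset = df_local.select("id", "name").filter(df_local["id"] > 100)
subset.show(5)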
Once loaded, the DataFrame offers several common methods, for example:
1. df.first(): returns the first record as a Row
print(df.first())
2. df.columns: the list of column names
3. df.count(): the number of rows
4. df.toPandas(): converts the Spark DataFrame to a pandas DataFrame
5. df.show(): prints the table data; df.show(n) prints only the first n rows
6. type(df): shows the data type
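Putting the methods above together, a short usage sketch on the df loaded earlier might look like this:

print(df.first())    # first record as a Row
print(df.columns)    # list of column names
print(df.count())    # number of rows
df.show(5)           # print the first 5 rows as a table
pdf = df.toPandas()  # convert to a pandas DataFrame (collects all rows to the driver)
print(type(df))      # <class 'pyspark.sql.dataframe.DataFrame'>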
This is how PySpark reads Parquet data. If you have similar questions, hopefully the analysis above helps clear them up; to learn more, feel free to follow the industry information channel.