This article explains what an RDD in Spark essentially is. The idea is simple and clear; I hope it helps resolve your doubts as we work through it together.
What is the essence of RDD?
An RDD is essentially a function, and a transformation of an RDD is nothing more than nesting one function inside another. As I see it, RDDs fall into two categories:
Input RDDs, such as KafkaRDD and JdbcRDD
Transformation RDDs, such as MapPartitionsRDD
Let's take the following code as an example:
Sc.textFile ("abc.log"). Map (). SaveAsTextFile (")
textFile builds a NewHadoopRDD
map then wraps it in a MapPartitionsRDD
saveAsTextFile triggers the actual execution of the processing logic
So an RDD is just a wrapper around a function, and once that function has processed the data we get a dataset we call an RDD (a virtual one, as will be explained later).
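To make this concrete, here is a small sketch you can paste into spark-shell (where sc is predefined); the file path, the map body, and the output directory are placeholders of my own, and toDebugString just prints the lineage so you can see which RDDs the calls produced:

val lineLengths = sc.textFile("abc.log")   // Hadoop-backed source RDD over the file
  .map(line => line.length)                // MapPartitionsRDD wrapping the supplied function

println(lineLengths.toDebugString)         // prints the MapPartitionsRDD -> Hadoop source RDD chain
lineLengths.saveAsTextFile("out")          // the action that actually triggers the job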
NewHadoopRDD is the data source: each partition is responsible for fetching data, and it does so by pulling one record at a time through iterator.next. Suppose that at some moment a record A is fetched; it is immediately processed by the function in map into B (the transformation is done), and then B starts being written to HDFS. The same happens to every other record. So for the whole process:
In theory, the number of records a MapPartitionsRDD actually holds in memory at any moment equals the number of its partitions, which is a very small number.
NewHadoopRDD holds a bit more, because it is the data source and reads files: assuming the read buffer is 1 MB, at most partitionNum * 1 MB of data is in memory.
The same goes for saveAsTextFile: writing files to HDFS needs a buffer, so at most buffer * partitionNum of data is held in memory.
So the whole thing is really a streaming process in which each record is processed, one at a time, by the functions wrapped by the chain of RDDs.
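This streaming behaviour is easy to see even without Spark; here is a minimal, Spark-free sketch using plain Scala iterators, which are lazy in the same way as the per-partition iterators described above (the record values and println messages are just for illustration):

val source: Iterator[String] = Iterator("a", "b", "c")        // stands in for one partition's iterator.next stream

val transformed: Iterator[String] = source.map { record =>
  println("transforming " + record)                           // A is turned into B, one record at a time
  record.toUpperCase
}

transformed.foreach(record => println("writing " + record))   // stands in for the write in saveAsTextFile
// the output interleaves "transforming" and "writing": each record flows through the whole chain before the next is read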
I keep saying the functions are nested. How do we know they are nested?
If you write code like this:
sc.textFile("abc.log").map(...).map(...) ... .map(...).saveAsTextFile("...")
with thousands of map calls chained together, the job is quite likely to fail with a stack overflow. Why? Because the functions end up nested too deeply.
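A hedged sketch of that failure mode in spark-shell (toy data and counts of my own choosing): folding thousands of map calls onto one RDD builds a single very long chain of nested functions, and evaluating it can end in a StackOverflowError.

val start = sc.parallelize(1 to 100)
val deep  = (1 to 5000).foldLeft(start)((rdd, _) => rdd.map(_ + 1))

// deep.count() may blow the stack; a common mitigation is to cut the lineage
// periodically, e.g. with checkpoint() after calling sc.setCheckpointDir(...).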
By the logic above, memory usage should be tiny: running 100 TB of data through 10 GB of memory should not be hard. So why does Spark so often die from memory problems? Let's read on.
What is the nature of Shuffle?
This is why the job has to be split into Stages. Each Stage is exactly what I described above: a set of data is processed by N nested functions (that is, your transformations). When a Shuffle is encountered, the chain is cut. A Shuffle is, in essence, just writing the data temporarily to disk, grouped according to some rule; it is equivalent to performing a saveAsTextFile action, except that it saves to the local disk. The next Stage after the cut then uses those local disk files as its data source and repeats the process described above.
Let's describe it again:
The so-called Shuffle simply splits the processing flow in two: it appends a disk-writing action to the first half of the split (call it Stage M), and makes the data source of the second half (Stage M+1) the disk files that Stage M wrote. Within each Stage, everything works as I described above: each record is processed by N nested functions and is finally stored through the action the user specified.
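A small sketch of where the cut lands, assuming sc from spark-shell and a hypothetical space-separated log file: reduceByKey needs a Shuffle, so Spark splits this job into Stage M (read and map, with the output written to local disk) and Stage M+1 (read those files, reduce, then save):

sc.textFile("abc.log")                    // Stage M: the data source
  .map(line => (line.split(" ")(0), 1))   // Stage M: narrow transformation, still streaming
  .reduceByKey(_ + _)                     // Shuffle boundary: map-side output is spilled to local disk
  .saveAsTextFile("counts")               // Stage M+1: consumes the shuffle files, then writes the result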
Why does Shuffle so easily cause Spark to fall over?
As mentioned earlier, Shuffle quietly adds something like a saveAsLocalDiskFile action for you. But writing to disk is expensive, so Spark keeps as much data in memory as possible and writes it out to files in batches; reading the shuffle files back also consumes memory. Holding data in memory raises a problem: how much memory will, say, 10,000 records take up? That is actually very hard to predict, so if you are not careful it easily leads to memory overflow. Frankly, there is not much to be done about that.
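For what it's worth, the buffers involved are tunable; a hedged sketch of the relevant settings (the values are illustrative, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("shuffle-buffer-sketch")
  .set("spark.shuffle.file.buffer", "64k")      // per-writer in-memory buffer before shuffle data hits disk
  .set("spark.reducer.maxSizeInFlight", "48m")  // memory used to fetch shuffle blocks on the read side
  .set("spark.memory.fraction", "0.6")          // share of the heap shared by execution and storage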
What does Cache/Persist actually do?
It essentially adds a saveAsMemoryBlockFile-style action to a Stage, so that the next time the data is needed it does not have to be recomputed. The in-memory data represents the result of some RDD's processing. This is where Spark's reputation as an in-memory computing engine comes from: in MR you would put intermediate results in HDFS, whereas Spark lets you keep them in memory.
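A minimal sketch of what that buys you, again assuming sc from spark-shell and a placeholder file: the parsed RDD is computed once, kept in memory, and reused by the second action instead of being recomputed from the file.

import org.apache.spark.storage.StorageLevel

val parsed = sc.textFile("abc.log")
  .map(line => line.split(" ")(0))
  .persist(StorageLevel.MEMORY_ONLY)      // equivalent to cache()

val total     = parsed.count()            // first action: computes the partitions and caches them
val distincts = parsed.distinct().count() // second action: served from the cached blocks, no re-read of the file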