
Apache Spark - Explode Example (using DataFrames)

I came across PySpark's explode function, which at first looked somewhat counter-intuitive to me.

So I decided to write this blog post to document it for myself. Here is a small PySpark program that demonstrates how the explode function works.


from pyspark.sql import Row
from pyspark.sql import SparkSession
from pyspark.sql import functions as funcs

spark = SparkSession.builder.appName("SparkSQL").getOrCreate()

df = spark.createDataFrame([
    Row(id=1, int_list_column=[1, 2, 3], map_column={"a": "b", "c": "d"}),
    Row(id=2, int_list_column=[1, 2, 3, 50, 70], map_column={"a": "s", "x": "y"}),
    Row(id=3, int_list_column=[40, 60, 2, 3], map_column={10: 100}),
    Row(id=4, int_list_column=[], map_column={50: 5, 10: 1}),
])

df.show()

# Explode the list column: one output row per list element.
df.select(funcs.explode(df.int_list_column).alias("int_column")).show()

# Explode the map column: one output row per key-value pair.
df.select(funcs.explode(df.map_column).alias("Key1", "Value1")).show()

The output of this small program is as follows.

+---+-----------------+------------------+
| id|  int_list_column|        map_column|
+---+-----------------+------------------+
|  1|        [1, 2, 3]|  {a -> b, c -> d}|
|  2|[1, 2, 3, 50, 70]|  {x -> y, a -> s}|
|  3|   [40, 60, 2, 3]|       {10 -> 100}|
|  4|               []|{50 -> 5, 10 -> 1}|
+---+-----------------+------------------+

+----------+
|int_column|
+----------+
|         1|
|         2|
|         3|
|         1|
|         2|
|         3|
|        50|
|        70|
|        40|
|        60|
|         2|
|         3|
+----------+

+----+------+
|Key1|Value1|
+----+------+
|   a|     b|
|   c|     d|
|   x|     y|
|   a|     s|
|  10|   100|
|  50|     5|
|  10|     1|
+----+------+ 

We see that what the explode function does is flatten the contents of the list column or the map column, returning one row in the resulting DataFrame for each value (in the list case) or for each key-value pair (in the map case). Note also that the empty list of id=4 produced no output rows at all.
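Two things worth knowing in addition (a minimal sketch, reusing the df from above): explode can sit next to ordinary columns in the same select, and if you want to keep rows with empty lists like id=4, the related explode_outer function emits a null for them instead of dropping them.

# Keep the id column next to the exploded values; id=4 (empty list)
# contributes no rows here, because explode skips empty lists.
df.select("id", funcs.explode(df.int_list_column).alias("int_column")).show()

# explode_outer keeps id=4 as well, pairing it with a null value.
df.select("id", funcs.explode_outer(df.int_list_column).alias("int_column")).show()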
