[PySpark] Difference between explode and explode_outer


히또아빠 2023. 7. 7. 11:39

I created a table with the following schema. It has one array column, an_array, and one map column, a_map, which holds key-value pairs.

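from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, explode_outer

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, ["foo", "bar"], {"x": 1.0}), (2, [], {}), (3, None, None)],
    ("id", "an_array", "a_map")
)
df.show()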

+---+----------+----------+
| id|  an_array|     a_map|
+---+----------+----------+
|  1|[foo, bar]|{x -> 1.0}|
|  2|        []|        {}|
|  3|      null|      null|
+---+----------+----------+

df.printSchema()

root
 |-- id: long (nullable = true)
 |-- an_array: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- a_map: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = true)

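When applied to an array column, explode returns one row per element and drops any row whose array is null or empty, whereas explode_outer keeps those rows and fills the exploded column with null. It works the same way on a map column, where the output is a key column and a value column.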

df.select("id", "a_map", explode_outer("an_array")).show()
+---+----------+----+
| id|     a_map| col|
+---+----------+----+
|  1|{x -> 1.0}| foo|
|  1|{x -> 1.0}| bar|
|  2|        {}|null|
|  3|      null|null|
+---+----------+----+

df.select("id", "a_map", explode("an_array")).show()
+---+----------+---+
| id|     a_map|col|
+---+----------+---+
|  1|{x -> 1.0}|foo|
|  1|{x -> 1.0}|bar|
+---+----------+---+

df.select("id", "an_array", explode_outer("a_map")).show()
+---+----------+----+-----+
| id|  an_array| key|value|
+---+----------+----+-----+
|  1|[foo, bar]|   x|  1.0|
|  2|        []|null| null|
|  3|      null|null| null|
+---+----------+----+-----+

df.select("id", "an_array", explode("a_map")).show()
+---+----------+---+-----+
| id|  an_array|key|value|
+---+----------+---+-----+
|  1|[foo, bar]|  x|  1.0|
+---+----------+---+-----+
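
The difference shown in the outputs above can be sketched in plain Python. This is only a rough model of the row semantics, not how Spark implements them; the helpers explode_rows and explode_map_rows are hypothetical names made up for this illustration, and Python's None stands in for null.

```python
# Plain-Python model of the row semantics; not Spark's implementation.
# explode_rows and explode_map_rows are hypothetical helpers for illustration.

def explode_rows(rows, col, outer=False):
    """Yield (id, element) pairs for an array-like column.

    outer=False: rows whose array is None or empty yield nothing.
    outer=True:  such rows yield a single (id, None) pair instead.
    """
    for row in rows:
        items = row[col]
        if items:                      # non-empty list
            for item in items:
                yield (row["id"], item)
        elif outer:                    # None or empty: keep the row with a null
            yield (row["id"], None)


def explode_map_rows(rows, col, outer=False):
    """Yield (id, key, value) triples for a map-like column."""
    for row in rows:
        items = row[col]
        if items:                      # non-empty dict
            for key, value in items.items():
                yield (row["id"], key, value)
        elif outer:                    # None or empty: key and value are both null
            yield (row["id"], None, None)


# Same three records as the DataFrame in this post.
rows = [
    {"id": 1, "an_array": ["foo", "bar"], "a_map": {"x": 1.0}},
    {"id": 2, "an_array": [], "a_map": {}},
    {"id": 3, "an_array": None, "a_map": None},
]

array_inner = list(explode_rows(rows, "an_array"))
array_outer = list(explode_rows(rows, "an_array", outer=True))
map_inner = list(explode_map_rows(rows, "a_map"))
map_outer = list(explode_map_rows(rows, "a_map", outer=True))

print(array_inner)  # [(1, 'foo'), (1, 'bar')]
print(array_outer)  # [(1, 'foo'), (1, 'bar'), (2, None), (3, None)]
print(map_inner)    # [(1, 'x', 1.0)]
print(map_outer)    # [(1, 'x', 1.0), (2, None, None), (3, None, None)]
```

In this model ids 2 and 3 disappear without outer=True, which matches the explode outputs above, so explode_outer is the safer choice when every input id must still be present after the operation.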
