[PySpark] The regexp_replace Function


히또아빠 2023. 1. 17. 16:27

To replace or otherwise transform string values in a PySpark DataFrame, you can use SQL string functions such as regexp_replace(), translate(), and overlay(). Of these, regexp_replace() replaces every substring of a string column that matches a regular expression with another string, so you can derive a new string column from an existing one.

To demonstrate, first create a DataFrame. Its columns are a unique identifier, sex plus region, and date of birth.

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]").appName("regexp_replace").getOrCreate()
address = [("RJFK-SLFKW-DFG1T","M BS", "1990-01-13"),
    ("aedw-dg93r-d62g1","W SE", "2000-11-30"),
    ("DFGE-FD23k-DA4G1", "M GJ", "1999-10-08")]
df = spark.createDataFrame(address, ["id", "demo", "birth"])
df.show()

#+----------------+----+----------+
#|              id|demo|     birth|
#+----------------+----+----------+
#|RJFK-SLFKW-DFG1T|M BS|1990-01-13|
#|aedw-dg93r-d62g1|W SE|2000-11-30|
#|DFGE-FD23k-DA4G1|M GJ|1999-10-08|
#+----------------+----+----------+

df.printSchema()

#root
# |-- id: string (nullable = true)
# |-- demo: string (nullable = true)
# |-- birth: string (nullable = true)

As a simple first example, remove the "-" characters from id.

from pyspark.sql.functions import regexp_replace

# Show the result without reassigning df, so the original id values stay available below
df.withColumn("id", regexp_replace("id", "-", "")).show()

#+--------------+----+----------+
#|            id|demo|     birth|
#+--------------+----+----------+
#|RJFKSLFKWDFG1T|M BS|1990-01-13|
#|aedwdg93rd62g1|W SE|2000-11-30|
#|DFGEFD23kDA4G1|M GJ|1999-10-08|
#+--------------+----+----------+
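The second argument of regexp_replace() is a regular expression, not a literal string, so it can help to check a pattern locally before applying it to a column. The following is a minimal sketch that uses Python's standard `re` module as a stand-in. It assumes that Spark's Java regex engine treats simple patterns like these the same way, and the dotted date value is a hypothetical input that does not appear in the DataFrame.

```python
import re

# Sample id values copied from the DataFrame above.
ids = ["RJFK-SLFKW-DFG1T", "aedw-dg93r-d62g1", "DFGE-FD23k-DA4G1"]

# Same pattern and replacement as regexp_replace("id", "-", "").
cleaned = [re.sub("-", "", value) for value in ids]
print(cleaned)

# Hypothetical value: an unescaped "." matches any character, so it must be
# escaped to remove only the literal dots.
dotted = "1990.01.13"
print(re.sub(".", "", dotted))    # every character matches and is removed
print(re.sub(r"\.", "", dotted))  # only the dots are removed
```

A hyphen has no special meaning outside a character class, which is why "-" works unescaped in the example above.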

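Next, change M to Man and W to Woman in demo. Here startswith tests whether the string begins with the given prefix, and endswith tests whether it ends with the given suffix.

The results go into new sex and location columns, which split then reduces to a single word each.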

from pyspark.sql.functions import when, split

df = df.withColumn("sex", \
    when(df.demo.startswith("M"), regexp_replace(df.demo, "M", "Man"))\
    .when(df.demo.startswith("W"), regexp_replace(df.demo, "W", "Woman")))\
    .withColumn("location",\
    when(df.demo.endswith("BS"), regexp_replace(df.demo, "BS", "Busan"))\
    .when(df.demo.endswith("SE"), regexp_replace(df.demo, "SE", "Seoul"))\
    .when(df.demo.endswith("GJ"), regexp_replace(df.demo, "GJ", "Gwangju"))\
    .otherwise(df.demo))

df.show()

#+----------------+----+----------+--------+---------+
#|              id|demo|     birth|     sex| location|
#+----------------+----+----------+--------+---------+
#|RJFK-SLFKW-DFG1T|M BS|1990-01-13|  Man BS|  M Busan|
#|aedw-dg93r-d62g1|W SE|2000-11-30|Woman SE|  W Seoul|
#|DFGE-FD23k-DA4G1|M GJ|1999-10-08|  Man GJ|M Gwangju|
#+----------------+----+----------+--------+---------+
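Note that a pattern such as "M" replaces every M in the value, not only the leading one. The startswith check only decides whether the replacement runs at all. With the sample data this makes no difference, but anchoring the pattern with "^" or "$" would be safer if a region code could contain the same letter. The following is a minimal sketch using Python's `re` module as a stand-in for the regex behavior. The value "M MS" is hypothetical and does not appear in the DataFrame.

```python
import re

# Hypothetical demo value in which the sex code letter also appears in the region code.
demo = "M MS"

# Unanchored: every "M" is replaced, as with regexp_replace(df.demo, "M", "Man").
unanchored = re.sub("M", "Man", demo)

# Anchored with "^": only an "M" at the start of the string is replaced.
anchored = re.sub("^M", "Man", demo)

# Anchored with "$": only a trailing "SE" is replaced.
suffix = re.sub("SE$", "Seoul", "W SE")

print(unanchored)
print(anchored)
print(suffix)
```

In PySpark the anchored patterns would be passed the same way, for example regexp_replace(df.demo, "^M", "Man").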

df = df.withColumn("sex", split(df['sex'], ' ').getItem(0))\
        .withColumn("location", split(df['location'], ' ').getItem(1))

df.show()

#+----------------+----+----------+-----+--------+
#|              id|demo|     birth|  sex|location|
#+----------------+----+----------+-----+--------+
#|RJFK-SLFKW-DFG1T|M BS|1990-01-13|  Man|   Busan|
#|aedw-dg93r-d62g1|W SE|2000-11-30|Woman|   Seoul|
#|DFGE-FD23k-DA4G1|M GJ|1999-10-08|  Man| Gwangju|
#+----------------+----+----------+-----+--------+

from pyspark.sql.functions import lit, col, concat, lower

# Remove the hyphens from id, lowercase it, and prepend the literal prefixes "a_" and "b_"
df = df.withColumn("id", regexp_replace(df.id, "-", ""))\
    .withColumn("a", lit("a_")).withColumn("b", lit("b_"))\
    .withColumn("id", lower(col("id")))\
    .withColumn("id", concat(col("a"), col("b"), col("id")))

df.show()

#+------------------+----+----------+-----+--------+---+---+
#|                id|demo|     birth|  sex|location|  a|  b|
#+------------------+----+----------+-----+--------+---+---+
#|a_b_rjfkslfkwdfg1t|M BS|1990-01-13|  Man|   Busan| a_| b_|
#|a_b_aedwdg93rd62g1|W SE|2000-11-30|Woman|   Seoul| a_| b_|
#|a_b_dfgefd23kda4g1|M GJ|1999-10-08|  Man| Gwangju| a_| b_|
#+------------------+----+----------+-----+--------+---+---+
