Remembering the 7 PySpark DataFrame Join Types in One Article

独家雨天 · Published 2021-09-02 12:20:47

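I recently came across a good article. It is simple, but its illustrations make the PySpark join types and their actual results easy to remember. The original English article is "Introduction to Pyspark join types - Blog | luminousmen".

The examples use the following two DataFrames: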

heroes_data = [
    ('Deadpool', 3), 
    ('Iron man', 1),
    ('Groot', 7),
]
race_data = [
    ('Kryptonian', 5), 
    ('Mutant', 3), 
    ('Human', 1), 
]
heroes = spark.createDataFrame(heroes_data, ['name', 'id'])
races = spark.createDataFrame(race_data, ['race', 'id'])

Displayed, the two DataFrames look like this:

+--------+---+           +----------+---+
|    name| id|           |      race| id|
+--------+---+           +----------+---+
|Deadpool|  3|           |Kryptonian|  5|
|Iron man|  1|           |    Mutant|  3|
|   Groot|  7|           |     Human|  1|
+--------+---+           +----------+---+

In the illustrations, rows drawn in the same color are the ones that match in the join. Every join below is keyed on the id column.

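Except for the cross join, every example below uses a call of the following form:

heroes.join(races, on='id', how='left').show()

Only the how argument changes, which shows the effect of each join type.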

Cross join (Cartesian product)

This one is easy to understand: every row of heroes is combined with every row of races. If the tables have m and n rows, the result has m*n rows. The join key plays no part.

>>> heroes.crossJoin(races).show()
+--------+---+----------+---+  
|    name| id|      race| id|
+--------+---+----------+---+
|Deadpool|  3|Kryptonian|  5|
|Deadpool|  3|    Mutant|  3|
|Deadpool|  3|     Human|  1|
|Iron man|  1|Kryptonian|  5|
|Iron man|  1|    Mutant|  3|
|Iron man|  1|     Human|  1|
|   Groot|  7|Kryptonian|  5|
|   Groot|  7|    Mutant|  3|
|   Groot|  7|     Human|  1|
+--------+---+----------+---+
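
The m*n row count can be checked without Spark. The following is a minimal plain-Python sketch of the cross join semantics, using the same sample tuples; cross_join is a hypothetical helper written for illustration and says nothing about how Spark executes the join.

```python
from itertools import product

heroes_data = [('Deadpool', 3), ('Iron man', 1), ('Groot', 7)]
race_data = [('Kryptonian', 5), ('Mutant', 3), ('Human', 1)]

def cross_join(left, right):
    """Pair every left row with every right row, ignoring any join key."""
    return [l + r for l, r in product(left, right)]

rows = cross_join(heroes_data, race_data)
print(len(rows))  # 9, i.e. len(heroes_data) * len(race_data)
print(rows[0])    # ('Deadpool', 3, 'Kryptonian', 5)
```

Because the row count is a product, a cross join between two large tables grows quickly, which is worth keeping in mind before calling crossJoin on real data.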

Inner join

Returns only the rows whose id appears in both heroes and races.

>>> heroes.join(races, on='id', how='inner').show()
+---+--------+------+ 
| id|    name|  race|
+---+--------+------+
|  1|Iron man| Human|
|  3|Deadpool|Mutant|
+---+--------+------+

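Left join / Left outer join

left and leftouter are aliases. The result contains every row of heroes, together with the matching data from races. Where there is no match, the right-hand columns are null. In other words, it is the inner join result plus all unmatched rows from the left table.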

>>> heroes.join(races, on='id', how='left').show()
>>> heroes.join(races, on='id', how='leftouter').show()
+---+--------+------+
| id|    name|  race|
+---+--------+------+
|  7|   Groot|  null|
|  1|Iron man| Human|
|  3|Deadpool|Mutant|
+---+--------+------+

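Right join / Right outer join

The mirror image of the left outer join: every row of races is kept, and the left-hand columns are null where there is no match.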

>>> heroes.join(races, on='id', how='right').show()
>>> heroes.join(races, on='id', how='rightouter').show()
+---+--------+----------+ 
| id|    name|      race|
+---+--------+----------+
|  5|    null|Kryptonian|
|  1|Iron man|     Human|
|  3|Deadpool|    Mutant|
+---+--------+----------+

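Full outer join

outer and full are aliases. The result contains all rows from both heroes and races, including the rows matched on both sides. Where one side has no match, that side's columns are null.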

>>> heroes.join(races, on='id', how='outer').show()
>>> heroes.join(races, on='id', how='full').show()
+---+--------+----------+
| id|    name|      race|
+---+--------+----------+
|  7|   Groot|      null|
|  5|    null|Kryptonian|
|  1|Iron man|     Human|
|  3|Deadpool|    Mutant|
+---+--------+----------+
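
The four key-based joins so far differ only in which ids they keep. A rough plain-Python model of that idea follows; join_on_id is a hypothetical helper, it assumes each id appears at most once per side (true for the sample data), and it is a sketch of the semantics rather than of Spark's implementation.

```python
heroes_data = [('Deadpool', 3), ('Iron man', 1), ('Groot', 7)]
race_data = [('Kryptonian', 5), ('Mutant', 3), ('Human', 1)]

def join_on_id(left, right, how='inner'):
    """Model inner/left/right/full joins on id; returns (id, name, race) rows.

    Assumes each id appears at most once per side, as in the sample data.
    """
    names = {id_: name for name, id_ in left}
    races = {id_: race for race, id_ in right}
    if how == 'inner':
        ids = names.keys() & races.keys()   # ids present on both sides
    elif how == 'left':
        ids = names.keys()                  # every left id
    elif how == 'right':
        ids = races.keys()                  # every right id
    elif how == 'full':
        ids = names.keys() | races.keys()   # ids present on either side
    else:
        raise ValueError(f'unsupported join type: {how}')
    # dict.get returns None for the missing side, mirroring Spark's null.
    return sorted((i, names.get(i), races.get(i)) for i in ids)

for how in ('inner', 'left', 'right', 'full'):
    print(how, join_on_id(heroes_data, race_data, how))
```

The rows are sorted by id only to make the printed result stable; Spark itself gives no ordering guarantee for join output.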

Left semi-join

Think of it as an inner join that keeps only the left table's columns, for the rows that found a match.

>>> heroes.join(races, on='id', how='leftsemi').show()
+---+--------+
| id|    name|
+---+--------+
|  1|Iron man|
|  3|Deadpool|
+---+--------+

Left anti join

The opposite of the left semi-join: it returns the rows of the left table that have no match.

>>> heroes.join(races, on='id', how='leftanti').show()
+---+-----+
| id| name|
+---+-----+
|  7|Groot|
+---+-----+
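
Semi and anti joins behave like a membership filter on the left table, and together they split it into two disjoint parts. Here is a small plain-Python sketch of that relationship; left_semi and left_anti are hypothetical helpers for illustration, not Spark APIs.

```python
heroes_data = [('Deadpool', 3), ('Iron man', 1), ('Groot', 7)]
race_data = [('Kryptonian', 5), ('Mutant', 3), ('Human', 1)]

# Only the right table's keys matter; none of its other columns reach the result.
race_ids = {id_ for _, id_ in race_data}

def left_semi(left, right_ids):
    """Keep left rows whose id has a match on the right."""
    return [(id_, name) for name, id_ in left if id_ in right_ids]

def left_anti(left, right_ids):
    """Keep left rows whose id has no match on the right."""
    return [(id_, name) for name, id_ in left if id_ not in right_ids]

semi = left_semi(heroes_data, race_ids)
anti = left_anti(heroes_data, race_ids)
print(semi)  # [(3, 'Deadpool'), (1, 'Iron man')]
print(anti)  # [(7, 'Groot')]
```

Every left row lands in exactly one of the two results, so the semi and anti outputs together always add up to the original left table.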

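Additional notes:

In a join, neither the left nor the right side may be None. Either side may be an empty table, but it must carry a schema, and that schema must include the join key given in on.

Creating an empty DataFrame in PySpark

An empty DataFrame with an empty schema
An empty DataFrame with a schema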

from pyspark.sql.types import IntegerType, StringType, StructField, StructType

def create_empty_df_without_schema():
    # Create an empty RDD
    emp_RDD = spark.sparkContext.emptyRDD()
    # Create an empty schema (no columns)
    columns = StructType([])
    return spark.createDataFrame(data=emp_RDD,
                                 schema=columns)

def create_empty_df_with_schema():
    # Same column names as heroes, so 'id' is available as the join key
    columns = StructType([
        StructField('name', StringType(), True),
        StructField('id', IntegerType(), True),
    ])
    # An empty list is enough as the data source; no empty RDD is needed here
    return spark.createDataFrame(data=[],
                                 schema=columns)