【HiveSQL】join关联on和where的区别及效率对比
假设数据库系统支持谓词下推的前提下,内连接:内连接的两个执行计划中,对t2表都使用了,对t1表都使用了,因此可以说,内连接中where和on在执行效率上没区别。外连接:还是拿左外连接来说,右表相关的条件会使用谓词下推,而左表是否会提前过滤数据,取决于where还是on以及筛选条件是否与左表相关,1)当为on时,左表的数据必须全量读取,此时效率的差别主要取决于左表的数据量。2)当为where时,如果
测试环境:hive on spark
spark版本:3.3.1
一、执行时机
sql连接中,where属于过滤条件,用于对join的结果集进行过滤,所以理论上的执行时机在join之后。on属于关联条件,决定了满足什么样条件的数据才可以被关联到一起,因此理论上的执行时机在join时。
但是,大多数数据库系统为了提升效率都采用了一些优化技术,思想都是将where中的筛选条件或是on中的关联条件尽可能的提前到数据源侧进行筛选,目的是减少参与关联的数据量。因此它们实际的执行时机大多时候和理论上的不同。
二、对结果集的影响
内连接中,条件放在where或者on中对结果集无影响。
外连接中(以左外连接为例),因为左外连接是完全保留左表记录,on在join时生效,因此最终的结果集也会保留左表的全部记录。where是对join后的结果集进行操作,所以会过滤掉一些数据导致二者的结果集不相同。
三、效率对比
测试数据量如下:
poi_data.poi_res表:数据量8300W+
bi_report.mon_ronghe_pv表:分区表,总数据量120E+,本次采用分区20240522的数据关联,数据量5900W+,其中 bid like ‘1%’ & pv>100 的数据量120W+
两表的关联字段均无重复值。
1.内连接
1)on
select
t1.bid,
t1.name,
t1.point_x,
t1.point_y,
t2.pv
from poi_data.poi_res t1
join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
on t1.bid=t2.bid
and t2.bid like '1%' and t2.pv>100;
== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
CollectLimit (17)
+- * Project (16)
+- * SortMergeJoin Inner (15)
:- * Sort (6)
: +- AQEShuffleRead (5)
: +- ShuffleQueryStage (4), Statistics(sizeInBytes=5.3 GiB, rowCount=4.57E+7)
: +- Exchange (3)
: +- * Filter (2)
: +- Scan hive poi_data.poi_res (1)
+- * Sort (14)
+- AQEShuffleRead (13)
+- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
+- Exchange (11)
+- * Project (10)
+- * Filter (9)
+- * ColumnarToRow (8)
+- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
CollectLimit (27)
+- Project (26)
+- SortMergeJoin Inner (25)
:- Sort (20)
: +- Exchange (19)
: +- Filter (18)
: +- Scan hive poi_data.poi_res (1)
+- Sort (24)
+- Exchange (23)
+- Project (22)
+- Filter (21)
+- Scan parquet bi_report.mon_ronghe_pv (7)
(1) Scan hive poi_data.poi_res
Output [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: [bid#297, name#299, point_x#316, point_y#317], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#297, type#298, name#299, address#300, phone#301, alias#302, post_code#303, catalog_id#304, c..., Partition Cols: []]
(2) Filter [codegen id : 1]
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297))
(3) Exchange
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: hashpartitioning(bid#297, 600), ENSURE_REQUIREMENTS, [plan_id=774]
(4) ShuffleQueryStage
Output [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: 0
(5) AQEShuffleRead
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: coalesced
(6) Sort [codegen id : 3]
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: [bid#297 ASC NULLS FIRST], false, 0
(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#334, pv#335, event_day#338]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#338), (event_day#338 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>
(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#334, pv#335, event_day#338]
(9) Filter [codegen id : 2]
Input [3]: [bid#334, pv#335, event_day#338]
Condition : (((isnotnull(bid#334) AND isnotnull(pv#335)) AND StartsWith(bid#334, 1)) AND (pv#335 > 100))
(10) Project [codegen id : 2]
Output [2]: [bid#334, pv#335]
Input [3]: [bid#334, pv#335, event_day#338]
(11) Exchange
Input [2]: [bid#334, pv#335]
Arguments: hashpartitioning(bid#334, 600), ENSURE_REQUIREMENTS, [plan_id=799]
(12) ShuffleQueryStage
Output [2]: [bid#334, pv#335]
Arguments: 1
(13) AQEShuffleRead
Input [2]: [bid#334, pv#335]
Arguments: coalesced
(14) Sort [codegen id : 4]
Input [2]: [bid#334, pv#335]
Arguments: [bid#334 ASC NULLS FIRST], false, 0
(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#297]
Right keys [1]: [bid#334]
Join condition: None
(16) Project [codegen id : 5]
Output [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Input [6]: [bid#297, name#299, point_x#316, point_y#317, bid#334, pv#335]
(17) CollectLimit
Input [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Arguments: 1000
(18) Filter
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297))
(19) Exchange
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: hashpartitioning(bid#297, 600), ENSURE_REQUIREMENTS, [plan_id=759]
(20) Sort
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: [bid#297 ASC NULLS FIRST], false, 0
(21) Filter
Input [3]: [bid#334, pv#335, event_day#338]
Condition : (((isnotnull(bid#334) AND isnotnull(pv#335)) AND StartsWith(bid#334, 1)) AND (pv#335 > 100))
(22) Project
Output [2]: [bid#334, pv#335]
Input [3]: [bid#334, pv#335, event_day#338]
(23) Exchange
Input [2]: [bid#334, pv#335]
Arguments: hashpartitioning(bid#334, 600), ENSURE_REQUIREMENTS, [plan_id=760]
(24) Sort
Input [2]: [bid#334, pv#335]
Arguments: [bid#334 ASC NULLS FIRST], false, 0
(25) SortMergeJoin
Left keys [1]: [bid#297]
Right keys [1]: [bid#334]
Join condition: None
(26) Project
Output [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Input [6]: [bid#297, name#299, point_x#316, point_y#317, bid#334, pv#335]
(27) CollectLimit
Input [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Arguments: 1000
(28) AdaptiveSparkPlan
Output [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Arguments: isFinalPlan=true
从物理执行计划可以看到第(2)步中的Filter使用条件Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297))
在t1表读取源数据时进行了过滤,在第(7)步中通过谓词下推在t2表scan源数据时使用条件PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
进行了过滤,两表都是在数据源侧进行的数据过滤,减少了shuffle和参与join的数据量。
2)where
select
t1.bid,
t1.name,
t1.point_x,
t1.point_y,
t2.pv
from poi_data.poi_res t1
join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
on t1.bid=t2.bid
where t2.bid like '1%' and t2.pv>100;
== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
CollectLimit (17)
+- * Project (16)
+- * SortMergeJoin Inner (15)
:- * Sort (6)
: +- AQEShuffleRead (5)
: +- ShuffleQueryStage (4), Statistics(sizeInBytes=5.3 GiB, rowCount=4.57E+7)
: +- Exchange (3)
: +- * Filter (2)
: +- Scan hive poi_data.poi_res (1)
+- * Sort (14)
+- AQEShuffleRead (13)
+- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
+- Exchange (11)
+- * Project (10)
+- * Filter (9)
+- * ColumnarToRow (8)
+- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
CollectLimit (27)
+- Project (26)
+- SortMergeJoin Inner (25)
:- Sort (20)
: +- Exchange (19)
: +- Filter (18)
: +- Scan hive poi_data.poi_res (1)
+- Sort (24)
+- Exchange (23)
+- Project (22)
+- Filter (21)
+- Scan parquet bi_report.mon_ronghe_pv (7)
(1) Scan hive poi_data.poi_res
Output [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: [bid#350, name#352, point_x#369, point_y#370], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#350, type#351, name#352, address#353, phone#354, alias#355, post_code#356, catalog_id#357, c..., Partition Cols: []]
(2) Filter [codegen id : 1]
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Condition : (StartsWith(bid#350, 1) AND isnotnull(bid#350))
(3) Exchange
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: hashpartitioning(bid#350, 600), ENSURE_REQUIREMENTS, [plan_id=908]
(4) ShuffleQueryStage
Output [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: 0
(5) AQEShuffleRead
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: coalesced
(6) Sort [codegen id : 3]
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: [bid#350 ASC NULLS FIRST], false, 0
(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#387, pv#388, event_day#391]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#391), (event_day#391 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>
(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#387, pv#388, event_day#391]
(9) Filter [codegen id : 2]
Input [3]: [bid#387, pv#388, event_day#391]
Condition : (((isnotnull(bid#387) AND isnotnull(pv#388)) AND StartsWith(bid#387, 1)) AND (pv#388 > 100))
(10) Project [codegen id : 2]
Output [2]: [bid#387, pv#388]
Input [3]: [bid#387, pv#388, event_day#391]
(11) Exchange
Input [2]: [bid#387, pv#388]
Arguments: hashpartitioning(bid#387, 600), ENSURE_REQUIREMENTS, [plan_id=933]
(12) ShuffleQueryStage
Output [2]: [bid#387, pv#388]
Arguments: 1
(13) AQEShuffleRead
Input [2]: [bid#387, pv#388]
Arguments: coalesced
(14) Sort [codegen id : 4]
Input [2]: [bid#387, pv#388]
Arguments: [bid#387 ASC NULLS FIRST], false, 0
(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#350]
Right keys [1]: [bid#387]
Join condition: None
(16) Project [codegen id : 5]
Output [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Input [6]: [bid#350, name#352, point_x#369, point_y#370, bid#387, pv#388]
(17) CollectLimit
Input [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Arguments: 1000
(18) Filter
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Condition : (StartsWith(bid#350, 1) AND isnotnull(bid#350))
(19) Exchange
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: hashpartitioning(bid#350, 600), ENSURE_REQUIREMENTS, [plan_id=893]
(20) Sort
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: [bid#350 ASC NULLS FIRST], false, 0
(21) Filter
Input [3]: [bid#387, pv#388, event_day#391]
Condition : (((isnotnull(bid#387) AND isnotnull(pv#388)) AND StartsWith(bid#387, 1)) AND (pv#388 > 100))
(22) Project
Output [2]: [bid#387, pv#388]
Input [3]: [bid#387, pv#388, event_day#391]
(23) Exchange
Input [2]: [bid#387, pv#388]
Arguments: hashpartitioning(bid#387, 600), ENSURE_REQUIREMENTS, [plan_id=894]
(24) Sort
Input [2]: [bid#387, pv#388]
Arguments: [bid#387 ASC NULLS FIRST], false, 0
(25) SortMergeJoin
Left keys [1]: [bid#350]
Right keys [1]: [bid#387]
Join condition: None
(26) Project
Output [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Input [6]: [bid#350, name#352, point_x#369, point_y#370, bid#387, pv#388]
(27) CollectLimit
Input [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Arguments: 1000
(28) AdaptiveSparkPlan
Output [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Arguments: isFinalPlan=true
物理执行计划没有变化,因此可以说,当数据库支持谓词下推时,筛选条件用where还是on没有区别,数据库都会在数据源侧进行数据过滤,减少参与关联的数据量。
2.外连接
1)on
select
t1.bid,
t1.name,
t1.point_x,
t1.point_y,
t2.pv
from poi_data.poi_res t1
left join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
on t1.bid=t2.bid
and t2.bid like '1%' and t2.pv>100;
== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
CollectLimit (17)
+- * Project (16)
+- * SortMergeJoin LeftOuter (15)
:- * Sort (6)
: +- AQEShuffleRead (5)
: +- ShuffleQueryStage (4), Statistics(sizeInBytes=36.5 MiB, rowCount=3.07E+5)
: +- Exchange (3)
: +- * LocalLimit (2)
: +- Scan hive poi_data.poi_res (1)
+- * Sort (14)
+- AQEShuffleRead (13)
+- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
+- Exchange (11)
+- * Project (10)
+- * Filter (9)
+- * ColumnarToRow (8)
+- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
CollectLimit (27)
+- Project (26)
+- SortMergeJoin LeftOuter (25)
:- Sort (20)
: +- Exchange (19)
: +- LocalLimit (18)
: +- Scan hive poi_data.poi_res (1)
+- Sort (24)
+- Exchange (23)
+- Project (22)
+- Filter (21)
+- Scan parquet bi_report.mon_ronghe_pv (7)
(1) Scan hive poi_data.poi_res
Output [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: [bid#403, name#405, point_x#422, point_y#423], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#403, type#404, name#405, address#406, phone#407, alias#408, post_code#409, catalog_id#410, c..., Partition Cols: []]
(2) LocalLimit [codegen id : 1]
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: 1000
(3) Exchange
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: hashpartitioning(bid#403, 600), ENSURE_REQUIREMENTS, [plan_id=1043]
(4) ShuffleQueryStage
Output [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: 0
(5) AQEShuffleRead
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: coalesced
(6) Sort [codegen id : 3]
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: [bid#403 ASC NULLS FIRST], false, 0
(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#440, pv#441, event_day#444]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#444), (event_day#444 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>
(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#440, pv#441, event_day#444]
(9) Filter [codegen id : 2]
Input [3]: [bid#440, pv#441, event_day#444]
Condition : (((isnotnull(bid#440) AND isnotnull(pv#441)) AND StartsWith(bid#440, 1)) AND (pv#441 > 100))
(10) Project [codegen id : 2]
Output [2]: [bid#440, pv#441]
Input [3]: [bid#440, pv#441, event_day#444]
(11) Exchange
Input [2]: [bid#440, pv#441]
Arguments: hashpartitioning(bid#440, 600), ENSURE_REQUIREMENTS, [plan_id=1067]
(12) ShuffleQueryStage
Output [2]: [bid#440, pv#441]
Arguments: 1
(13) AQEShuffleRead
Input [2]: [bid#440, pv#441]
Arguments: coalesced
(14) Sort [codegen id : 4]
Input [2]: [bid#440, pv#441]
Arguments: [bid#440 ASC NULLS FIRST], false, 0
(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#403]
Right keys [1]: [bid#440]
Join condition: None
(16) Project [codegen id : 5]
Output [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Input [6]: [bid#403, name#405, point_x#422, point_y#423, bid#440, pv#441]
(17) CollectLimit
Input [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Arguments: 1000
(18) LocalLimit
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: 1000
(19) Exchange
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: hashpartitioning(bid#403, 600), ENSURE_REQUIREMENTS, [plan_id=1029]
(20) Sort
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: [bid#403 ASC NULLS FIRST], false, 0
(21) Filter
Input [3]: [bid#440, pv#441, event_day#444]
Condition : (((isnotnull(bid#440) AND isnotnull(pv#441)) AND StartsWith(bid#440, 1)) AND (pv#441 > 100))
(22) Project
Output [2]: [bid#440, pv#441]
Input [3]: [bid#440, pv#441, event_day#444]
(23) Exchange
Input [2]: [bid#440, pv#441]
Arguments: hashpartitioning(bid#440, 600), ENSURE_REQUIREMENTS, [plan_id=1030]
(24) Sort
Input [2]: [bid#440, pv#441]
Arguments: [bid#440 ASC NULLS FIRST], false, 0
(25) SortMergeJoin
Left keys [1]: [bid#403]
Right keys [1]: [bid#440]
Join condition: None
(26) Project
Output [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Input [6]: [bid#403, name#405, point_x#422, point_y#423, bid#440, pv#441]
(27) CollectLimit
Input [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Arguments: 1000
(28) AdaptiveSparkPlan
Output [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Arguments: isFinalPlan=true
因为左关联,on中的条件属于连接条件,结果需要保留左表全部记录,所以t1表全量读取,t2表使用了谓词下推过滤。
2)where
select
t1.bid,
t1.name,
t1.point_x,
t1.point_y,
t2.pv
from poi_data.poi_res t1
left join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
on t1.bid=t2.bid
where t2.bid like '1%' and t2.pv>100;
== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
CollectLimit (17)
+- * Project (16)
+- * SortMergeJoin Inner (15)
:- * Sort (6)
: +- AQEShuffleRead (5)
: +- ShuffleQueryStage (4), Statistics(sizeInBytes=5.3 GiB, rowCount=4.57E+7)
: +- Exchange (3)
: +- * Filter (2)
: +- Scan hive poi_data.poi_res (1)
+- * Sort (14)
+- AQEShuffleRead (13)
+- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
+- Exchange (11)
+- * Project (10)
+- * Filter (9)
+- * ColumnarToRow (8)
+- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
CollectLimit (27)
+- Project (26)
+- SortMergeJoin Inner (25)
:- Sort (20)
: +- Exchange (19)
: +- Filter (18)
: +- Scan hive poi_data.poi_res (1)
+- Sort (24)
+- Exchange (23)
+- Project (22)
+- Filter (21)
+- Scan parquet bi_report.mon_ronghe_pv (7)
(1) Scan hive poi_data.poi_res
Output [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: [bid#456, name#458, point_x#475, point_y#476], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#456, type#457, name#458, address#459, phone#460, alias#461, post_code#462, catalog_id#463, c..., Partition Cols: []]
(2) Filter [codegen id : 1]
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Condition : (StartsWith(bid#456, 1) AND isnotnull(bid#456))
(3) Exchange
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: hashpartitioning(bid#456, 600), ENSURE_REQUIREMENTS, [plan_id=1176]
(4) ShuffleQueryStage
Output [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: 0
(5) AQEShuffleRead
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: coalesced
(6) Sort [codegen id : 3]
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: [bid#456 ASC NULLS FIRST], false, 0
(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#493, pv#494, event_day#497]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#497), (event_day#497 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>
(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#493, pv#494, event_day#497]
(9) Filter [codegen id : 2]
Input [3]: [bid#493, pv#494, event_day#497]
Condition : (((isnotnull(bid#493) AND isnotnull(pv#494)) AND StartsWith(bid#493, 1)) AND (pv#494 > 100))
(10) Project [codegen id : 2]
Output [2]: [bid#493, pv#494]
Input [3]: [bid#493, pv#494, event_day#497]
(11) Exchange
Input [2]: [bid#493, pv#494]
Arguments: hashpartitioning(bid#493, 600), ENSURE_REQUIREMENTS, [plan_id=1201]
(12) ShuffleQueryStage
Output [2]: [bid#493, pv#494]
Arguments: 1
(13) AQEShuffleRead
Input [2]: [bid#493, pv#494]
Arguments: coalesced
(14) Sort [codegen id : 4]
Input [2]: [bid#493, pv#494]
Arguments: [bid#493 ASC NULLS FIRST], false, 0
(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#456]
Right keys [1]: [bid#493]
Join condition: None
(16) Project [codegen id : 5]
Output [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Input [6]: [bid#456, name#458, point_x#475, point_y#476, bid#493, pv#494]
(17) CollectLimit
Input [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Arguments: 1000
(18) Filter
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Condition : (StartsWith(bid#456, 1) AND isnotnull(bid#456))
(19) Exchange
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: hashpartitioning(bid#456, 600), ENSURE_REQUIREMENTS, [plan_id=1161]
(20) Sort
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: [bid#456 ASC NULLS FIRST], false, 0
(21) Filter
Input [3]: [bid#493, pv#494, event_day#497]
Condition : (((isnotnull(bid#493) AND isnotnull(pv#494)) AND StartsWith(bid#493, 1)) AND (pv#494 > 100))
(22) Project
Output [2]: [bid#493, pv#494]
Input [3]: [bid#493, pv#494, event_day#497]
(23) Exchange
Input [2]: [bid#493, pv#494]
Arguments: hashpartitioning(bid#493, 600), ENSURE_REQUIREMENTS, [plan_id=1162]
(24) Sort
Input [2]: [bid#493, pv#494]
Arguments: [bid#493 ASC NULLS FIRST], false, 0
(25) SortMergeJoin
Left keys [1]: [bid#456]
Right keys [1]: [bid#493]
Join condition: None
(26) Project
Output [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Input [6]: [bid#456, name#458, point_x#475, point_y#476, bid#493, pv#494]
(27) CollectLimit
Input [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Arguments: 1000
(28) AdaptiveSparkPlan
Output [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Arguments: isFinalPlan=true
where属于过滤条件,影响左关联的最终结果,所以执行计划第(2)步中将where提前到join关联之前按照bid对t1表进行过滤。
四、总结
假设数据库系统支持谓词下推的前提下,
- 内连接:内连接的两个执行计划中,对t2表都使用了
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
,对t1表都使用了Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297))
,因此可以说,内连接中where和on在执行效率上没区别。 - 外连接:还是拿左外连接来说,右表相关的条件会使用谓词下推,而左表是否会提前过滤数据,取决于where还是on以及筛选条件是否与左表相关,1)当为on时,左表的数据必须全量读取,此时效率的差别主要取决于左表的数据量。2)当为where时,如果筛选条件涉及到左表,则会进行数据的提前过滤,否则左表仍然全量读取。
PS
在内连接的物理执行计划中,对poi_res表的过滤单独作为一个Filter步骤(2)Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297))
,而对mon_ronghe_pv表的过滤在第(7)步scan中PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
,二者有什么区别?查了一些资料,说的是可以将PushedFilters理解为在读取数据时的过滤,不满足条件的数据直接不读取。Filter时将数据读取之后,再判断是否满足条件,决定是否参与后续计算。
既然都是在数据源侧进行数据过滤,为什么Filter不能像PushedFilters那样,直接在读取数据的时候判断,减少读入的数据量呢,这样也可以提升效率,这是一开始个人的疑问。查了一些资料,说的是是否支持在scan时filter数据,主要受数据源的影响。大数据中的存储方式主要分为行式存储和列式存储,列式存储的数据存储方式和丰富的元数据对谓词下推技术有更好的支持。当前测试中,mon_ronghe_pv表的存储格式为parquet,poi_res表存储格式text。
开放原子开发者工作坊旨在鼓励更多人参与开源活动,与志同道合的开发者们相互交流开发经验、分享开发心得、获取前沿技术趋势。工作坊有多种形式的开发者活动,如meetup、训练营等,主打技术交流,干货满满,真诚地邀请各位开发者共同参与!
更多推荐
所有评论(0)