sparksql optimization join

Updated to 23 days ago

Article Directory

Preface
1 concept: streamIter and lookup table (buildIter)
2 concepts: 3 ways to implement sparksql joins
3 4 ways to join
References

Preface

This article is a summary of the following two articles.
Three ways to implement Spark SQL join - Read more books and newspapers - Blog Park ()

Spark SQL Join Implementation - Cloud + Community - Tencent Cloud ()

1 concept: streamIter and lookup table (buildIter)

See the concept of stream traversal table (streamIter) and lookup table (buildIter)Spark SQL Join Implementation - Cloud + Community - Tencent Cloud ()

Generally, streamlter is a large table, bulker is a small table

2 concepts: 3 ways to implement sparksql joins

sort merge join: There is a shuffle operation, suitable for two large tables

broadcast join: broadcast the builder table to each executor, so the builder table should be smaller. When the default builder table in sparks is less than 10M, the broadcast join method is used, which is suitable for large tables + small tables

hash join: It is not enabled by default. If the sort merge join is enabled, it is not much worse than it. It is suitable for large tables + small tables (slightly larger than the small tables in broadcast)

3 4 ways to join

inner join: When we write sql statements or use DataFrmae, we canDon't worry about which one is the left table and which one is the right table, During the spark sql query optimization stage, spark will automatically set the large table to the left table, that is, streamIter, and the small table to the right table, that is, buildIter.

left outer joinThe left table shall prevail, and the matching records are found in the right table. If the search fails, a record with all fields null is returned. When we write SQL statements or use DataFrmae,Generally, let the big table be on the left and the small table be on the right.。

right outer joinThe right table shall prevail, and the matching records are found in the left table. If the search fails, a record with all fields null is returned. So, the right table is streamIter and the left table is buildIter. When we write SQL statements or use DataFrmae,Generally, let the big table be on the right and the small table be on the left。

full outer joinDon't care about the left and right tables

References

Three ways to implement Spark SQL join - Read more books and newspapers - Blog Park ()

Spark SQL Join Implementation - Cloud + Community - Tencent Cloud ()