好得很程序员自学网

<tfoot draggable='sEl'></tfoot>

PostgreSQL 对IN,EXISTS,ANY/ALL,JOIN的sql优化方案

测试环境:

?

1

2

3

4

5

6

postgres=# select version();      

                          version                       

---------------------------------------------------------------------------------------------------------

  PostgreSQL 11.9 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64- bit

(1 row)

postgres=#

数据准备:

?

1

$ pgbench -i -s 10

?

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

postgres=# \d

        List of relations

  Schema |    Name     | Type | Owner

--------+------------------+-------+----------

  public | pgbench_accounts | table | postgres

  public | pgbench_branches | table | postgres

  public | pgbench_history | table | postgres

  public | pgbench_tellers | table | postgres

(4 rows )

 

postgres=# select * from pgbench_accounts limit 1;

  aid | bid | abalance |                    filler                   

-----+-----+----------+--------------------------------------------------------------------------------------

   1 |  1 |    0 |                                         

(1 row)

 

postgres=# select * from pgbench_branches limit 1;

  bid | bbalance | filler

-----+----------+--------

   1 |    0 |

(1 row)

 

postgres=# select * from pgbench_history limit 1;

  tid | bid | aid | delta | mtime | filler

-----+-----+-----+-------+-------+--------

(0 rows )

 

postgres=# select * from pgbench_tellers limit 1;

  tid | bid | tbalance | filler

-----+-----+----------+--------

   1 |  1 |    0 |

(1 row)

 

postgres=# select * from pgbench_branches;

  bid | bbalance | filler

-----+----------+--------

   1 |    0 |

   2 |    0 |

   3 |    0 |

   4 |    0 |

   5 |    0 |

   6 |    0 |

   7 |    0 |

   8 |    0 |

   9 |    0 |

  10 |    0 |

(10 rows )

 

postgres=# update pgbench_branches set bbalance=4500000 where bid in (4,7);

UPDATE 2

postgres=#

IN语句

查询要求:找出那些余额(balance)大于0的每个分支(branch)在表在pgbench_accounts中有多少个账户

1.使用IN子句 

?

1

2

3

4

5

6

7

8

SELECT

   count ( aid ),bid

FROM

   pgbench_accounts

WHERE

   bid IN ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 )

GROUP BY

   bid;

2.使用ANY子句

?

1

2

3

4

5

6

7

8

SELECT

   count ( aid ),bid

FROM

   pgbench_accounts

WHERE

   bid = ANY ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 )

GROUP BY

   bid;

  

3.使用EXISTS子句

?

1

2

3

4

5

6

7

8

SELECT

   count ( aid ),bid

FROM

   pgbench_accounts

WHERE

   EXISTS ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 AND pgbench_accounts.bid = pgbench_branches.bid )

GROUP BY

   bid;

  

4.使用INNER JOIN

?

1

2

3

4

5

6

7

8

9

SELECT

   count ( aid ),a.bid

FROM

   pgbench_accounts a

   JOIN pgbench_branches b ON a.bid = b.bid

WHERE

   b.bbalance > 0

GROUP BY

   a.bid;

在完成这个查询要求的时候,有人可能会假设exists和inner join性能可能会更好,因为他们可以使用两表连接的逻辑和优化。而IN和ANY子句需要使用子查询。

然而,PostgreSQL(10版本之后)已经智能的足以对上面四种写法产生相同的执行计划!

所有上面的写法都会产生相同的执行计划:

?

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

                                       QUERY PLAN                                     

------------------------------------------------------------------------------------------------------------------------------------------------------------------

  Finalize GroupAggregate (cost=23327.73..23330.26 rows =10 width=12) (actual time =97.199..99.014 rows =2 loops=1)

   Group Key : a.bid

   -> Gather Merge (cost=23327.73..23330.06 rows =20 width=12) (actual time =97.191..99.006 rows =6 loops=1)

      Workers Planned: 2

      Workers Launched: 2

      -> Sort (cost=22327.70..22327.73 rows =10 width=12) (actual time =93.762..93.766 rows =2 loops=3)

         Sort Key : a.bid

         Sort Method: quicksort Memory: 25kB

         Worker 0: Sort Method: quicksort Memory: 25kB

         Worker 1: Sort Method: quicksort Memory: 25kB

         -> Partial HashAggregate (cost=22327.44..22327.54 rows =10 width=12) (actual time =93.723..93.727 rows =2 loops=3)

            Group Key : a.bid

            -> Hash Join (cost=1.14..22119.10 rows =41667 width=8) (actual time =24.024..83.263 rows =66667 loops=3)

               Hash Cond: (a.bid = b.bid)

               -> Parallel Seq Scan on pgbench_accounts a (cost=0.00..20560.67 rows =416667 width=8) (actual time =0.023..43.151 rows =333333 loops=3)

               -> Hash (cost=1.12..1.12 rows =1 width=4) (actual time =0.027..0.028 rows =2 loops=3)

                  Buckets: 1024 Batches: 1 Memory Usage: 9kB

                  -> Seq Scan on pgbench_branches b (cost=0.00..1.12 rows =1 width=4) (actual time =0.018..0.020 rows =2 loops=3)

                     Filter: (bbalance > 0)

                     Rows Removed by Filter: 8

  Planning Time : 0.342 ms

  Execution Time : 99.164 ms

(22 rows )

那么,我们是否可以得出这样的结论:我们可以随意地编写查询,而PostgreSQL的智能将会处理其余的问题?!

等等!

如果我们考虑排除情况,事情会变得不同。

排除查询

查询要求:找出那些余额(balance)不大于0的每个分支(branch)在表在pgbench_accounts中有多少个账户

1.使用NOT IN

?

1

2

3

4

5

6

7

8

SELECT

   count ( aid ),bid

FROM

   pgbench_accounts

WHERE

   bid NOT IN ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 )

GROUP BY

   bid;

执行计划:

?

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

                                     QUERY PLAN                                   

----------------------------------------------------------------------------------------------------------------------------------------------------------

  Finalize GroupAggregate (cost=23645.42..23647.95 rows =10 width=12) (actual time =128.606..130.502 rows =8 loops=1)

   Group Key : pgbench_accounts.bid

   -> Gather Merge (cost=23645.42..23647.75 rows =20 width=12) (actual time =128.598..130.490 rows =24 loops=1)

      Workers Planned: 2

      Workers Launched: 2

      -> Sort (cost=22645.39..22645.42 rows =10 width=12) (actual time =124.960..124.963 rows =8 loops=3)

         Sort Key : pgbench_accounts.bid

         Sort Method: quicksort Memory: 25kB

         Worker 0: Sort Method: quicksort Memory: 25kB

         Worker 1: Sort Method: quicksort Memory: 25kB

         -> Partial HashAggregate (cost=22645.13..22645.23 rows =10 width=12) (actual time =124.917..124.920 rows =8 loops=3)

            Group Key : pgbench_accounts.bid

            -> Parallel Seq Scan on pgbench_accounts (cost=1.13..21603.46 rows =208333 width=8) (actual time =0.078..83.134 rows =266667 loops=3)

               Filter: ( NOT (hashed SubPlan 1))

               Rows Removed by Filter: 66667

               SubPlan 1

                -> Seq Scan on pgbench_branches (cost=0.00..1.12 rows =1 width=4) (actual time =0.020..0.021 rows =2 loops=3)

                   Filter: (bbalance > 0)

                   Rows Removed by Filter: 8

  Planning Time : 0.310 ms

  Execution Time : 130.620 ms

(21 rows )

 

postgres=#

2.使用<>ALL

?

1

2

3

4

5

6

7

8

SELECT

   count ( aid ),bid

FROM

   pgbench_accounts

WHERE

   bid <> ALL ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 )

GROUP BY

   bid;

执行计划:

?

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

                                      QUERY PLAN                                   

------------------------------------------------------------------------------------------------------------------------------------------------------------

  Finalize GroupAggregate (cost=259581.79..259584.32 rows =10 width=12) (actual time =418.220..419.913 rows =8 loops=1)

   Group Key : pgbench_accounts.bid

   -> Gather Merge (cost=259581.79..259584.12 rows =20 width=12) (actual time =418.212..419.902 rows =24 loops=1)

      Workers Planned: 2

      Workers Launched: 2

      -> Sort (cost=258581.76..258581.79 rows =10 width=12) (actual time =413.906..413.909 rows =8 loops=3)

         Sort Key : pgbench_accounts.bid

         Sort Method: quicksort Memory: 25kB

         Worker 0: Sort Method: quicksort Memory: 25kB

         Worker 1: Sort Method: quicksort Memory: 25kB

         -> Partial HashAggregate (cost=258581.50..258581.60 rows =10 width=12) (actual time =413.872..413.875 rows =8 loops=3)

            Group Key : pgbench_accounts.bid

            -> Parallel Seq Scan on pgbench_accounts (cost=0.00..257539.83 rows =208333 width=8) (actual time =0.054..367.244 rows =266667 loops=3)

               Filter: (SubPlan 1)

               Rows Removed by Filter: 66667

               SubPlan 1

                -> Materialize (cost=0.00..1.13 rows =1 width=4) (actual time =0.000..0.001 rows =2 loops=1000000)

                   -> Seq Scan on pgbench_branches (cost=0.00..1.12 rows =1 width=4) (actual time =0.001..0.001 rows =2 loops=337880)

                      Filter: (bbalance > 0)

                      Rows Removed by Filter: 8

  Planning Time : 0.218 ms

  Execution Time : 420.035 ms

(22 rows )

postgres=#

3.使用NOT EXISTS

?

1

2

3

4

5

6

7

8

SELECT

   count ( aid ),bid

FROM

   pgbench_accounts

WHERE

   NOT EXISTS ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 AND pgbench_accounts.bid = pgbench_branches.bid )

GROUP BY

   bid;

执行计划:

?

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

                                       QUERY PLAN                                    

----------------------------------------------------------------------------------------------------------------------------------------------------------------

  Finalize GroupAggregate (cost=28327.72..28330.25 rows =10 width=12) (actual time =152.024..153.931 rows =8 loops=1)

   Group Key : pgbench_accounts.bid

   -> Gather Merge (cost=28327.72..28330.05 rows =20 width=12) (actual time =152.014..153.917 rows =24 loops=1)

      Workers Planned: 2

      Workers Launched: 2

      -> Sort (cost=27327.70..27327.72 rows =10 width=12) (actual time =147.782..147.786 rows =8 loops=3)

         Sort Key : pgbench_accounts.bid

         Sort Method: quicksort Memory: 25kB

         Worker 0: Sort Method: quicksort Memory: 25kB

         Worker 1: Sort Method: quicksort Memory: 25kB

         -> Partial HashAggregate (cost=27327.43..27327.53 rows =10 width=12) (actual time =147.732..147.737 rows =8 loops=3)

            Group Key : pgbench_accounts.bid

            -> Hash Anti Join (cost=1.14..25452.43 rows =375000 width=8) (actual time =0.134..101.884 rows =266667 loops=3)

               Hash Cond: (pgbench_accounts.bid = pgbench_branches.bid)

               -> Parallel Seq Scan on pgbench_accounts (cost=0.00..20560.67 rows =416667 width=8) (actual time =0.032..45.174 rows =333333 loops=3)

               -> Hash (cost=1.12..1.12 rows =1 width=4) (actual time =0.036..0.037 rows =2 loops=3)

                  Buckets: 1024 Batches: 1 Memory Usage: 9kB

                  -> Seq Scan on pgbench_branches (cost=0.00..1.12 rows =1 width=4) (actual time =0.025..0.027 rows =2 loops=3)

                     Filter: (bbalance > 0)

                     Rows Removed by Filter: 8

  Planning Time : 0.322 ms

  Execution Time : 154.040 ms

(22 rows )

postgres=#

4.使用LEFT JOIN和IS NULL

?

1

2

3

4

5

6

7

8

9

SELECT

   count ( aid ),a.bid

FROM

   pgbench_accounts a

   LEFT JOIN pgbench_branches b ON a.bid = b.bid AND b.bbalance > 0

WHERE

   b.bid IS NULL

GROUP BY

   a.bid;

执行计划:

?

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

                                       QUERY PLAN                                     

------------------------------------------------------------------------------------------------------------------------------------------------------------------

  Finalize GroupAggregate (cost=28327.72..28330.25 rows =10 width=12) (actual time =145.298..147.096 rows =8 loops=1)

   Group Key : a.bid

   -> Gather Merge (cost=28327.72..28330.05 rows =20 width=12) (actual time =145.288..147.083 rows =24 loops=1)

      Workers Planned: 2

      Workers Launched: 2

      -> Sort (cost=27327.70..27327.72 rows =10 width=12) (actual time =141.883..141.887 rows =8 loops=3)

         Sort Key : a.bid

         Sort Method: quicksort Memory: 25kB

         Worker 0: Sort Method: quicksort Memory: 25kB

         Worker 1: Sort Method: quicksort Memory: 25kB

         -> Partial HashAggregate (cost=27327.43..27327.53 rows =10 width=12) (actual time =141.842..141.847 rows =8 loops=3)

            Group Key : a.bid

            -> Hash Anti Join (cost=1.14..25452.43 rows =375000 width=8) (actual time =0.087..99.535 rows =266667 loops=3)

               Hash Cond: (a.bid = b.bid)

               -> Parallel Seq Scan on pgbench_accounts a (cost=0.00..20560.67 rows =416667 width=8) (actual time =0.025..44.337 rows =333333 loops=3)

               -> Hash (cost=1.12..1.12 rows =1 width=4) (actual time =0.026..0.027 rows =2 loops=3)

                  Buckets: 1024 Batches: 1 Memory Usage: 9kB

                  -> Seq Scan on pgbench_branches b (cost=0.00..1.12 rows =1 width=4) (actual time =0.019..0.020 rows =2 loops=3)

                     Filter: (bbalance > 0)

                     Rows Removed by Filter: 8

  Planning Time : 0.231 ms

  Execution Time : 147.180 ms

(22 rows )

postgres=#

NOT IN 和 <> ALL生成执行计划都包含了一个子查询。他们是各自独立的。

而NOT EXISTS和LEFT JOIN生成了相同的执行计划。

这些hash连接(或hash anti join)是完成查询要求的最灵活的方式。这也是推荐exists或join的原因。因此,推荐使用exists或join的经验法则是有效的。

但是,我们继续往下看! 即使有了子查询执行计划,NOT IN子句的执行时间也会更好?

是的。PostgreSQL做了出色的优化,PostgreSQL将子查询计划进行了hash处理。因此PostgreSQL对如何处理IN子句有了更好的理解,这是一种逻辑思维方式,因为很多人倾向于使用IN子句。子查询返回的行很少,但即使子查询返回几百行,也会发生同样的情况。

但是,如果子查询返回大量行(几十万行)怎么办?让我们尝试一个简单的测试:

?

1

2

3

4

5

6

7

8

9

10

11

12

13

CREATE TABLE t1 AS

SELECT * FROM generate_series(0, 500000) id;

 

CREATE TABLE t2 AS

SELECT (random() * 4000000):: integer id

FROM generate_series(0, 4000000);

 

ANALYZE t1;

ANALYZE t2;

 

EXPLAIN SELECT id

FROM t1

WHERE id NOT IN ( SELECT id FROM t2);

执行计划:

?

1

2

3

4

5

6

7

8

9

10

11

12

     QUERY PLAN                

--------------------------------------------------------------------------------

  Gather (cost=1000.00..15195064853.01 rows =250000 width=4)

   Workers Planned: 1

   -> Parallel Seq Scan on t1 (cost=0.00..15195038853.01 rows =147059 width=4)

      Filter: ( NOT (SubPlan 1))

      SubPlan 1

       -> Materialize (cost=0.00..93326.01 rows =4000001 width=4)

          -> Seq Scan on t2 (cost=0.00..57700.01 rows =4000001 width=4)

(7 rows )

 

postgres=#

这里,执行计划将子查询进行了物化。代价评估变成了15195038853.01。(PostgreSQL的默认设置,如果t2表的行低于100k,会将子查询进行hash)。这样就会严重影响性能。因此,对于那种子查询返回的行数很少的场景,IN子句可以起到很好的作用。

其它注意点

有的!在我们用不同的方式写查询的时候,可能有数据类型的转换。

比如,语句:

?

1

EXPLAIN ANALYZE SELECT * FROM emp WHERE gen = ANY (ARRAY[ 'M' , 'F' ]);

就会发生隐式的类型转换:

?

1

2

Seq Scan on emp (cost=0.00..1.04 rows =2 width=43) (actual time =0.023..0.026 rows =3 loops=1)

  Filter: ((gen)::text = ANY ( '{M,F}' ::text[]))

这里的(gen)::text就发生了类型转换。如果在大表上,这种类型转换的代价会很高,因此,PostgreSQL对IN子句做了更好的处理。

?

1

2

3

4

EXPLAIN ANALYZE SELECT * FROM emp WHERE gen IN ( 'M' , 'F' );

 

  Seq Scan on emp (cost=0.00..1.04 rows =3 width=43) (actual time =0.030..0.034 rows =3 loops=1)

   Filter: (gen = ANY ( '{M,F}' ::bpchar[]))

将IN子句转换成了ANY子句,没有对gen列进行类型转换。而是将M\F转成了bpchar(内部等价于char)

总结

简单来说,exists和直接join表通常比较好。

很多情况下,PostgreSQL将IN子句换成被hash的子计划。在一些特殊场景下,IN可以获得更好的执行计划。

以上为个人经验,希望能给大家一个参考,也希望大家多多支持服务器之家。如有错误或未考虑完全的地方,望不吝赐教。

原文链接:https://www.cnblogs.com/abclife/p/13940948.html

查看更多关于PostgreSQL 对IN,EXISTS,ANY/ALL,JOIN的sql优化方案的详细内容...

  阅读:31次