PostgreSQL 对IN,EXISTS,ANY/ALL,JOIN的sql优化方案

测试环境：

postgres=# select version();

version

---------------------------------------------------------------------------------------------------------

PostgreSQL 11.9 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64- bit

(1 row)

postgres=#

数据准备：

1	$ pgbench -i -s 10

								
									 postgres=# \d 

									          List   of   relations 

									    Schema   |      Name      | Type | Owner  

									 --------+------------------+-------+---------- 

									    public   | pgbench_accounts |   table   | postgres 

									    public   | pgbench_branches |   table   | postgres 

									    public   | pgbench_history |   table   | postgres 

									    public   | pgbench_tellers |   table   | postgres 

									 (4   rows  ) 

									 postgres=#   select   *   from   pgbench_accounts limit 1; 

									    aid | bid | abalance |                    filler                     

									 -----+-----+----------+-------------------------------------------------------------------------------------- 

									     1 |  1 |    0 |                                           

									 (1 row) 

									 postgres=#   select   *   from   pgbench_branches limit 1; 

									    bid | bbalance | filler 

									 -----+----------+-------- 

									     1 |    0 | 

									 (1 row) 

									 postgres=#   select   *   from   pgbench_history limit 1; 

									    tid | bid | aid | delta | mtime | filler 

									 -----+-----+-----+-------+-------+-------- 

									 (0   rows  ) 

									 postgres=#   select   *   from   pgbench_tellers limit 1; 

									    tid | bid | tbalance | filler 

									 -----+-----+----------+-------- 

									     1 |  1 |    0 | 

									 (1 row) 

									 postgres=#   select   *   from   pgbench_branches; 

									    bid | bbalance | filler 

									 -----+----------+-------- 

									     1 |    0 | 

									     2 |    0 | 

									     3 |    0 | 

									     4 |    0 | 

									     5 |    0 | 

									     6 |    0 | 

									     7 |    0 | 

									     8 |    0 | 

									     9 |    0 | 

									    10 |    0 | 

									 (10   rows  ) 

									 postgres=#   update   pgbench_branches   set   bbalance=4500000   where   bid   in   (4,7); 

									 UPDATE   2 

									 postgres=#

IN语句

查询要求：找出那些余额(balance)大于0的每个分支(branch)在表在pgbench_accounts中有多少个账户

1.使用IN子句　

SELECT

count ( aid ),bid

FROM

pgbench_accounts

WHERE

bid IN ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 )

GROUP BY

bid;

2.使用ANY子句

SELECT

count ( aid ),bid

FROM

pgbench_accounts

WHERE

bid = ANY ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 )

GROUP BY

bid;

3.使用EXISTS子句

SELECT

count ( aid ),bid

FROM

pgbench_accounts

WHERE

EXISTS ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 AND pgbench_accounts.bid = pgbench_branches.bid )

GROUP BY

bid;

4.使用INNER JOIN

								
									 SELECT 

									     count  ( aid ),a.bid 

									 FROM 

									     pgbench_accounts a 

									     JOIN   pgbench_branches b   ON   a.bid = b.bid 

									 WHERE 

									     b.bbalance > 0 

									 GROUP   BY 

									     a.bid;

在完成这个查询要求的时候，有人可能会假设exists和inner join性能可能会更好，因为他们可以使用两表连接的逻辑和优化。而IN和ANY子句需要使用子查询。

然而，PostgreSQL(10版本之后)已经智能的足以对上面四种写法产生相同的执行计划！

所有上面的写法都会产生相同的执行计划：

								
									                                         QUERY PLAN                                       

									 ------------------------------------------------------------------------------------------------------------------------------------------------------------------ 

									    Finalize GroupAggregate (cost=23327.73..23330.26   rows  =10 width=12) (actual   time  =97.199..99.014   rows  =2 loops=1) 

									     Group   Key  : a.bid 

									     -> Gather Merge (cost=23327.73..23330.06   rows  =20 width=12) (actual   time  =97.191..99.006   rows  =6 loops=1) 

									        Workers Planned: 2 

									        Workers Launched: 2 

									        -> Sort (cost=22327.70..22327.73   rows  =10 width=12) (actual   time  =93.762..93.766   rows  =2 loops=3) 

									           Sort   Key  : a.bid 

									           Sort Method: quicksort Memory: 25kB 

									           Worker 0: Sort Method: quicksort Memory: 25kB 

									           Worker 1: Sort Method: quicksort Memory: 25kB 

									           ->   Partial   HashAggregate (cost=22327.44..22327.54   rows  =10 width=12) (actual   time  =93.723..93.727   rows  =2 loops=3) 

									              Group   Key  : a.bid 

									              -> Hash   Join   (cost=1.14..22119.10   rows  =41667 width=8) (actual   time  =24.024..83.263   rows  =66667 loops=3) 

									                 Hash Cond: (a.bid = b.bid) 

									                 -> Parallel Seq Scan   on   pgbench_accounts a (cost=0.00..20560.67   rows  =416667 width=8) (actual   time  =0.023..43.151   rows  =333333 loops=3) 

									                 -> Hash (cost=1.12..1.12   rows  =1 width=4) (actual   time  =0.027..0.028   rows  =2 loops=3) 

									                    Buckets: 1024 Batches: 1 Memory Usage: 9kB 

									                    -> Seq Scan   on   pgbench_branches b (cost=0.00..1.12   rows  =1 width=4) (actual   time  =0.018..0.020   rows  =2 loops=3) 

									                       Filter: (bbalance > 0) 

									                       Rows   Removed   by   Filter: 8 

									    Planning   Time  : 0.342 ms 

									    Execution   Time  : 99.164 ms 

									 (22   rows  )

那么，我们是否可以得出这样的结论:我们可以随意地编写查询，而PostgreSQL的智能将会处理其余的问题?！

等等!

如果我们考虑排除情况，事情会变得不同。

排除查询

查询要求：找出那些余额(balance)不大于0的每个分支(branch)在表在pgbench_accounts中有多少个账户

1.使用NOT IN

SELECT

count ( aid ),bid

FROM

pgbench_accounts

WHERE

bid NOT IN ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 )

GROUP BY

bid;

执行计划:

								
									                                       QUERY PLAN                                     

									 ---------------------------------------------------------------------------------------------------------------------------------------------------------- 

									    Finalize GroupAggregate (cost=23645.42..23647.95   rows  =10 width=12) (actual   time  =128.606..130.502   rows  =8 loops=1) 

									     Group   Key  : pgbench_accounts.bid 

									     -> Gather Merge (cost=23645.42..23647.75   rows  =20 width=12) (actual   time  =128.598..130.490   rows  =24 loops=1) 

									        Workers Planned: 2 

									        Workers Launched: 2 

									        -> Sort (cost=22645.39..22645.42   rows  =10 width=12) (actual   time  =124.960..124.963   rows  =8 loops=3) 

									           Sort   Key  : pgbench_accounts.bid 

									           Sort Method: quicksort Memory: 25kB 

									           Worker 0: Sort Method: quicksort Memory: 25kB 

									           Worker 1: Sort Method: quicksort Memory: 25kB 

									           ->   Partial   HashAggregate (cost=22645.13..22645.23   rows  =10 width=12) (actual   time  =124.917..124.920   rows  =8 loops=3) 

									              Group   Key  : pgbench_accounts.bid 

									              -> Parallel Seq Scan   on   pgbench_accounts (cost=1.13..21603.46   rows  =208333 width=8) (actual   time  =0.078..83.134   rows  =266667 loops=3) 

									                 Filter: (  NOT   (hashed SubPlan 1)) 

									                 Rows   Removed   by   Filter: 66667 

									                 SubPlan 1 

									                  -> Seq Scan   on   pgbench_branches (cost=0.00..1.12   rows  =1 width=4) (actual   time  =0.020..0.021   rows  =2 loops=3) 

									                     Filter: (bbalance > 0) 

									                     Rows   Removed   by   Filter: 8 

									    Planning   Time  : 0.310 ms 

									    Execution   Time  : 130.620 ms 

									 (21   rows  ) 

									 postgres=#

2.使用<>ALL

SELECT

count ( aid ),bid

FROM

pgbench_accounts

WHERE

bid <> ALL ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 )

GROUP BY

bid;

执行计划:

								
									                                        QUERY PLAN                                     

									 ------------------------------------------------------------------------------------------------------------------------------------------------------------ 

									    Finalize GroupAggregate (cost=259581.79..259584.32   rows  =10 width=12) (actual   time  =418.220..419.913   rows  =8 loops=1) 

									     Group   Key  : pgbench_accounts.bid 

									     -> Gather Merge (cost=259581.79..259584.12   rows  =20 width=12) (actual   time  =418.212..419.902   rows  =24 loops=1) 

									        Workers Planned: 2 

									        Workers Launched: 2 

									        -> Sort (cost=258581.76..258581.79   rows  =10 width=12) (actual   time  =413.906..413.909   rows  =8 loops=3) 

									           Sort   Key  : pgbench_accounts.bid 

									           Sort Method: quicksort Memory: 25kB 

									           Worker 0: Sort Method: quicksort Memory: 25kB 

									           Worker 1: Sort Method: quicksort Memory: 25kB 

									           ->   Partial   HashAggregate (cost=258581.50..258581.60   rows  =10 width=12) (actual   time  =413.872..413.875   rows  =8 loops=3) 

									              Group   Key  : pgbench_accounts.bid 

									              -> Parallel Seq Scan   on   pgbench_accounts (cost=0.00..257539.83   rows  =208333 width=8) (actual   time  =0.054..367.244   rows  =266667 loops=3) 

									                 Filter: (SubPlan 1) 

									                 Rows   Removed   by   Filter: 66667 

									                 SubPlan 1 

									                  -> Materialize (cost=0.00..1.13   rows  =1 width=4) (actual   time  =0.000..0.001   rows  =2 loops=1000000) 

									                     -> Seq Scan   on   pgbench_branches (cost=0.00..1.12   rows  =1 width=4) (actual   time  =0.001..0.001   rows  =2 loops=337880) 

									                        Filter: (bbalance > 0) 

									                        Rows   Removed   by   Filter: 8 

									    Planning   Time  : 0.218 ms 

									    Execution   Time  : 420.035 ms 

									 (22   rows  )  

									 postgres=#

3.使用NOT EXISTS

SELECT

count ( aid ),bid

FROM

pgbench_accounts

WHERE

NOT EXISTS ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 AND pgbench_accounts.bid = pgbench_branches.bid )

GROUP BY

bid;

执行计划:

								
									                                         QUERY PLAN                                      

									 ---------------------------------------------------------------------------------------------------------------------------------------------------------------- 

									    Finalize GroupAggregate (cost=28327.72..28330.25   rows  =10 width=12) (actual   time  =152.024..153.931   rows  =8 loops=1) 

									     Group   Key  : pgbench_accounts.bid 

									     -> Gather Merge (cost=28327.72..28330.05   rows  =20 width=12) (actual   time  =152.014..153.917   rows  =24 loops=1) 

									        Workers Planned: 2 

									        Workers Launched: 2 

									        -> Sort (cost=27327.70..27327.72   rows  =10 width=12) (actual   time  =147.782..147.786   rows  =8 loops=3) 

									           Sort   Key  : pgbench_accounts.bid 

									           Sort Method: quicksort Memory: 25kB 

									           Worker 0: Sort Method: quicksort Memory: 25kB 

									           Worker 1: Sort Method: quicksort Memory: 25kB 

									           ->   Partial   HashAggregate (cost=27327.43..27327.53   rows  =10 width=12) (actual   time  =147.732..147.737   rows  =8 loops=3) 

									              Group   Key  : pgbench_accounts.bid 

									              -> Hash Anti   Join   (cost=1.14..25452.43   rows  =375000 width=8) (actual   time  =0.134..101.884   rows  =266667 loops=3) 

									                 Hash Cond: (pgbench_accounts.bid = pgbench_branches.bid) 

									                 -> Parallel Seq Scan   on   pgbench_accounts (cost=0.00..20560.67   rows  =416667 width=8) (actual   time  =0.032..45.174   rows  =333333 loops=3) 

									                 -> Hash (cost=1.12..1.12   rows  =1 width=4) (actual   time  =0.036..0.037   rows  =2 loops=3) 

									                    Buckets: 1024 Batches: 1 Memory Usage: 9kB 

									                    -> Seq Scan   on   pgbench_branches (cost=0.00..1.12   rows  =1 width=4) (actual   time  =0.025..0.027   rows  =2 loops=3) 

									                       Filter: (bbalance > 0) 

									                       Rows   Removed   by   Filter: 8 

									    Planning   Time  : 0.322 ms 

									    Execution   Time  : 154.040 ms 

									 (22   rows  )  

									 postgres=#

4.使用LEFT JOIN和IS NULL

								
									 SELECT 

									     count  ( aid ),a.bid 

									 FROM 

									     pgbench_accounts a 

									     LEFT   JOIN   pgbench_branches b   ON   a.bid = b.bid   AND   b.bbalance > 0 

									 WHERE 

									     b.bid   IS   NULL 

									 GROUP   BY 

									     a.bid;

执行计划:

								
									                                         QUERY PLAN                                       

									 ------------------------------------------------------------------------------------------------------------------------------------------------------------------ 

									    Finalize GroupAggregate (cost=28327.72..28330.25   rows  =10 width=12) (actual   time  =145.298..147.096   rows  =8 loops=1) 

									     Group   Key  : a.bid 

									     -> Gather Merge (cost=28327.72..28330.05   rows  =20 width=12) (actual   time  =145.288..147.083   rows  =24 loops=1) 

									        Workers Planned: 2 

									        Workers Launched: 2 

									        -> Sort (cost=27327.70..27327.72   rows  =10 width=12) (actual   time  =141.883..141.887   rows  =8 loops=3) 

									           Sort   Key  : a.bid 

									           Sort Method: quicksort Memory: 25kB 

									           Worker 0: Sort Method: quicksort Memory: 25kB 

									           Worker 1: Sort Method: quicksort Memory: 25kB 

									           ->   Partial   HashAggregate (cost=27327.43..27327.53   rows  =10 width=12) (actual   time  =141.842..141.847   rows  =8 loops=3) 

									              Group   Key  : a.bid 

									              -> Hash Anti   Join   (cost=1.14..25452.43   rows  =375000 width=8) (actual   time  =0.087..99.535   rows  =266667 loops=3) 

									                 Hash Cond: (a.bid = b.bid) 

									                 -> Parallel Seq Scan   on   pgbench_accounts a (cost=0.00..20560.67   rows  =416667 width=8) (actual   time  =0.025..44.337   rows  =333333 loops=3) 

									                 -> Hash (cost=1.12..1.12   rows  =1 width=4) (actual   time  =0.026..0.027   rows  =2 loops=3) 

									                    Buckets: 1024 Batches: 1 Memory Usage: 9kB 

									                    -> Seq Scan   on   pgbench_branches b (cost=0.00..1.12   rows  =1 width=4) (actual   time  =0.019..0.020   rows  =2 loops=3) 

									                       Filter: (bbalance > 0) 

									                       Rows   Removed   by   Filter: 8 

									    Planning   Time  : 0.231 ms 

									    Execution   Time  : 147.180 ms 

									 (22   rows  )  

									 postgres=#

NOT IN 和 <> ALL生成执行计划都包含了一个子查询。他们是各自独立的。

而NOT EXISTS和LEFT JOIN生成了相同的执行计划。

这些hash连接(或hash anti join)是完成查询要求的最灵活的方式。这也是推荐exists或join的原因。因此，推荐使用exists或join的经验法则是有效的。

但是，我们继续往下看! 即使有了子查询执行计划，NOT IN子句的执行时间也会更好?

是的。PostgreSQL做了出色的优化，PostgreSQL将子查询计划进行了hash处理。因此PostgreSQL对如何处理IN子句有了更好的理解，这是一种逻辑思维方式，因为很多人倾向于使用IN子句。子查询返回的行很少，但即使子查询返回几百行，也会发生同样的情况。

但是，如果子查询返回大量行(几十万行)怎么办?让我们尝试一个简单的测试：

								
									 CREATE   TABLE   t1   AS 

									 SELECT   *   FROM   generate_series(0, 500000) id; 

									 CREATE   TABLE   t2   AS 

									 SELECT   (random() * 4000000)::  integer   id 

									 FROM   generate_series(0, 4000000); 

									 ANALYZE t1; 

									 ANALYZE t2; 

									 EXPLAIN   SELECT   id 

									 FROM   t1 

									 WHERE   id   NOT   IN   (  SELECT   id   FROM   t2);

执行计划：

								
									       QUERY PLAN                  

									 -------------------------------------------------------------------------------- 

									    Gather (cost=1000.00..15195064853.01   rows  =250000 width=4) 

									     Workers Planned: 1 

									     -> Parallel Seq Scan   on   t1 (cost=0.00..15195038853.01   rows  =147059 width=4) 

									        Filter: (  NOT   (SubPlan 1)) 

									        SubPlan 1 

									         -> Materialize (cost=0.00..93326.01   rows  =4000001 width=4) 

									            -> Seq Scan   on   t2 (cost=0.00..57700.01   rows  =4000001 width=4) 

									 (7   rows  ) 

									 postgres=#

这里，执行计划将子查询进行了物化。代价评估变成了15195038853.01。(PostgreSQL的默认设置，如果t2表的行低于100k，会将子查询进行hash)。这样就会严重影响性能。因此，对于那种子查询返回的行数很少的场景，IN子句可以起到很好的作用。

其它注意点

有的！在我们用不同的方式写查询的时候，可能有数据类型的转换。

比如，语句：

1	EXPLAIN ANALYZE SELECT * FROM emp WHERE gen = ANY (ARRAY[ 'M' , 'F' ]);

就会发生隐式的类型转换：

1 2	Seq Scan on emp (cost=0.00..1.04 rows =2 width=43) (actual time =0.023..0.026 rows =3 loops=1) Filter: ((gen)::text = ANY ( '{M,F}' ::text[]))

这里的(gen)::text就发生了类型转换。如果在大表上，这种类型转换的代价会很高，因此，PostgreSQL对IN子句做了更好的处理。

EXPLAIN ANALYZE SELECT * FROM emp WHERE gen IN ( 'M' , 'F' );

Seq Scan on emp (cost=0.00..1.04 rows =3 width=43) (actual time =0.030..0.034 rows =3 loops=1)

Filter: (gen = ANY ( '{M,F}' ::bpchar[]))

将IN子句转换成了ANY子句，没有对gen列进行类型转换。而是将M\F转成了bpchar(内部等价于char)

总结

简单来说，exists和直接join表通常比较好。

很多情况下，PostgreSQL将IN子句换成被hash的子计划。在一些特殊场景下，IN可以获得更好的执行计划。

以上为个人经验，希望能给大家一个参考，也希望大家多多支持服务器之家。如有错误或未考虑完全的地方，望不吝赐教。

原文链接：https://www.cnblogs.com/abclife/p/13940948.html

声明：本文来自网络，不代表【好得很程序员自学网】立场，转载请注明出处：http://www.haodehen.cn/did229579

更新时间：2023-05-14 阅读：38次