postgreSQL 非count方法算记录数操作

一般方法

select count (1) from table_name;

全量扫描一遍表，记录越多，查询速度越慢

新法

PostgreSQL 还真提供了一个这样的途径，那就是系统表 pg_class，这个系统表里头，存储着每个表的统计信息，其中 reltuples 就是对应的表的统计行,统计行的数据是pg有个独立进程，定期扫描不同的表，收集这些表的统计信息，保存在系统表里头。

方法如下：

select

reltuples:: int as total

from

pg_class

where

relname = 'table_name'

and relnamespace = ( select oid from pg_namespace where nspname = 'schema' );

新方法不是通用的，如果要求特精确还是使用select count(1),如果是类似分页的，且分页数量超过比较多的，也不是要求特别精准的，这就是一个好方法！

count(1) over 计算记录数

select count(1) over(), * from table_name;

补充

count 是最常用的聚集函数之一，看似简单，其实还是有坑的，如：

1、count(*)：返回结果集的行数，是null也统计

2、count(1)：和count(*)基本没区别，pg92之前都是扫描全表的，pg92之后增加了index only scan一般会变成扫主键索引，如果没有主键或者是表的列很多的情况下，count(1)快一些，因为不会考虑表的全部字段

3、count(field)：返回数据表中指定字段值不等于null的行数

拓展：理解 PostgreSQL 的 count 函数的行为

关于 count 函数的使用一直存在争议，尤其是在 MySQL 中，作为流行度越来越高的 PostgreSQL 是否也有类似的问题呢，我们通过实践来理解一下 PostgreSQL 中 count 函数的行为。

构建测试数据库

创建测试数据库，并创建测试表。测试表中有自增 ID、创建时间、内容三个字段，自增 ID 字段是主键。

create database performance_test;

create table test_tbl (id serial primary key, created_at timestamp, content varchar(512));

生成测试数据

使用 generate_series 函数生成自增 ID，使用 now() 函数生成 created_at 列，对于 content 列，使用了 repeat(md5(random()::text), 10) 生成 10 个 32 位长度的 md5 字符串。使用下列语句，插入 1000w 条记录用于测试。

performance_test=# insert into test_tbl select generate_series(1,10000000),now(),repeat(md5(random()::text),10); INSERT 0 10000000 Time: 212184.223 ms (03:32.184)

由 count 语句引发的思考

默认情况下 PostgreSQL 不开启 SQL 执行时间的显示，所以需要手动开启一下，方便后面的测试对比。

\timing on

count(*) 和 count(1) 的性能区别是经常被讨论的问题，分别使用 count(*) 和 count(1) 执行一次查询。

								
									 performance_test=#   select   count  (*)   from   test_tbl; 

									    count 

									 ---------- 

									    10000000 

									 (1 row) 

									 Time  : 115090.380 ms (01:55.090) 

									 performance_test=#   select   count  (1)   from   test_tbl; 

									    count 

									 ---------- 

									    10000000 

									 (1 row) 

									 Time  : 738.502 ms

可以看到两次查询的速度差别非常大，count(1) 真的有这么大的性能提升？接下来再次运行查询语句。

								
									 performance_test=#   select   count  (*)   from   test_tbl; 

									    count 

									 ---------- 

									    10000000 

									 (1 row) 

									 Time  : 657.831 ms 

									 performance_test=#   select   count  (1)   from   test_tbl; 

									    count 

									 ---------- 

									    10000000 

									 (1 row) 

									 Time  : 682.157 ms

可以看到第一次查询时候会非常的慢，后面三次速度非常快并且时间相近，这里就有两个问题出现了：

为什么第一次查询速度这么慢？

count(*) 和 count(1) 到底存不存在性能差别？

查询缓存

使用 explain 语句重新执行查询语句

explain (analyze,buffers,verbose) select count(*) from test_tbl;

可以看到如下输出：

								
									 Finalize Aggregate (cost=529273.69..529273.70   rows  =1 width=8) (actual   time  =882.569..882.570   rows  =1 loops=1) 

									     Output  :   count  (*) 

									     Buffers: shared hit=96   read  =476095 

									     -> Gather (cost=529273.48..529273.69   rows  =2 width=8) (actual   time  =882.492..884.170   rows  =3 loops=1) 

									        Output  : (  PARTIAL   count  (*)) 

									        Workers Planned: 2 

									        Workers Launched: 2 

									        Buffers: shared hit=96   read  =476095 

									        ->   Partial   Aggregate (cost=528273.48..528273.49   rows  =1 width=8) (actual   time  =881.014..881.014   rows  =1 loops=3) 

									           Output  :   PARTIAL   count  (*) 

									           Buffers: shared hit=96   read  =476095 

									           Worker 0: actual   time  =880.319..880.319   rows  =1 loops=1 

									            Buffers: shared hit=34   read  =158206 

									           Worker 1: actual   time  =880.369..880.369   rows  =1 loops=1 

									            Buffers: shared hit=29   read  =156424 

									           -> Parallel Seq Scan   on   public  .test_tbl (cost=0.00..517856.98   rows  =4166598 width=0) (actual   time  =0.029..662.165   rows  =3333333 loops=3) 

									              Buffers: shared hit=96   read  =476095 

									              Worker 0: actual   time  =0.026..661.807   rows  =3323029 loops=1 

									               Buffers: shared hit=34   read  =158206 

									              Worker 1: actual   time  =0.030..660.197   rows  =3285513 loops=1 

									               Buffers: shared hit=29   read  =156424 

									    Planning   time  : 0.043 ms 

									    Execution   time  : 884.207 ms

注意里面的 shared hit，表示命中了内存中缓存的数据，这就可以解释为什么后面的查询会比第一次快很多。接下来去掉缓存，并重启 PostgreSQL。

service postgresql stop

echo 1 > /proc/sys/vm/drop_caches

service postgresql start

重新执行 SQL 语句，速度慢了很多。

								
									 Finalize Aggregate (cost=529273.69..529273.70   rows  =1 width=8) (actual   time  =50604.564..50604.564   rows  =1 loops=1) 

									    Output  :   count  (*) 

									    Buffers: shared   read  =476191 

									    -> Gather (cost=529273.48..529273.69   rows  =2 width=8) (actual   time  =50604.508..50606.141   rows  =3 loops=1) 

									       Output  : (  PARTIAL   count  (*)) 

									       Workers Planned: 2 

									       Workers Launched: 2 

									       Buffers: shared   read  =476191 

									       ->   Partial   Aggregate (cost=528273.48..528273.49   rows  =1 width=8) (actual   time  =50591.550..50591.551   rows  =1 loops=3) 

									          Output  :   PARTIAL   count  (*) 

									          Buffers: shared   read  =476191 

									          Worker 0: actual   time  =50585.182..50585.182   rows  =1 loops=1 

									           Buffers: shared   read  =158122 

									          Worker 1: actual   time  =50585.181..50585.181   rows  =1 loops=1 

									           Buffers: shared   read  =161123 

									          -> Parallel Seq Scan   on   public  .test_tbl (cost=0.00..517856.98   rows  =4166598 width=0) (actual   time  =92.491..50369.691   rows  =3333333 loops=3) 

									             Buffers: shared   read  =476191 

									             Worker 0: actual   time  =122.170..50362.271   rows  =3320562 loops=1 

									              Buffers: shared   read  =158122 

									             Worker 1: actual   time  =14.020..50359.733   rows  =3383583 loops=1 

									              Buffers: shared   read  =161123 

									 Planning   time  : 11.537 ms 

									 Execution   time  : 50606.215 ms

shared read 表示没有命中缓存，通过这个现象可以推断出，上一小节的四次查询中，第一次查询没有命中缓存，剩下三次查询都命中了缓存。

count(1) 和 count(*) 的区别

接下来探究 count(1) 和 count(*) 的区别是什么，继续思考最开始的四次查询，第一次查询使用了 count(*)，第二次查询使用了 count(1) ，却依然命中了缓存，不正是说明 count(1) 和 count(*) 是一样的吗？

事实上，PostgreSQL 官方对于 is there a difference performance-wise between select count(1) and select count(*)? 问题的回复也证实了这一点：

Nope. In fact, the latter is converted to the former during parsing.[2]

既然 count(1) 在性能上没有比 count(*) 更好，那么使用 count(*) 就是更好的选择。

sequence scan 和 index scan

接下来测试一下，在不同数据量大小的情况下 count(*) 的速度，将查询语句写在 count.sql 文件中，使用 pgbench 进行测试。

pgbench -c 5 -t 20 performance_test -r -f count.sql

分别测试 200w - 1000w 数据量下的 count 语句耗时

数据大小 count耗时(ms)

200w	738.758
300w	1035.846
400w	1426.183
500w	1799.866
600w	2117.247
700w	2514.691
800w	2526.441
900w	2568.240
1000w	2650.434

绘制成耗时曲线

曲线的趋势在 600w - 700w 数据量之间出现了转折，200w - 600w 是线性增长，600w 之后 count 的耗时就基本相同了。使用 explain 语句分别查看 600w 和 700w 数据时的 count 语句执行。

700w：

								
									 Finalize Aggregate (cost=502185.93..502185.94   rows  =1 width=8) (actual   time  =894.361..894.361   rows  =1 loops=1) 

									     Output  :   count  (*) 

									     Buffers: shared hit=16344   read  =352463 

									     -> Gather (cost=502185.72..502185.93   rows  =2 width=8) (actual   time  =894.232..899.763   rows  =3 loops=1) 

									        Output  : (  PARTIAL   count  (*)) 

									        Workers Planned: 2 

									        Workers Launched: 2 

									        Buffers: shared hit=16344   read  =352463 

									        ->   Partial   Aggregate (cost=501185.72..501185.73   rows  =1 width=8) (actual   time  =889.371..889.371   rows  =1 loops=3) 

									           Output  :   PARTIAL   count  (*) 

									           Buffers: shared hit=16344   read  =352463 

									           Worker 0: actual   time  =887.112..887.112   rows  =1 loops=1 

									            Buffers: shared hit=5459   read  =118070 

									           Worker 1: actual   time  =887.120..887.120   rows  =1 loops=1 

									            Buffers: shared hit=5601   read  =117051 

									           -> Parallel   Index   Only   Scan using test_tbl_pkey   on   public  .test_tbl (cost=0.43..493863.32   rows  =2928960 width=0) (actual   time  =0.112..736.376   rows  =2333333 loops=3) 

									              Index   Cond: (test_tbl.id < 7000000) 

									              Heap Fetches: 2328492 

									              Buffers: shared hit=16344   read  =352463 

									              Worker 0: actual   time  =0.107..737.180   rows  =2344479 loops=1 

									               Buffers: shared hit=5459   read  =118070 

									              Worker 1: actual   time  =0.133..737.960   rows  =2327028 loops=1 

									               Buffers: shared hit=5601   read  =117051 

									    Planning   time  : 0.165 ms 

									    Execution   time  : 899.857 ms

600w：

								
									 Finalize Aggregate (cost=429990.94..429990.95   rows  =1 width=8) (actual   time  =765.575..765.575   rows  =1 loops=1) 

									     Output  :   count  (*) 

									     Buffers: shared hit=13999   read  =302112 

									     -> Gather (cost=429990.72..429990.93   rows  =2 width=8) (actual   time  =765.557..770.889   rows  =3 loops=1) 

									        Output  : (  PARTIAL   count  (*)) 

									        Workers Planned: 2 

									        Workers Launched: 2 

									        Buffers: shared hit=13999   read  =302112 

									        ->   Partial   Aggregate (cost=428990.72..428990.73   rows  =1 width=8) (actual   time  =763.821..763.821   rows  =1 loops=3) 

									           Output  :   PARTIAL   count  (*) 

									           Buffers: shared hit=13999   read  =302112 

									           Worker 0: actual   time  =762.742..762.742   rows  =1 loops=1 

									            Buffers: shared hit=4638   read  =98875 

									           Worker 1: actual   time  =763.308..763.308   rows  =1 loops=1 

									            Buffers: shared hit=4696   read  =101570 

									           -> Parallel   Index   Only   Scan using test_tbl_pkey   on   public  .test_tbl (cost=0.43..422723.16   rows  =2507026 width=0) (actual   time  =0.053..632.199   rows  =2000000 loops=3) 

									              Index   Cond: (test_tbl.id < 6000000) 

									              Heap Fetches: 2018490 

									              Buffers: shared hit=13999   read  =302112 

									              Worker 0: actual   time  =0.059..633.156   rows  =1964483 loops=1 

									               Buffers: shared hit=4638   read  =98875 

									              Worker 1: actual   time  =0.038..634.271   rows  =2017026 loops=1 

									               Buffers: shared hit=4696   read  =101570 

									    Planning   time  : 0.055 ms 

									    Execution   time  : 770.921 ms

根据以上现象推断，PostgreSQL 似乎在 count 的数据量小于数据表长度的某一比例时，才使用 index scan，通过查看官方 wiki 也可以看到相关描述：

It is important to realise that the planner is concerned with minimising the total cost of the query. With databases, the cost of I/O typically dominates. For that reason, "count(*) without any predicate" queries will only use an index-only scan if the index is significantly smaller than its table. This typically only happens when the table's row width is much wider than some indexes'.[3]

根据 Stackoverflow 上的回答，count 语句查询的数量大于表大小的 3/4 时候就会用使用全表扫描代替索引扫描[4]。

结论

不要用 count(1) 或 count(列名) 代替 count(*)

count 本身是非常耗时的

count 可能是 index scan 也可能是 sequence scan，取决于 count 数量占表大小的比例

以上为个人经验，希望能给大家一个参考，也希望大家多多支持服务器之家。如有错误或未考虑完全的地方，望不吝赐教。

原文链接：https://blog.csdn.net/u011944141/article/details/52485132

查看更多关于postgreSQL 非count方法算记录数操作的详细内容...

声明：本文来自网络，不代表【好得很程序员自学网】立场，转载请注明出处：http://www.haodehen.cn/did229736

更新时间：2023-05-14 阅读：74次