postgresql分页数据重复问题的深入理解

问题背景

许多开发和测试人员都可能遇到过列表的数据翻下一页的时候显示了上一页的数据，也就是翻页会有重复的数据。

如何处理？

这个问题出现的原因是因为选择的排序字段有重复，常见的处理办法就是排序的时候加上唯一字段，这样在分页的过程中数据就不会重复了。关于这个问题文档也有解释并非是一个bug。而是排序时需要选择唯一字段来做排序，不然返回的结果不确定

排序返回数据重复的根本原因是什么呢？

经常优化sql的同学可能会发现，执行计划里面会有Sort Method这个关键字，而这个关键字就是排序选择的方法。abase的排序分为三种

quicksort                       快速排序
top-N heapsort Memory          堆排序
external merge Disk            归并排序

推测

分页重复的问题和执行计划选择排序算法的稳定性有关。

简单介绍下这三种排序算法的场景：

在有索引的情况下：排序可以直接走索引。在没有索引的情况下：当表的数据量较小的时候选择快速排序（排序所需必须内存小于work_mem），当排序有limit，且耗费的内存不超过work_mem时选择堆排序，当work_mem不够时选择归并排序。

验证推测

1.创建表，初始化数据

								
									 abase=#   create   table   t_sort(n_int   int  ,c_id   varchar  (300)); 

									 CREATE   TABLE 

									 abase=#   insert   into   t_sort(n_int,c_id)   select   100,generate_series(1,9); 

									 INSERT   0 9 

									 abase=#   insert   into   t_sort(n_int,c_id)   select   200,generate_series(1,9); 

									 INSERT   0 9 

									 abase=#   insert   into   t_sort(n_int,c_id)   select   300,generate_series(1,9); 

									 INSERT   0 9 

									 abase=#   insert   into   t_sort(n_int,c_id)   select   400,generate_series(1,9); 

									 INSERT   0 9 

									 abase=#   insert   into   t_sort(n_int,c_id)   select   500,generate_series(1,9); 

									 INSERT   0 9 

									 abase=#   insert   into   t_sort(n_int,c_id)   select   600,generate_series(1,9); 

									 INSERT   0 9

三种排序

								
									 --快速排序 quicksort 

									 abase=# explain analyze   select   ctid,n_int,c_id   from   t_sort   order   by   n_int   asc  ; 

									               QUERY PLAN             

									 ------------------------------------------------------------ 

									    Sort (cost=3.09..3.23   rows  =54 width=12) (actual   time  =0.058..0.061   rows  =54 loops=1) 

									    Sort   Key  : n_int 

									    Sort Method: quicksort Memory: 27kB 

									    -> Seq Scan   on   t_sort (cost=0.00..1.54   rows  =54 width=12) (actual   time  =0.021..0.032   rows  =54 loops=1) 

									    Planning   time  : 0.161 ms 

									    Execution   time  : 0.104 ms 

									 (6   rows  ) 

									 --堆排序 top-N heapsort 

									 abase=# explain analyze   select   ctid,n_int,c_id   from   t_sort   order   by   n_int   asc   limit 10; 

									                QUERY PLAN              

									 ------------------------------------------------------------ 

									    Limit (cost=2.71..2.73   rows  =10 width=12) (actual   time  =0.066..0.068   rows  =10 loops=1) 

									    -> Sort (cost=2.71..2.84   rows  =54 width=12) (actual   time  =0.065..0.066   rows  =10 loops=1) 

									      Sort   Key  : n_int 

									      Sort Method:   top  -N heapsort Memory: 25kB 

									      -> Seq Scan   on   t_sort (cost=0.00..1.54   rows  =54 width=12) (actual   time  =0.022..0.031   rows  =54 loops=1 

									 ) 

									    Planning   time  : 0.215 ms 

									    Execution   time  : 0.124 ms 

									 (7   rows  ) 

									 --归并排序 external sort Disk 

									 --插入大量值为a的数据 

									 abase=#   insert   into   t_sort(n_int,c_id)   select   generate_series(1000,2000),  'a'  ; 

									 INSERT   0 1001 

									 abase=#   set   work_mem =   '64kB'  ; 

									 SET 

									 abase=# explain analyze   select   ctid,n_int,c_id   from   t_sort   order   by   n_int   asc  ; 

									                QUERY PLAN              

									 ------------------------------------------------------------- 

									    Sort (cost=18.60..19.28   rows  =270 width=12) (actual   time  =1.235..1.386   rows  =1055 loops=1) 

									    Sort   Key  : n_int 

									    Sort Method: external sort Disk: 32kB 

									    -> Seq Scan   on   t_sort (cost=0.00..7.70   rows  =270 width=12) (actual   time  =0.030..0.247   rows  =1055 loops=1) 

									    Planning   time  : 0.198 ms 

									    Execution   time  : 1.663 ms 

									 (6   rows  )

快速排序

								
									 --快速排序 

									 abase=# explain analyze   select   ctid,n_int,c_id   from   t_sort   order   by   n_int   asc  ; 

									               QUERY PLAN             

									 ------------------------------------------------------------ 

									    Sort (cost=3.09..3.23   rows  =54 width=12) (actual   time  =0.058..0.061   rows  =54 loops=1) 

									    Sort   Key  : n_int 

									    Sort Method: quicksort Memory: 27kB 

									    -> Seq Scan   on   t_sort (cost=0.00..1.54   rows  =54 width=12) (actual   time  =0.021..0.032   rows  =54 loops=1) 

									    Planning   time  : 0.161 ms 

									    Execution   time  : 0.104 ms 

									 (6   rows  )  

									 ? 

									 --获取前20条数据 

									    abase=#   select   ctid,n_int,c_id   from   t_sort   order   by   n_int   asc   limit 20; 

									     ctid | n_int | c_id  

									    --------+-------+------ 

									     (0,7) | 100 | 7 

									     (0,2) | 100 | 2 

									     (0,4) | 100 | 4 

									     (0,8) | 100 | 8 

									     (0,3) | 100 | 3 

									     (0,6) | 100 | 6 

									     (0,5) | 100 | 5 

									     (0,9) | 100 | 9 

									     (0,1) | 100 | 1 

									     (0,14) | 200 | 5 

									     (0,13) | 200 | 4 

									     (0,12) | 200 | 3 

									     (0,10) | 200 | 1 

									     (0,15) | 200 | 6 

									     (0,16) | 200 | 7 

									     (0,17) | 200 | 8 

									     (0,11) | 200 | 2 

									     (0,18) | 200 | 9 

									     (0,20) | 300 | 2 

									     (0,19) | 300 | 1 

									    (20   rows  )    --分页获取前10条数据 

									    abase=#   select   ctid,n_int,c_id   from   t_sort   order   by   n_int   asc   limit 10 offset 0; 

									     ctid | n_int | c_id  

									    --------+-------+------ 

									     (0,1) | 100 | 1 

									     (0,3) | 100 | 3 

									     (0,4) | 100 | 4 

									     (0,2) | 100 | 2 

									     (0,6) | 100 | 6 

									     (0,7) | 100 | 7 

									     (0,8) | 100 | 8 

									     (0,9) | 100 | 9 

									     (0,5) | 100 | 5 

									     (0,10) | 200 | 1 

									    (10   rows  ) 

									    --分页从第10条开始获取10条 

									    abase=#   select   ctid,n_int,c_id   from   t_sort   order   by   n_int   asc   limit 10 offset 10; 

									     ctid | n_int | c_id  

									    --------+-------+------ 

									     (0,13) | 200 | 4 

									     (0,12) | 200 | 3 

									     (0,10) | 200 | 1 

									     (0,15) | 200 | 6 

									     (0,16) | 200 | 7 

									     (0,17) | 200 | 8 

									     (0,11) | 200 | 2 

									     (0,18) | 200 | 9 

									     (0,20) | 300 | 2 

									     (0,19) | 300 | 1 

									    (10   rows  ) 

									 limit 10 offset 0,limit 10 offset 10，连续取两页数据

此处可以看到limit 10 offset 10结果中，第三条数据重复了第一页的最后一条： (0,10) | 200 | 1

并且n_int = 200 and c_id = 5这条数据被[遗漏]了。

堆排序

								
									 abase=#   select   count  (*)   from   t_sort; 

									    count 

									 ------- 

									    1055 

									 (1 row) 

									 --设置work_mem 4MB 

									 abase=# show work_mem ; 

									    work_mem  

									 ---------- 

									    4MB 

									 (1 row) 

									 ? 

									 --top-N heapsort 

									 abase=# explain analyze   select   *   from   (   select   ctid,n_int,c_id   from   test   order   by   n_int   asc   limit 1001 offset 0) td limit 10; 

									                 QUERY PLAN            

									 ------------------------------------------------------------------------------------------------------------- 

									    Limit (cost=2061.33..2061.45   rows  =10 width=13) (actual   time  =15.247..15.251   rows  =10 loops=1) 

									    -> Limit (cost=2061.33..2063.83   rows  =1001 width=13) (actual   time  =15.245..15.247   rows  =10 loops=1) 

									      -> Sort (cost=2061.33..2135.72   rows  =29757 width=13) (actual   time  =15.244..15.245   rows  =10 loops=1) 

									       Sort   Key  : test.n_int 

									       Sort Method:   top  -N heapsort Memory: 95kB 

									       -> Seq Scan   on   test (cost=0.00..429.57   rows  =29757 width=13) (actual   time  =0.042..7.627   rows  =2 

									 9757 loops=1) 

									    Planning   time  : 0.376 ms 

									    Execution   time  : 15.415 ms 

									 (8   rows  ) 

									 ? 

									 --获取limit 1001 offset 0，然后取10前10条数据 

									 abase=#   select   *   from   (   select   ctid,n_int,c_id   from   test   order   by   n_int   asc   limit 1001 offset 0) td limit 10; 

									    ctid | n_int | c_id  

									 ----------+-------+------ 

									    (0,6) | 100 | 6 

									    (0,2) | 100 | 2 

									    (0,5) | 100 | 5 

									    (87,195) | 100 | 888 

									    (0,3) | 100 | 3 

									    (0,1) | 100 | 1 

									    (0,8) | 100 | 8 

									    (0,55) | 100 | 888 

									    (44,12) | 100 | 888 

									    (0,9) | 100 | 9 

									 (10   rows  ) 

									 ?  ---获取limit 1001 offset 1，然后取10前10条数据 

									 abase=#   select   *   from   (   select   ctid,n_int,c_id   from   test   order   by   n_int   asc   limit 1001 offset 1) td limit 10; 

									    ctid | n_int | c_id  

									 ----------+-------+------ 

									    (44,12) | 100 | 888 

									    (0,8) | 100 | 8 

									    (0,1) | 100 | 1 

									    (0,5) | 100 | 5 

									    (0,9) | 100 | 9 

									    (87,195) | 100 | 888 

									    (0,7) | 100 | 7 

									    (0,6) | 100 | 6 

									    (0,3) | 100 | 3 

									    (0,4) | 100 | 4 

									 (10   rows  ) 

									 ---获取limit 1001 offset 2，然后取10前10条数据 

									 abase=#   select   *   from   (   select   ctid,n_int,c_id   from   test   order   by   n_int   asc   limit 1001 offset 2) td limit 10; 

									    ctid | n_int | c_id  

									 ----------+-------+------ 

									    (0,5) | 100 | 5 

									    (0,55) | 100 | 888 

									    (0,1) | 100 | 1 

									    (0,9) | 100 | 9 

									    (0,2) | 100 | 2 

									    (0,3) | 100 | 3 

									    (44,12) | 100 | 888 

									    (0,7) | 100 | 7 

									    (87,195) | 100 | 888 

									    (0,4) | 100 | 4 

									 (10   rows  )

堆排序使用内存： Sort Method: top-N heapsort Memory: 95kB

当offset从0变成1后，以及变成2后，会发现查询出的10条数据不是有顺序的。

归并排序

								
									 --将work_mem设置为64kb让其走归并排序。 

									 abase=#   set   work_mem =  '64kB'  ; 

									 SET 

									 abase=# show work_mem; 

									    work_mem  

									 ---------- 

									    64kB 

									 (1 row) 

									 ? 

									 -- external merge Disk 

									 abase=# explain analyze   select   *   from   (   select   ctid,n_int,c_id   from   test   order   by   n_int   asc   limit 1001 offset 0) td limit 10; 

									                 QUERY PLAN                

									 --------------------------------------------------------------------------------------------------------------------------- 

									    Limit (cost=2061.33..2061.45   rows  =10 width=13) (actual   time  =27.912..27.916   rows  =10 loops=1) 

									    -> Limit (cost=2061.33..2063.83   rows  =1001 width=13) (actual   time  =27.910..27.913   rows  =10 loops=1) 

									      -> Sort (cost=2061.33..2135.72   rows  =29757 width=13) (actual   time  =27.909..27.911   rows  =10 loops=1) 

									       Sort   Key  : test.n_int 

									       Sort Method: external merge Disk: 784kB 

									       -> Seq Scan   on   test (cost=0.00..429.57   rows  =29757 width=13) (actual   time  =0.024..6.730   rows  =29757 loops=1) 

									    Planning   time  : 0.218 ms 

									    Execution   time  : 28.358 ms 

									 (8   rows  ) 

									 ?  --同堆排序一样，获取limit 1001 offset 0，然后取10前10条数据 

									 abase=#   select   *   from   (   select   ctid,n_int,c_id   from   test   order   by   n_int   asc   limit 1001 offset 0) td limit 10; 

									    ctid | n_int | c_id  

									 --------+-------+------ 

									    (0,1) | 100 | 1 

									    (0,2) | 100 | 2 

									    (0,4) | 100 | 4 

									    (0,8) | 100 | 8 

									    (0,9) | 100 | 9 

									    (0,5) | 100 | 5 

									    (0,3) | 100 | 3 

									    (0,6) | 100 | 6 

									    (0,55) | 100 | 888 

									    (0,7) | 100 | 7 

									 (10   rows  ) 

									 --同堆排序一样，获取limit 1001 offset 1，然后取10前10条数据 

									 abase=#   select   *   from   (   select   ctid,n_int,c_id   from   test   order   by   n_int   asc   limit 1001 offset 1) td limit 10; 

									    ctid | n_int | c_id  

									 ----------+-------+------ 

									    (0,2) | 100 | 2 

									    (0,4) | 100 | 4 

									    (0,8) | 100 | 8 

									    (0,9) | 100 | 9 

									    (0,5) | 100 | 5 

									    (0,3) | 100 | 3 

									    (0,6) | 100 | 6 

									    (0,55) | 100 | 888 

									    (0,7) | 100 | 7 

									    (87,195) | 100 | 888 

									 (10   rows  ) 

									 --同堆排序一样，获取limit 1001 offset 2，然后取10前10条数据 

									 abase=#   select   *   from   (   select   ctid,n_int,c_id   from   test   order   by   n_int   asc   limit 1001 offset 2) td limit 10; 

									    ctid | n_int | c_id  

									 ----------+-------+------ 

									    (0,4) | 100 | 4 

									    (0,8) | 100 | 8 

									    (0,9) | 100 | 9 

									    (0,5) | 100 | 5 

									    (0,3) | 100 | 3 

									    (0,6) | 100 | 6 

									    (0,55) | 100 | 888 

									    (0,7) | 100 | 7 

									    (87,195) | 100 | 888 

									    (44,12) | 100 | 888 

									 (10   rows  )

减小work_mem使用归并排序的时候，offset从0变成1后以及变成2后，任然有序。

还有一种情况，那就是在查询前面几页的时候会有重复，但是越往后面翻就不会重复了，现在也可以解释清楚。

如果每页10条数据，当offse较小的时候使用的内存较少。当offse不断增大，所耗费的内存也就越多。

								
									 --设置work_mem =64kb 

									 abase=# show work_mem; 

									    work_mem  

									 ---------- 

									    64kB 

									 (1 row) 

									 --查询limit 10 offset 10 

									 abase=# explain analyze   select   *   from   (   select   ctid,n_int,c_id   from   test   order   by   n_int   asc   limit 10 offset 10) td limit 10; 

									                 QUERY PLAN                

									 --------------------------------------------------------------------------------------------------------------------------- 

									    Limit (cost=1221.42..1221.54   rows  =10 width=13) (actual   time  =12.881..12.884   rows  =10 loops=1) 

									    -> Limit (cost=1221.42..1221.44   rows  =10 width=13) (actual   time  =12.879..12.881   rows  =10 loops=1) 

									      -> Sort (cost=1221.39..1295.79   rows  =29757 width=13) (actual   time  =12.877..12.879   rows  =20 loops=1) 

									       Sort   Key  : test.n_int 

									       Sort Method:   top  -N heapsort Memory: 25kB 

									       -> Seq Scan   on   test (cost=0.00..429.57   rows  =29757 width=13) (actual   time  =0.058..6.363   rows  =29757 loops=1) 

									    Planning   time  : 0.230 ms 

									    Execution   time  : 12.976 ms 

									 (8   rows  ) 

									 ? 

									 --查询limit 10 offset 1000 

									 abase=# explain analyze   select   *   from   (   select   ctid,n_int,c_id   from   test   order   by   n_int   asc   limit 10 offset 1000) td limit 10; 

									                 QUERY PLAN                

									 --------------------------------------------------------------------------------------------------------------------------- 

									    Limit (cost=2065.75..2065.88   rows  =10 width=13) (actual   time  =27.188..27.192   rows  =10 loops=1) 

									    -> Limit (cost=2065.75..2065.78   rows  =10 width=13) (actual   time  =27.186..27.188   rows  =10 loops=1) 

									      -> Sort (cost=2063.25..2137.64   rows  =29757 width=13) (actual   time  =26.940..27.138   rows  =1010 loops=1) 

									       Sort   Key  : test.n_int 

									       Sort Method: external merge Disk: 784kB 

									       -> Seq Scan   on   test (cost=0.00..429.57   rows  =29757 width=13) (actual   time  =0.026..6.374   rows  =29757 loops=1) 

									    Planning   time  : 0.207 ms 

									    Execution   time  : 27.718 ms 

									 (8   rows  )

可以看到当offset从10增加到1000的时候，使用的内存增加，排序的方法从堆排序变成了归并排序。而归并排序为稳定排序，所以后面的分页不会再有后一页出现前一页数据的情况。

参考资料: PostgreSQL - repeating rows from LIMIT OFFSET

参考资料: LIMIT and OFFSET

结语

1.关于分页重复数据的问题主要是排序字段不唯一并且执行计划走了快速排序和堆排序导致。

2.当排序有重复字段，但是如果查询是归并排序，便不会存在有重复数据的问题。

3.当用重复字段排序，前面的页重复，随着offset的增大导致work_mem不足以后使用归并排序，就不存在重复的数据了。

4.排序和算法的稳定性有关，当执行计划选择不同的排序算法时，返回的结果不一样。

5.处理重复数据的常见手段就是，排序的时候可以在排序字段d_larq(立案日期)后面加上c_bh(唯一字段)来排序。

order by d_larq,c_bh;

总结

以上就是这篇文章的全部内容了，希望本文的内容对大家的学习或者工作具有一定的参考学习价值，谢谢大家对服务器之家的支持。

原文链接：http://HdhCmsTestcnblogs测试数据/zhangfx01/p/10613862.html

查看更多关于postgresql分页数据重复问题的深入理解的详细内容...

声明：本文来自网络，不代表【好得很程序员自学网】立场，转载请注明出处：http://www.haodehen.cn/did229772

更新时间：2023-05-14 阅读：93次