PostgreSQL的中文拼音排序案例

前一段时间开发人员咨询，说postgresql里面想根据一个字段做中文的拼音排序，但是不得其解

环境：

OS：CentOS 6.3

DB： PostgreSQL 9.2.4

TABLE: tbl_kenyon

场景：

postgres=# \d tbl_kenyon

Table "public.tbl_kenyon"

Column | Type | Modifiers

--------+------+---------------

vname | text |

--使用排序后的结果，不是很理想

postgres=# select vname from tbl_kenyon order by vname;

vname

-------

上海

北京

杭州

浙江

(4 rows )

说明：

postgresql的排序除了受到数据库的编码影响外，还有一个初始化参数是locale也会影响(initdb)，,通常我的选择是C，这可以让postgres数据库通过strcmp()这个函数来比较字符串，而不是strcoll()函数。

这个参数可以在数据库里查看，如

								
									 postgres=# \l 

									                   List   of   databases 

									      Name      | Owner  | Encoding |   Collate   | Ctype |  Access   privileges 

									 -----------------+----------+----------+---------+-------+----------------------- 

									    dkenyon     | u_kenyon | UTF8   | C    | C   |  

									    postgres    | postgres | UTF8   | C    | C   |  

									    template0    | postgres | UTF8   | C    | C   | =c/postgres     + 

									            |     |     |     |    | postgres=CTc/postgres 

									    template1    | postgres | UTF8   | C    | C   | =c/postgres     + 

									            |     |     |     |    | postgres=CTc/postgres 

									 (6   rows  )

--简体中文在系统表里的支持

								
									 postgres=#   select   collname,collcollate,collctype,b.nspname,c.rolname   as   collowner  

									 postgres-#   from   pg_collation a,pg_namespace b,pg_authid c  

									 postgres-#   where   a.collnamespace = b.oid   and   a.collowner = c.oid   and   lower  (collname)   like   '%zh_cn%'  ; 

									     collname  | collcollate | collctype  | nspname  | collowner  

									 --------------+--------------+--------------+------------+----------- 

									    zh_CN    | zh_CN    | zh_CN    | pg_catalog | postgres 

									    zh_CN    | zh_CN.utf8  | zh_CN.utf8  | pg_catalog | postgres 

									    zh_CN.gb2312 | zh_CN.gb2312 | zh_CN.gb2312 | pg_catalog | postgres 

									    zh_CN.utf8  | zh_CN.utf8  | zh_CN.utf8  | pg_catalog | postgres 

									 (4   rows  )

因为初始化时选择的locale是C，所以数据库的默认排序也是C，要想字段内容按照中文拼音排序，需要将UTF8格式存储的内容转换为GBK方式。

解决办法：

1.转换字段的方式,加个convert_to前缀函数

								
									 postgres=#   select   vname   from   tbl_kenyon   order   by   convert_to(vname,  'GBK'  ); 

									    vname  

									 ------- 

									    北京 

									    杭州 

									    上海 

									    浙江 

									 (4   rows  )

--convert_to函数输入参数是text形式,输出编码是bytea形式，是将字符转换为目标编码的函数，如

postgres=# select convert_to( '浙江' , 'UTF8' ),( '浙江' , 'GBK' );

convert_to | row

----------------+------------

\xe6b599e6b19f | (浙江,GBK)

(1 row)

2.列指定zh_cn的方式存储

								
									 postgres=#   alter   table   tbl_kenyon   add   cname text   collate   "zh_CN"  ; 

									 ALTER   TABLE 

									 postgres=# \d tbl_kenyon  

									     Table   "public.tbl_kenyon" 

									    Column   | Type |  Modifiers   

									 --------+------+--------------- 

									    vname | text |  

									    cname | text |   collate   zh_CN 

									 postgres=#   select   *   from   tbl_kenyon; 

									    vname | cname  

									 -------+------- 

									    浙江 | 浙江 

									    杭州 | 杭州 

									    上海 | 上海 

									    北京 | 北京 

									 (4   rows  ) 

									 postgres=#   select   *   from   tbl_kenyon   order   by   vname; 

									    vname | cname  

									 -------+------- 

									    上海 | 上海 

									    北京 | 北京 

									    杭州 | 杭州 

									    浙江 | 浙江 

									 (4   rows  ) 

									 postgres=#   select   *   from   tbl_kenyon   order   by   cname; 

									    vname | cname  

									 -------+------- 

									    北京 | 北京 

									    杭州 | 杭州 

									    上海 | 上海 

									    浙江 | 浙江 

									 (4   rows  )

3.查询时指定collate

								
									 postgres=#   select   *   from   tbl_kenyon   order   by   vname   collate   "C"  ; 

									    vname | cname  

									 -------+------- 

									    上海 | 上海 

									    北京 | 北京 

									    杭州 | 杭州 

									    浙江 | 浙江 

									 (4   rows  ) 

									 postgres=#   select   *   from   tbl_kenyon   order   by   vname   collate   "zh_CN"  ; 

									    vname | cname  

									 -------+------- 

									    北京 | 北京 

									    杭州 | 杭州 

									    上海 | 上海 

									    浙江 | 浙江 

									 (4   rows  )

其他问题：

1.在用了方法一的convert_to函数转换一段时间后，开发告诉我说有异常,报错 character with byte sequence 0xc2 0xae in encoding "UTF8" has no equivalent in encoding "GBK"

1 2	Error querying database . Cause: org.postgresql.util.PSQLException: ERROR: character with byte sequence 0xc2 0xae in encoding "UTF8" has no equivalent in encoding "GBK"

排查了一下，发现数据库里存了一些比较奇怪的字符导致的，比如Mircle® city,niwhite®town。后对该表重建了一下，用方法二解决,所以convert_to函数使用对一些奇怪的字符转换时需要注意。

2.对于多音字，仍然会产生一定的歧义，比如重庆，会按Z去排序

上述办法能满足大部分汉字的拼音排序，但仍有一些不足。比较理想的解决办法是对这类基础数据录入时就指定拼音规则，或者数据库里存一份数据的拼音字典来关联使用。

其他：

使用zh_cn存储时测试字段大小，未测试取值速度

								
									 postgres=#   insert   into   tbl_kenyon   select   repeat(  '浙江GDOOASASHOME爱你'  ,5000), repeat(  '浙江GDOOASASHOME爱你'  ,5000) ; 

									 INSERT   0 1 

									 postgres=#   insert   into   tbl_kenyon   select   repeat(  '浙江GDOOASASHOME爱你'  ,50000), repeat(  '浙江GDOOASASHOME爱你'  ,50000) ; 

									 INSERT   0 1 

									 postgres=#   insert   into   tbl_kenyon   select   repeat(  '浙江GDOOASASHOME爱你'  ,100000), repeat(  '浙江GDOOASASHOME爱你'  ,100000) ; 

									 INSERT   0 1 

									 postgres=#   select   pg_column_size(cname),pg_column_size(vname)   from   tbl_kenyon ; 

									    pg_column_size | pg_column_size  

									 ----------------+---------------- 

									         1410 |      1406 

									        13769 |     13769 

									        27506 |     27506 

									 (3   rows  )

存储差异并不大

补充

1 2	#高版本可能不支持,或者语法不对？ select * from store order by storename collate 'zh_CN' ;

以上为个人经验，希望能给大家一个参考，也希望大家多多支持服务器之家。如有错误或未考虑完全的地方，望不吝赐教。

原文链接：https://my.oschina.net/Kenyon/blog/183063

查看更多关于PostgreSQL的中文拼音排序案例的详细内容...

声明：本文来自网络，不代表【好得很程序员自学网】立场，转载请注明出处：http://www.haodehen.cn/did229573

更新时间：2023-05-14 阅读：98次