Lucene Code Reading Guide and Test Examples

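Reading guide

Lucene 原理与代码分析完整版 (Lucene Principles and Code Analysis, complete edition): highly recommended
Lucene introduction and source code analysis: http://javenstudio.org/blog/annotated-lucene (focuses on the core IndexWriter)

Download Annotated+Lucene+.pdf: http://ishare.iask.sina.com.cn/f/24103589.html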

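Reading steps:

1. Understand the basic principles and concepts of information retrieval.

2. Understand Lucene's basic concepts.

3. Become familiar with Lucene's index file format. This is the key step.

4. Become familiar with Lucene's indexing flow. The actual code has many class layers and introduces unnecessary design patterns, which makes it relatively hard to read. The basic idea: a controller + model structure wraps the indexing chain and allows concurrent multithreaded processing (no data is shared between threads).

5. Become familiar with Lucene's search flow.

6. Understand Lucene's query syntax parser and become familiar with tokenization.

The recommended material analyzes Lucene's source code in depth and is very valuable. Reading the documents alone is not concrete enough; after skimming them, read the source alongside them to understand their content. The code gives the reader the general concepts, but the documents' explanations of source-level detail can easily leave the reader seeing the leaves but not the forest, which makes them hard to follow. Following the general approach laid out by the documents' author while reading the actual source makes things much easier.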
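Step 1 above asks the reader to understand basic retrieval principles. As a rough illustration only, the sketch below builds a toy inverted index in plain Java: each term maps to the set of document ids that contain it, and a query is answered by intersecting those sets. The class name InvertedIndexSketch and its methods are hypothetical; this is not how Lucene stores its index on disk, only the underlying idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index (hypothetical, not Lucene code): term -> sorted ids of documents containing it. */
final class InvertedIndexSketch {
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    /** Lower-cases the text, splits it on non-letter/non-digit characters, and records each term for docId. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of documents that contain every query term (boolean AND), in ascending order. */
    List<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : query.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) {
                continue;
            }
            TreeSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term: start from its posting list
            } else {
                result.retainAll(docs);         // later terms: intersect
            }
        }
        if (result == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "This is the text to be indexed.");
        index.add(1, "Another text about Lucene.");
        System.out.println(index.search("text"));
        System.out.println(index.search("text lucene"));
    }
}
```

Lucene's real index adds term frequencies, positions, segment files, and merging on top of this idea, which is why step 3 (the index file format) matters so much.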

Tests

Tests are extremely helpful for understanding how Lucene works and how its code executes; they are an important aid when reading the code.

IndexerExample.java

/*
 * Compile: javac -classpath .:lucene-core-2.9.1.jar:ChineseSegmenter/chineseSegmenter.jar IndexerExample.java
 * Run    : java  -classpath .:lucene-core-2.9.1.jar:ChineseSegmenter/chineseSegmenter.jar IndexerExample
 */

import java.io.File;
import java.io.FileReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.cn.ChineseAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexerExample {

    private static void EnExample() throws Exception {
        // Store the index on disk
        Directory directory = FSDirectory.getDirectory("/tmp/testindex");
        // Use the standard analyzer
        Analyzer analyzer = new StandardAnalyzer();
        // Create the IndexWriter; "true" creates a new index, overwriting any existing one
        IndexWriter iwriter = new IndexWriter(directory, analyzer, true);
        iwriter.setMaxFieldLength(25000);
        // Make a new, empty document
        Document doc = new Document();
        // /tmp/test.txt must exist: its contents are indexed below
        File f = new File("/tmp/test.txt");

        // Add the path of the file as a field named "path".  Use a field that is
        // indexed (i.e. searchable), but don't tokenize the field into words.
        doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));

        String text = "This is the text to be indexed.";
        doc.add(new Field("fieldname", text, Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("name", text, Field.Store.YES, Field.Index.TOKENIZED));

        // Add the last modified date of the file as a field named "modified".  Use
        // a field that is indexed (i.e. searchable), but don't tokenize the field
        // into words.
        doc.add(new Field("modified",
                DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        // Add the contents of the file to a field named "contents".  Specify a Reader,
        // so that the text of the file is tokenized and indexed, but not stored.
        // Note that FileReader expects the file to be in the system's default encoding.
        // If that's not the case searching for special characters will fail.
        doc.add(new Field("contents", new FileReader(f)));

        iwriter.addDocument(doc);
        iwriter.optimize();
        iwriter.close();
    }

    private static void CnExample() throws Exception {
        // Store the index on disk
        Directory directory = FSDirectory.getDirectory("/tmp/testindex");
        // Use the Chinese analyzer for the "name" field only;
        // every other field falls back to the whitespace analyzer
        Analyzer analyzer = new ChineseAnalyzer();
        PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new WhitespaceAnalyzer());
        wrapper.addAnalyzer("name", analyzer);

        // Create the IndexWriter; "true" creates a new index, overwriting any existing one
        IndexWriter iwriter = new IndexWriter(directory, wrapper, true);
        iwriter.setMaxFieldLength(25000);
        // Make a new, empty document
        Document doc = new Document();
        // /tmp/test.txt must exist: its contents are indexed below
        File f = new File("/tmp/test.txt");

        // Add the path of the file as a field named "path".  Use a field that is
        // indexed (i.e. searchable), but don't tokenize the field into words.
        doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));

        String text = "This is the text to be indexed.";
        doc.add(new Field("fieldname", text, Field.Store.YES, Field.Index.TOKENIZED));

        String name = "2013春装新款女气质修身风衣大翻领双层大摆长款外套 系腰带";
        doc.add(new Field("name", name, Field.Store.YES, Field.Index.TOKENIZED));

        // Add the last modified date of the file as a field named "modified".  Use
        // a field that is indexed (i.e. searchable), but don't tokenize the field
        // into words.
        doc.add(new Field("modified",
                DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        // Add the contents of the file to a field named "contents".  Specify a Reader,
        // so that the text of the file is tokenized and indexed, but not stored.
        // Note that FileReader expects the file to be in the system's default encoding.
        // If that's not the case searching for special characters will fail.
        doc.add(new Field("contents", new FileReader(f)));

        iwriter.addDocument(doc);
        iwriter.optimize();
        iwriter.close();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Start test: ");

        // Any command-line argument selects the Chinese example
        if (args.length > 0) {
            CnExample();
        } else {
            EnExample();
        }

        System.out.println("Index dir: /tmp/testindex");
    }
}
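CnExample above relies on PerFieldAnalyzerWrapper to analyze the "name" field with one analyzer and every other field with a default. The following plain-Java sketch illustrates only that dispatch idea; it does not use Lucene classes, and PerFieldDispatchSketch, whitespace, and perCharacter are hypothetical names standing in for the wrapper, WhitespaceAnalyzer, and a Chinese analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Plain-Java illustration of per-field analyzer dispatch (hypothetical, no Lucene types). */
final class PerFieldDispatchSketch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldDispatchSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    /** Registers a dedicated analyzer for one field name. */
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    /** Uses the field's own analyzer if one was registered, otherwise the default. */
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    /** Stand-in for a whitespace analyzer: splits on runs of whitespace. */
    static List<String> whitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    /** Stand-in for a Chinese analyzer: one token per non-whitespace character. */
    static List<String> perCharacter(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints()
            .filter(cp -> !Character.isWhitespace(cp))
            .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    public static void main(String[] args) {
        PerFieldDispatchSketch wrapper = new PerFieldDispatchSketch(PerFieldDispatchSketch::whitespace);
        wrapper.addAnalyzer("name", PerFieldDispatchSketch::perCharacter);
        System.out.println(wrapper.analyze("fieldname", "This is the text"));
        System.out.println(wrapper.analyze("name", "风衣 外套"));
    }
}
```

The same analyzer choice has to be made at query time, which is why the searcher in this article takes an optional flag to switch to the Chinese analyzer.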

SearcherExample.java

/*
 * Compile: javac -classpath .:lucene-core-2.9.1.jar:ChineseSegmenter/chineseSegmenter.jar SearcherExample.java
 * Run    : java  -classpath .:lucene-core-2.9.1.jar:ChineseSegmenter/chineseSegmenter.jar SearcherExample
 */

import java.io.File;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cn.ChineseAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SearcherExample {

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            throw new Exception("Usage: java " + SearcherExample.class.getName()
                    + " <index dir> <query> [cn]");
        }
        File indexDir = new File(args[0]);
        String q = args[1];
        // Any third argument switches to the Chinese analyzer
        boolean bCn = args.length > 2;

        if (!indexDir.exists() || !indexDir.isDirectory()) {
            throw new Exception(indexDir
                    + " does not exist or is not a directory.");
        }
        search(indexDir, q, bCn);
    }

    public static void search(File indexDir, String q, boolean bCn)
            throws Exception {
        // Open the existing index ("false" means do not create a new one)
        Directory fsDir = FSDirectory.getDirectory(indexDir, false);
        IndexSearcher is = new IndexSearcher(fsDir);

        // The query must be analyzed the same way the "name" field was indexed
        Analyzer analyzer = new StandardAnalyzer();
        if (bCn) {
            analyzer = new ChineseAnalyzer();
        }

        // Parse the query string against the default field "name"
        QueryParser parser = new QueryParser("name", analyzer);
        Query query = parser.parse(q);

        System.out.println("Query: " + query.toString());
        long start = new Date().getTime();
        Hits hits = is.search(query);
        long end = new Date().getTime();

        System.err.println("Found " + hits.length()
                + " document(s) (in " + (end - start)
                + " milliseconds) that matched query '"
                + q + "'");

        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println("HIT " + i + " :" + doc.get("name"));
        }
        is.close();
    }
}

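For Chinese word segmentation you can use the library bundled with Lucene, although its results are poor, or wrap a segmenter of your own. The core of the work is wrapping the segmenter in a Tokenizer.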

package org.apache.lucene.analysis.cn;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.Tokenizer;

public class SnippetTermTokenizer extends Tokenizer {
    private StringBuffer buffer = new StringBuffer();
    private BufferedReader inputBuffer;
    private JNISelecter selecter;      // core Chinese word segmentation class
    private List<Token> tokenList = null;
    private List<String> phraseTokenList = null;
    private Iterator<Token> tokenIter = null;

    public SnippetTermTokenizer(Reader reader, JNISelecter s) {
        inputBuffer = new BufferedReader(reader, 2048);
        selecter = s;
    }

    public Token next() throws IOException {
        if (tokenIter != null) {
            // Input already segmented: return the remaining tokens, then null at the end
            return tokenIter.hasNext() ? tokenIter.next() : null;
        }
        // First call: the content still needs to be read
        readContent();
        // segment() creates tokenIter on success; guard against an empty token list
        if (segment() && tokenIter.hasNext()) {
            return tokenIter.next();
        }
        return null;
    }

    public void close() throws IOException {
        inputBuffer.close();
    }

    // Segmentation-related code (readContent, segment) omitted
}
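The tokenizer above follows a simple pattern: on the first call to next() it reads the whole input, segments it once, and then hands out tokens from an iterator until it returns null. The sketch below shows that pattern in self-contained Java. Since the JNISelecter API is not shown in the article, whitespace splitting is used here purely as a placeholder segmenter, and LazyTokenSourceSketch is a hypothetical class, not a Lucene Tokenizer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Sketch of the "read all, segment once, then iterate" pattern (hypothetical, no Lucene types). */
final class LazyTokenSourceSketch {
    private final BufferedReader input;
    private Iterator<String> tokenIter;   // stays null until the input has been segmented

    LazyTokenSourceSketch(Reader reader) {
        this.input = new BufferedReader(reader, 2048);
    }

    /** Returns the next token, or null once the input is exhausted. */
    String next() {
        if (tokenIter == null) {
            tokenIter = segment(readContent()).iterator();
        }
        return tokenIter.hasNext() ? tokenIter.next() : null;
    }

    /** Reads the entire input into memory. */
    private String readContent() {
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[2048];
        try {
            int n;
            while ((n = input.read(chunk)) != -1) {
                sb.append(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    /** Placeholder segmenter (assumption): splits on whitespace instead of calling a real segmenter. */
    private static List<String> segment(String content) {
        List<String> tokens = new ArrayList<>();
        for (String t : content.split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    void close() {
        try {
            input.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        LazyTokenSourceSketch source = new LazyTokenSourceSketch(new StringReader("大翻领 双层 外套"));
        for (String t = source.next(); t != null; t = source.next()) {
            System.out.println(t);
        }
        source.close();
    }
}
```

Reading the whole field into memory keeps the wrapper simple, at the cost of memory for very large fields; a real segmenter would replace only the segment step.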

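Author: zhenjing.chen
Source: http://www.cnblogs.com/zhenjing/
For articles not marked as reposts, copyright belongs to the author. Reposting is welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original must be given in a prominent place on the page; otherwise the right to pursue legal liability is reserved.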


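CYQ.Data V5 Text Database: How It Works

Preface:

I have had a bit of a cold these past two days. I used to stay up past midnight working on the Weibo fan tool (微博粉丝精灵), and lately I have been up even later tinkering with a few other tools. On a whim, I decided to write something down. Many friends have recently shown interest in the text database in CYQ.Data V5, so here I explain the technical principles behind it.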

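Stability and progress of the CYQ.Data framework:

For the V4 series, the version number stops at V4.55, mainly to preserve one most-stable release. Apart from bug fixes, no higher V4.N version will appear.

As for V5, over the past year or two the code has changed a great deal and the internal class structure has been adjusted considerably. The recorded feature additions and changes alone number forty to fifty, and the unrecorded ones are beyond counting. Still, V5 has been used in various projects for nearly a year and is quite stable.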

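The CYQ.Data text database:

I wrote a related article earlier: 周末一起用文本数据库玩玩Code First (Playing with Code First on a text database over the weekend).

The text database was already used in the earlier OAuth2 login component, to store each user's AccessToken and the corresponding account. I wonder whether anyone noticed.

Another example is the CYQ.Data.ProjectTool project configuration tool released yesterday (source included), which also saves user configuration in a text database.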

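How the CYQ.Data text database works:

First, look at the picture:

Brief description of the project:

This project mainly reads table structures and generates the corresponding enum classes or entity classes. It also needs to save the user's configuration: multiple configurations can be saved, keyed by configuration name, and each time the tool starts it must restore the user's most recent configuration.

What should store the configuration?

Previously I used App.config. XML or ini files would also work, but the code for all of them is relatively laborious to write. When something feels tedious to use, people come up with a way to solve it, and the text database is one such simple solution.

The following fragments from the source code reveal the essence of the text database:

This simple project in fact covers the basic database operations of insert, delete, update, and query: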

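1: Add a user configuration.

2: Update the user configuration when the configuration name already exists.

3: Delete a user configuration (I seem to have forgotten to add this feature).

4: From all configurations, query the record marked IsMain=true and restore it as the default configuration.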

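The project defines a ProjectConfig entity class, Code First style:

The entity additionally inherits from an ORM base class, which gives it basic ORM functionality:

The constructor must specify the table name and the database connection. The connection may be a configuration name: internally, a value without spaces is looked up in web.config, while a value containing spaces is treated as the connection string itself.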
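The rule described above (no space means a configuration name, a space means a literal connection string) can be sketched as follows. This is a Java illustration of the heuristic only, not CYQ.Data's actual code: ConnResolverSketch, resolve, and the Map standing in for web.config are all hypothetical, and throwing on a missing name is an assumption, since the article does not say what the framework does in that case.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the "space or no space" connection rule (hypothetical, not CYQ.Data code). */
final class ConnResolverSketch {

    /**
     * A value containing a space is taken as the connection string itself;
     * a value without spaces is treated as a name and looked up in the configuration.
     */
    static String resolve(String value, Map<String, String> config) {
        if (value.indexOf(' ') >= 0) {
            return value;
        }
        String found = config.get(value);
        if (found == null) {
            // Assumption: the real framework's behavior for an unknown name is not documented here
            throw new IllegalArgumentException("No connection configured under name: " + value);
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();   // stands in for web.config
        config.put("Conn", "server=.;database=qblog");
        System.out.println(resolve("Txt Path={0}", config));   // contains a space: used as-is
        System.out.println(resolve("Conn", config));            // no space: looked up by name
    }
}
```

The heuristic works because typical connection strings such as "Txt Path={0}" contain a space, while configuration keys usually do not.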

The entity class code is as follows:

namespace CYQ.Data.ProjectTool
{
     public class ProjectConfig : CYQ.Data.Orm.OrmBase
    {
         public ProjectConfig()
        {
             base.SetInit(this, "ProjectConfig", "Txt Path={0}");
        }
         private int _ID;
         /// <summary>
         /// Identifier
         /// </summary>
         public int ID
        {
             get
            {
                 return _ID;
            }
             set
            {
                _ID = value;
            }
        }
         private string _Name;
         /// <summary>
         /// Configuration name
         /// </summary>
         public string Name
        {
             get
            {
                 return _Name;
            }
             set
            {
                _Name = value;
            }
        }

         private string _Conn;
         /// <summary>
         /// Connection string
         /// </summary>
         public string Conn
        {
             get
            {
                 return _Conn;
            }
             set
            {
                _Conn = value;
            }
        }

         private string _DBType;
         /// <summary>
         /// Database type
         /// </summary>
         public string DBType
        {
             get
            {
                 return _DBType;
            }
             set
            {
                _DBType = value;
            }
        }

         private bool _MutilDatabase;
         /// <summary>
         /// Whether multi-database mode is supported
         /// </summary>
         public bool MutilDatabase
        {
             get
            {
                 return _MutilDatabase;
            }
             set
            {
                _MutilDatabase = value;
            }
        }
         private string _ProjectPath;

         public string ProjectPath
        {
             get
            {
                 return _ProjectPath;
            }
             set
            {
                _ProjectPath = value;
            }
        }
         private bool _IsMain;

         public bool IsMain
        {
             get
            {
                 return _IsMain;
            }
             set
            {
                _IsMain = value;
            }
        }
         private string _BuildMode;
         /// <summary>
         /// Build mode (enum mode; ORM entity class mode)
         /// </summary>
         public string BuildMode
        {
             get
            {
                 return _BuildMode;
            }
             set
            {
                _BuildMode = value;
            }
        }
         private string _NameSpace;
         /// <summary>
         /// Default namespace
         /// </summary>
         public string NameSpace
        {
             get
            {
                 return _NameSpace;
            }
             set
            {
                _NameSpace = value;
            }
        }

    }
}

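PS: If you change the database connection from Txt Path to Xml Path, you will find that the storage format becomes XML.

Here is a code fragment that adds a configuration:

When the user clicks "Test connection" or "Generate files" and the connection is found to work, the current configuration is saved automatically. The code is as follows: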

string SaveConfig()
{
    string name = ddlName.Text.Trim();
    if (string.IsNullOrEmpty(name))
    {
        name = "DefaultConn";
    }
    ResetMainState();
    using (ProjectConfig config = new ProjectConfig())
    {
        // Read property values from the child controls of the two GroupBoxes
        config.SetAutoParentControl(gbConn, gbBuild);

        if (config.Fill("Name='" + name + "'"))
        {
            // A configuration with this name already exists: update it
            config.IsMain = true;
            config.Update(null, true);
        }
        else
        {
            // No configuration with this name yet: insert it and list it in the drop-down
            config.IsMain = true;
            if (config.Insert(true))
            {
                ddlName.Items.Add(name);
            }
        }
    }
    return name;
}

Look at the line that instantiates the entity class:

using (ProjectConfig config = new ProjectConfig())

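The using syntax:

Not much to explain: when the block ends, the Dispose method is called automatically, closing and releasing the related resources.

Constructor initialization:

The system reads the entity class's properties, assembles them into a table structure, and writes it to the path specified in the database connection.

After running the software, you will find a new file in the specified directory: ProjectConfig.ts

It stores the table structure, with the following content: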

ID,Int,False,False,0,;

Name,NVarChar,False,True,0,;

Conn,NVarChar,False,True,0,;

DBType,NVarChar,False,True,0,;

MutilDatabase,Bit,False,True,0,;

ProjectPath,NVarChar,False,True,0,;

IsMain,Bit,False,True,0,;

BuildMode,NVarChar,False,True,0,;

NameSpace,NVarChar,False,True,0,;

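Brief notes:

The table structure is stored in the format: name, type, nullable, read-only, length, default value.

For a text database this means creating "<table name>.ts"; for any other kind of database connection, the corresponding table is created directly in the database.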
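To make the .ts line format concrete, here is a minimal Java sketch that parses one schema line such as "ID,Int,False,False,0,;" into its six fields, in the order the article states (name, type, nullable, read-only, length, default value). SchemaLineSketch is a hypothetical class, not part of CYQ.Data, and the sketch assumes that a default value never contains a comma.

```java
/** Sketch of a parser for one ".ts" schema line (hypothetical, not CYQ.Data code). */
final class SchemaLineSketch {
    final String name;
    final String type;
    final boolean allowNull;
    final boolean readOnly;
    final int length;
    final String defaultValue;

    private SchemaLineSketch(String name, String type, boolean allowNull,
                             boolean readOnly, int length, String defaultValue) {
        this.name = name;
        this.type = type;
        this.allowNull = allowNull;
        this.readOnly = readOnly;
        this.length = length;
        this.defaultValue = defaultValue;
    }

    /** Parses a line of the form "name,type,nullable,readOnly,length,default;". */
    static SchemaLineSketch parse(String line) {
        String s = line.trim();
        if (s.endsWith(";")) {
            s = s.substring(0, s.length() - 1);   // drop the line terminator
        }
        // Limit -1 keeps the trailing empty default value
        String[] parts = s.split(",", -1);
        if (parts.length != 6) {
            throw new IllegalArgumentException("Expected 6 fields but found " + parts.length + ": " + line);
        }
        return new SchemaLineSketch(parts[0], parts[1],
                Boolean.parseBoolean(parts[2]), Boolean.parseBoolean(parts[3]),
                Integer.parseInt(parts[4]), parts[5]);
    }

    public static void main(String[] args) {
        SchemaLineSketch col = parse("Name,NVarChar,False,True,0,;");
        System.out.println(col.name + " " + col.type + " readOnly=" + col.readOnly);
    }
}
```

Keeping one column per line in plain text is what lets the schema file be inspected or edited by hand.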

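The property assignments and basic ORM operations that follow are covered only briefly:

Here the values are taken through UI binding: config.SetAutoParentControl(gbConn, gbBuild);

This automatically reads the values from the child controls of the two GroupBoxes.

The IsMain property does not appear among the controls, so it has to be assigned separately.

Thanks to the UI binding, restoring a configuration also takes just one call to SetToAll():

The source contains the following code, which restores all settings by configuration name; a quick look is enough: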

void LoadConfig(string name)
{
    if (!string.IsNullOrEmpty(name))
    {
        using (ProjectConfig config = new ProjectConfig())
        {
            if (config.Fill("Name='" + name + "'"))
            {
                config.SetToAll(gbConn, gbBuild);
            }
        }
    }
}

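So far, for the text database, we have only said that the table structure is generated automatically and stored as a "*.ts" file.

What about storing the data?

For the text database, CYQ.Data has two core classes:

1: JsonHelper: the class that deals with JSON.

2: MDataTable: an in-memory table with powerful features, able to load from and output to JSON or XML.

The text database is implemented on top of these two classes.

In storage, if there are multiple text files, there are multiple static MDataTable instances.

Everything works through the interaction between MDataTable and JSON (or XML), loading and writing out.

So after running the tool you should see this file: ProjectConfig.txt, which stores JSON like the following: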

{ "ID" : "System.Int32","Name":"System.String","Conn":"System.String","DBType":"System.String","MutilDatabase":"System.Boolean","ProjectPath":"System.String","IsMain":"System.Boolean","BuildMode":"System.String","NameSpace":"System.String" } ,
{ "ID" : "1","Name":"DefaultConn","Conn":"server=. ; database=qblog;uid=sa;pwd=123456","DBType" : "Mssql","MutilDatabase":"False","ProjectPath":"","IsMain":"True","BuildMode":"实体型（ORM操作方式）","NameSpace":"Web.Entity" }

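Brief notes:

Depending on circumstances, the first line may store the data types. This ensures that if the table structure is lost, restoring an MDataTable from the JSON still yields a basic data structure.

The first line may also be JSON data directly; the system uses certain specific markers to tell whether the first line is schema or data.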

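How insert, delete, update, and query work in the file database:

If you are familiar with DataTable, MDataTable will feel familiar too. All insert, delete, update, and query operations of the text database take place in MDataTable.

Unifying the framework: paging and SQL query syntax:

For text, the stored result is JSON. To stay uniform across the different data sources, it must offer the basic features a database should have: paging and queries written in SQL syntax.

To that end, I added an important feature to MDataTable: it parses the SQL statement, compares the columns, and applies a custom sort to the data rows, thereby filtering out the final result.

As a result, MDataTable is in fact a powerful table class with paging and query capabilities. It can also work independently of a database: once you have the data, you can keep paging and querying it.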
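The filter, sort, and page pipeline described above can be illustrated with a small in-memory table in Java. This is a sketch of the general technique only, not MDataTable's implementation: MemoryTableSketch and its methods are hypothetical, a Predicate stands in for the parsed SQL where-clause (no SQL parsing is done), sorting is a plain ascending string comparison, and page numbers are assumed to be 1-based.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Sketch of an in-memory table with filtering, sorting, and paging (hypothetical, not MDataTable). */
final class MemoryTableSketch {
    private final List<Map<String, String>> rows = new ArrayList<>();

    void addRow(Map<String, String> row) {
        rows.add(row);
    }

    /** Builds a row from alternating column names and values. */
    static Map<String, String> row(String... keyValues) {
        Map<String, String> r = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            r.put(keyValues[i], keyValues[i + 1]);
        }
        return r;
    }

    /**
     * Filters rows with the predicate, sorts them ascending by the orderBy column
     * (rows missing that column come first), and returns one page; pageIndex is 1-based.
     */
    List<Map<String, String>> select(Predicate<Map<String, String>> where,
                                     String orderBy, int pageIndex, int pageSize) {
        if (pageIndex < 1 || pageSize < 1) {
            throw new IllegalArgumentException("pageIndex and pageSize must be at least 1");
        }
        return rows.stream()
                .filter(where)
                .sorted(Comparator.comparing(
                        (Map<String, String> r) -> r.get(orderBy),
                        Comparator.nullsFirst(Comparator.<String>naturalOrder())))
                .skip((long) (pageIndex - 1) * pageSize)
                .limit(pageSize)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        MemoryTableSketch table = new MemoryTableSketch();
        table.addRow(row("Name", "TestConn", "IsMain", "False"));
        table.addRow(row("Name", "DefaultConn", "IsMain", "True"));
        System.out.println(table.select(r -> "True".equals(r.get("IsMain")), "Name", 1, 10));
    }
}
```

Because the rows live entirely in memory, the same select call works whether they were loaded from a JSON file or fetched from a database.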

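Other features of MDataTable:

MDataTable also provides basic batch insert and update, which is very important. (The multi-database data transfer feature of CYQ.DBImport in fact uses it: query an MDataTable from one database, call the AccpertChange function, and the data is transferred in bulk to another database. It takes only a few lines of code and is very convenient.)

Summary:

At this point the file database has few secrets left:

1: The table structure is generated automatically from the entity class (or loaded automatically if it already exists).

2: The storage format is JSON, with JsonHelper and MDataTable handling loading and writing.

3: Insert, delete, update, and query actually rely on MDataTable operating on data rows.

4: The Code First mode is essentially built on MAction.

Comments from passers-by are welcome...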


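Author: Leo_wl

Source: http://www.cnblogs.com/Leo_wl/

Copyright of this article is shared by the author and cnblogs (博客园). Reposting is welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original must be given in a prominent place on the page; otherwise the right to pursue legal liability is reserved.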


Notice: this article comes from the web and does not represent the views of 好得很程序员自学网. When reposting, please cite the source: http://www.haodehen.cn/did46594

Updated: 2022-09-24
