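Revisiting Gensim, an NLP Must-Know

Studying feels great for a moment; studying all the time feels great all the time. (connor's catchphrase)

Hello, everyone. I'm もうり, a tech newbie who started from nothing.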

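Someone suddenly asked me: what is gensim?

If you don't even know Gensim,

don't bother playing with NLP.

I dug through my blog,

and it turns out I really had studied gensim.

After a quick look, it all came back to me.

Below, we use Gensim

to compute the TF-IDF weight of every word.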

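What is Gensim

Gensim is an open-source, third-party Python toolkit for learning, without supervision, the latent topic-vector representation of text from raw, unstructured documents. It supports a range of topic-modeling and vector-space algorithms, including TF-IDF, LSA, LDA, and word2vec, supports streamed training, and provides APIs for common tasks such as similarity computation and information retrieval.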

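A few concepts first:

Corpus: a collection of raw texts used to train the latent topic structure without supervision. The corpus needs no manually annotated extra information. In Gensim, a corpus is usually an iterable object (a list, for example). Each iteration returns a sparse vector that represents one text.

Vector: a list made up of a set of text features. It is the internal representation of a piece of text in Gensim.

Sparse vector: usually we can leave out the redundant zero elements of a vector. Each remaining element is then a (key, value) tuple.

Model: an abstract term. It defines a transformation between two vector spaces (that is, from one vector representation of a text to another).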
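To make the sparse-vector idea concrete, here is a minimal pure-Python sketch. The helper names `to_sparse` and `to_dense` are my own, for illustration only; they are not part of Gensim.

```python
def to_sparse(dense):
    """Keep only the non-zero entries of a dense vector as (index, value) pairs."""
    return [(i, v) for i, v in enumerate(dense) if v != 0]


def to_dense(sparse, length):
    """Rebuild the dense list from (index, value) pairs and the vector length."""
    dense = [0] * length
    for i, v in sparse:
        dense[i] = v
    return dense


dense_vector = [1, 0, 0, 2, 0, 0, 0, 1]
sparse_vector = to_sparse(dense_vector)
print(sparse_vector)  # [(0, 1), (3, 2), (7, 1)]
print(to_dense(sparse_vector, len(dense_vector)) == dense_vector)  # True
```

With a vocabulary of many thousands of words and documents of a few dozen words each, storing only the non-zero pairs saves a lot of memory.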

corpora, models, and similarities are the three gensim modules you will use most.

The best way to learn is to get really familiar with the official documentation.

Processing strings

The sample corpus contains 9 documents, each consisting of a single sentence.

    documents = [
        "Human machine interface for lab abc computer applications",
        "A survey of user opinion of computer system response time",
        "The EPS user interface management system",
        "System and human system engineering testing of EPS",
        "Relation of user perceived response time to error measurement",
        "The generation of random binary unordered trees",
        "The intersection graph of paths in trees",
        "Graph minors IV Widths of trees and well quasi ordering",
        "Graph minors A survey",
    ]

First, let's tokenize the documents, remove common words, and drop words that appear only once:

    from pprint import pprint  # pretty-printer
    from collections import defaultdict  # used for counting

    # Remove common words (stop words) and tokenize
    stoplist = set('for a of the and to in'.split())
    texts = [
        [word for word in document.lower().split() if word not in stoplist]
        for document in documents
    ]

    # Count how often each token occurs across the corpus
    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1

    # Tokens with a frequency of 1 are not useful here, so drop them
    texts = [
        [token for token in text if frequency[token] > 1]
        for text in texts
    ]

    pprint(texts)

    OUT:
    [['human', 'interface', 'computer'],
     ['survey', 'user', 'computer', 'system', 'response', 'time'],
     ['eps', 'user', 'interface', 'system'],
     ['system', 'human', 'system', 'eps'],
     ['user', 'response', 'time'],
     ['trees'],
     ['graph', 'trees'],
     ['graph', 'minors', 'trees'],
     ['graph', 'minors', 'survey']]

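corpora (the corpus)

With the conversion below, preprocessing of the training corpus is complete. We get a sparse vector (here a bag-of-words vector) for every document in the corpus; each element of the vector records how many times a word occurs in that document. Note that although the bag-of-words model is the basic assumption behind many topic models, the doc2bow function introduced here is not the only way to turn text into a sparse vector.

Converting to sparse vectors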

    texts = [['human', 'interface', 'computer'],
             ['survey', 'user', 'computer', 'system', 'response', 'time'],
             ['eps', 'user', 'interface', 'system'],
             ['system', 'human', 'system', 'eps'],
             ['user', 'response', 'time'],
             ['trees'],
             ['graph', 'trees'],
             ['graph', 'minors', 'trees'],
             ['graph', 'minors', 'survey']]

    from gensim import corpora
    dictionary = corpora.Dictionary(texts)
    print(dictionary)
    # Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)

    # Mapping from each token to its integer id (an id, not a frequency)
    print(dictionary.token2id)
    # {'minors': 11, 'graph': 10, 'system': 5, 'trees': 9, 'eps': 8, 'computer': 0,'survey': 4, 'user': 7, 'human': 1, 'time': 6, 'interface': 2, 'response': 3}

    # Convert each tokenized document to a vector with the doc2bow method
    corpus = [dictionary.doc2bow(text) for text in texts]
    print(corpus[0])
    # [(0, 1), (1, 1), (2, 1)]
    # Each pair is (token id, count); (0, 1) means 'computer' (id 0) occurs once
    print(len(corpus))
    # 9
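Roughly speaking, doc2bow counts the tokens of a document and reports each count under the token's id. Here is a minimal pure-Python sketch of that idea. `simple_doc2bow` and the sorted-vocabulary id scheme are my own stand-ins, so the ids will not necessarily match the ones Gensim assigns.

```python
from collections import Counter

texts = [['human', 'interface', 'computer'],
         ['system', 'human', 'system', 'eps']]

# Hypothetical stand-in for corpora.Dictionary: ids follow the sorted vocabulary.
# Gensim's own id assignment may differ.
vocabulary = sorted({token for doc in texts for token in doc})
token2id = {token: i for i, token in enumerate(vocabulary)}


def simple_doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens without an id."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


print(token2id)
# {'computer': 0, 'eps': 1, 'human': 2, 'interface': 3, 'system': 4}
print(simple_doc2bow(texts[1], token2id))
# [(1, 1), (2, 1), (4, 2)]
print(simple_doc2bow(['human', 'unknown'], token2id))
# [(2, 1)]
```

Note how 'system' occurs twice in the second document and therefore gets a count of 2, while a token that is not in the vocabulary is silently dropped.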

models (topic vector transformations)

    from gensim import models
    tfidf = models.TfidfModel(corpus)
    print(tfidf)
    # TfidfModel(num_docs=9, num_nnz=28)

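From now on, tfidf is treated as a read-only object that can convert any vector from the old representation (bag-of-words integer counts) to the new representation (TF-IDF real-valued weights). Here corpus is an iterable that yields bag-of-words vectors. Those two lines of code collect the IDF statistics for every feature that appears in corpus.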
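To see what those IDF statistics amount to, here is a pure-Python sketch under one assumption: that the weight is count × log2(N / df), followed by scaling each vector to unit length. I believe this matches Gensim's default settings, and it does reproduce the numbers printed later in this post, but treat it as an illustration and not as Gensim's implementation. The `corpus` literal is reconstructed from the token ids shown in this post; `fit_idf` and `transform` are hypothetical helper names.

```python
import math

# Bag-of-words corpus: one list of (token_id, count) pairs per document.
corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]


def fit_idf(corpus):
    """Count document frequencies, then turn them into IDF weights."""
    num_docs = len(corpus)
    dfs = {}
    for doc in corpus:
        for token_id, _count in doc:
            dfs[token_id] = dfs.get(token_id, 0) + 1
    # Assumption: idf = log2(N / df)
    return {token_id: math.log2(num_docs / df) for token_id, df in dfs.items()}


def transform(doc, idf):
    """Weight counts by IDF, then scale the vector to unit Euclidean length."""
    weighted = [(tid, count * idf[tid]) for tid, count in doc if tid in idf]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(tid, w / norm) for tid, w in weighted] if norm else []


idf = fit_idf(corpus)
for token_id, weight in transform(corpus[3], idf):
    print(token_id, round(weight, 4))
# 1 0.4918
# 5 0.7185
# 8 0.4918
```

Token 5 ('system') appears in three documents, so its IDF is lower than that of tokens 1 and 8, but it occurs twice in this document, which is why it still ends up with the largest weight.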

    tfidf.save("model.tfidf")  # save
    tfidf = models.TfidfModel.load("model.tfidf")  # load

Using the model

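    # [(0, 1), (1, 1)] stands for 'computer' and 'human' (ids 0 and 1), each occurring once
    doc_bow = [(0, 1), (1, 1)]
    # TF-IDF real-valued weights
    print(tfidf[doc_bow])

    # In each pair Gensim returns, the left value is the token id and the right value is its TF-IDF weight
    OUT:
    [(0, 0.70710678118654757), (1, 0.70710678118654757)]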

Applying it to the whole corpus

    corpus_tfidf = tfidf[corpus]
    for doc in corpus_tfidf:
        print(doc)

    OUT:
    [(0, 0.57735026918962573), (1, 0.57735026918962573), (2, 0.57735026918962573)]
    [(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.32448702061385548), (6, 0.44424552527467476), (7, 0.32448702061385548)]
    [(2, 0.5710059809418182), (5, 0.41707573620227772), (7, 0.41707573620227772), (8, 0.5710059809418182)]
    [(1, 0.49182558987264147), (5, 0.71848116070837686), (8, 0.49182558987264147)]
    [(3, 0.62825804686700459), (6, 0.62825804686700459), (7, 0.45889394536615247)]
    [(9, 1.0)]
    [(9, 0.70710678118654746), (10, 0.70710678118654746)]
    [(9, 0.50804290089167492), (10, 0.50804290089167492), (11, 0.69554641952003704)]
    [(4, 0.62825804686700459), (10, 0.45889394536615247), (11, 0.62825804686700459)]
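One thing worth noticing in this output is that every vector appears to have Euclidean length 1, which also explains why the one-word document gets a weight of exactly 1.0. A quick sanity check in plain Python, using three of the vectors copied from the output above (`l2_norm` is just a helper name I picked):

```python
import math

# First, fourth, and sixth vectors from the printed output.
outputs = [
    [(0, 0.57735026918962573), (1, 0.57735026918962573), (2, 0.57735026918962573)],
    [(1, 0.49182558987264147), (5, 0.71848116070837686), (8, 0.49182558987264147)],
    [(9, 1.0)],
]


def l2_norm(vec):
    """Euclidean length of a sparse vector given as (id, weight) pairs."""
    return math.sqrt(sum(weight * weight for _, weight in vec))


norms = [l2_norm(vec) for vec in outputs]
print([round(n, 6) for n in norms])  # [1.0, 1.0, 1.0]
```

Because of this normalization, weights are only comparable within a document's own vector; a 0.57 in one document and a 0.57 in another do not mean the same raw score.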

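OK, that's all for today. One last thing I want to say:

Studying feels great for a moment; studying all the time feels great all the time.

Now hurry up and go give me some praise.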


Last updated: 2022-11-28