On Linux, I set the environment variable $NLTK_DATA ('/home/user/data/nltk'), and a quick test works as expected:
>>> from nltk.corpus import brown
>>> brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
But when I run another Python script, I get:
LookupError:
**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:
  >>> nltk.download()
  Searched in:
    - '/home/user/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - u''
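As we can see, nltk does not add $NLTK_DATA to its search path. After appending the NLTK_DATA directory manually:
nltk.data.path.append("/NLTK_DATA_DIR")
the script runs as expected. The question is:
How can I make nltk add $NLTK_DATA to its search path automatically?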
If you don't want to set $NLTK_DATA before running your scripts, you can do it inside the Python script instead:
import nltk
nltk.data.path.append('/home/alvas/some_path/nltk_data/')
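If you would still like the script to honour $NLTK_DATA whenever it is set, the append can be driven by the environment instead of a hard-coded path. The following is only a minimal sketch: `add_env_paths` is a hypothetical helper (not part of NLTK), and it assumes the variable may list several directories separated by `os.pathsep`.

```python
import os


def add_env_paths(search_path, environ=None, var="NLTK_DATA"):
    """Append each directory named in `var` to `search_path`, skipping duplicates.

    Hypothetical helper for illustration; returns the entries it added.
    """
    environ = os.environ if environ is None else environ
    added = []
    # Split on os.pathsep so that "dir1:dir2" style values are handled too.
    for entry in environ.get(var, "").split(os.pathsep):
        if entry and entry not in search_path:
            search_path.append(entry)
            added.append(entry)
    return added
```

In a script this would be called as `add_env_paths(nltk.data.path)` right after `import nltk`. If it returns an empty list, the variable is probably not visible to the process running the script, for example because it was set in the shell without `export`.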
For example, let's move nltk_data to a non-standard path where NLTK cannot find it automatically:
alvas@ubi:~$ ls nltk_data/
chunkers corpora grammars help misc models stemmers taggers tokenizers
alvas@ubi:~$ mkdir some_path
alvas@ubi:~$ mv nltk_data/ some_path/
alvas@ubi:~$ ls nltk_data/
ls: cannot access nltk_data/: No such file or directory
alvas@ubi:~$ ls some_path/nltk_data/
chunkers corpora grammars help misc models stemmers taggers tokenizers
Now we use the nltk.data.path.append() hack:
alvas@ubi:~$ python
>>> import os
>>> import nltk
>>> nltk.data.path.append('/home/alvas/some_path/nltk_data/')
>>> nltk.pos_tag('this is a foo bar'.split())
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'JJ'), ('bar', 'NN')]
>>> nltk.data
<module 'nltk.data' from '/usr/local/lib/python2.7/dist-packages/nltk/data.pyc'>
>>> nltk.data.path
['/home/alvas/some_path/nltk_data/', '/home/alvas/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']
>>> exit()
Let's move it back and see whether it still works:
alvas@ubi:~$ ls nltk_data
ls: cannot access nltk_data: No such file or directory
alvas@ubi:~$ mv some_path/nltk_data/ .
alvas@ubi:~$ python
>>> import nltk
>>> nltk.data.path
['/home/alvas/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']
>>> nltk.pos_tag('this is a foo bar'.split())
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'JJ'), ('bar', 'NN')]
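Both sessions above print nltk.data.path; when a LookupError still appears, it can help to check which entry of that list actually holds the resource. Here is a rough sketch under stated assumptions: `locate_resource` is a hypothetical helper, it only looks for unpacked files on disk, and it does not replicate everything NLTK's own lookup does (for instance, it ignores zipped packages).

```python
import os


def locate_resource(search_path, resource):
    """Return the first directory in `search_path` containing `resource`, else None.

    `resource` uses forward slashes, e.g. 'tokenizers/punkt/english.pickle'.
    """
    parts = resource.split("/")
    for directory in search_path:
        # Skip empty entries such as '' and directories that lack the file.
        if directory and os.path.exists(os.path.join(directory, *parts)):
            return directory
    return None
```

Calling `locate_resource(nltk.data.path, 'tokenizers/punkt/english.pickle')` would then return either the directory that contains the file or `None` if none of the listed directories has it.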
If you really want nltk_data to be found automatically, use something like the following:
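from __future__ import print_function  # keeps the script runnable on Python 2 and 3

import os
import sys
import time

import nltk
import scandir  # third-party package; on Python 3.5+ os.walk is already scandir-based


def find(name, path):
    """Return the first directory under `path` whose path ends with `name`, or None."""
    for root, dirs, files in scandir.walk(path):
        if root.endswith(name):
            return root


def find_nltk_data():
    """Walk the whole filesystem for nltk_data and cache the result in a text file."""
    start = time.time()
    path_to_nltk_data = find('nltk_data', '/')
    if path_to_nltk_data is None:
        raise LookupError('no nltk_data directory found under /')
    print('Finding nltk_data took', time.time() - start, file=sys.stderr)
    print('nltk_data at', path_to_nltk_data, file=sys.stderr)
    with open('where_is_nltk_data.txt', 'w') as fout:
        fout.write(path_to_nltk_data)
    return path_to_nltk_data


def magically_find_nltk_data():
    """Reuse the cached location if it is still valid; otherwise search again."""
    if os.path.exists('where_is_nltk_data.txt'):
        with open('where_is_nltk_data.txt') as fin:
            path_to_nltk_data = fin.read().strip()
        if os.path.exists(path_to_nltk_data):
            nltk.data.path.append(path_to_nltk_data)
        else:
            nltk.data.path.append(find_nltk_data())
    else:
        path_to_nltk_data = find_nltk_data()
        nltk.data.path.append(path_to_nltk_data)


magically_find_nltk_data()
print(nltk.pos_tag('this is a foo bar'.split()))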
Let's call the Python script test.py:
alvas@ubi:~$ ls nltk_data/
chunkers corpora grammars help misc models stemmers taggers tokenizers
alvas@ubi:~$ python test.py
Finding nltk_data took 4.27330780029
nltk_data at /home/alvas/nltk_data
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'JJ'), ('bar', 'NN')]
alvas@ubi:~$ mv nltk_data/ some_path/
alvas@ubi:~$ python test.py
Finding nltk_data took 4.75850391388
nltk_data at /home/alvas/some_path/nltk_data
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'JJ'), ('bar', 'NN')]
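Walking all of / took more than four seconds in both runs above. If you have a rough idea where the data might live, a narrower search is usually enough. This is a sketch rather than a drop-in replacement: `find_dir` and its `max_depth` parameter are hypothetical, it uses the standard library's `os.walk` instead of `scandir.walk`, and it compares the directory's base name exactly instead of testing how the path ends.

```python
import os


def find_dir(name, roots, max_depth=3):
    """Return the first directory called `name` at most `max_depth` levels below a root."""
    for root in roots:
        root = os.path.abspath(os.path.expanduser(root))
        base_depth = root.rstrip(os.sep).count(os.sep)
        for current, dirs, _files in os.walk(root):
            if os.path.basename(current) == name:
                return current
            # Clearing `dirs` in place tells os.walk not to descend any further.
            if current.rstrip(os.sep).count(os.sep) - base_depth >= max_depth:
                del dirs[:]
    return None
```

For instance, `find_dir('nltk_data', ['~', '/usr/share', '/usr/local/share'])` would look only a few levels below the home directory and two of the standard locations, and the result could be appended to nltk.data.path in the same way as in test.py.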