word_tokenize TypeError: expected string or buffer
When calling word_tokenize I get the following error (traceback truncated): File 'C:\Python34\lib\site-packages\nltk\tokenize\punkt.py', line 1322, in _slices_from_text: for match in self._lang_v... TypeError: expected string or buffer
Solution 1:
The input for word_tokenize is a sentence string, i.e. one item from a list of sentence strings such as ['this is sentence 1.', "that's sentence 2!"].
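For instance, a quick illustrative check (not from the original question):

from nltk.tokenize import word_tokenize

word_tokenize("This is sentence 1.")
# ['This', 'is', 'sentence', '1', '.']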
File_1500 is a File object, not a string (or a list of sentence strings), and that's why it's not working.
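A minimal way to see the difference (the filename here is a made-up placeholder, assuming any plain-text file):

from nltk.tokenize import word_tokenize

with open('example.txt', 'r', encoding='ISO-8859-1') as fin:
    # word_tokenize(fin)  # raises TypeError: fin is a file object, not a string
    tokens = word_tokenize(fin.read())  # works: pass the file's contents instead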
To get a list of sentence strings, first read the file into a single string with fin.read(), then use sent_tokenize to split it into sentences (I'm assuming that your input file is raw text, not already sentence-tokenized).
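For example, sent_tokenize turns a raw text string into that list of sentence strings:

from nltk.tokenize import sent_tokenize

sent_tokenize("This is sentence 1. That's sentence 2!")
# ['This is sentence 1.', "That's sentence 2!"]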
Also, it's better / more idiomatic to tokenize a file this way with NLTK:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words("english"))

with open('E:\\Book\\1500.txt', "r", encoding='ISO-8859-1') as fin:
    for sent in sent_tokenize(fin.read()):
        words = word_tokenize(sent)
        # Keep only the tokens that are not stopwords.
        filtered_sentence = [w for w in words if w not in stop_words]
        print(filtered_sentence)
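Note that sent_tokenize, word_tokenize, and stopwords all rely on NLTK data packages; if they're missing, download them once first (e.g. from an interactive session):

import nltk
nltk.download('punkt')      # models used by sent_tokenize / word_tokenize
nltk.download('stopwords')  # stopword lists used by stopwords.words(...)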