python - TfidfVectorizer in scikit-learn: ValueError: np.nan is an invalid document


I'm using TfidfVectorizer from scikit-learn to do feature extraction on text data. I have a CSV file with a score (either +1 or -1) and a review (text). I pulled this data into a DataFrame so I can run the vectorizer.

This is my code:

    import pandas as pd
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    df = pd.read_csv("train_new.csv",
                     names=['score', 'review'], sep=',')

    # x = df['review'] == np.nan
    #
    # print x.to_csv(path='findnan.csv', sep=',', na_rep = 'string', index=True)
    #
    # print df.isnull().values.any()

    v = TfidfVectorizer(decode_error='replace', encoding='utf-8')
    x = v.fit_transform(df['review'])

This is the traceback for the error I get:

    Traceback (most recent call last):
      File "/home/pycharmprojects/review/src/feature_extraction.py", line 16, in <module>
        x = v.fit_transform(df['review'])
      File "/home/b/hw1/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1305, in fit_transform
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
      File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
        self.fixed_vocabulary_)
      File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 752, in _count_vocab
        for feature in analyze(doc):
      File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
        tokenize(preprocess(self.decode(doc))), stop_words)
      File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 118, in decode
        raise ValueError("np.nan is an invalid document, expected byte or "
    ValueError: np.nan is an invalid document, expected byte or unicode string.

I checked the CSV file and the DataFrame for anything that's being read as NaN, but I can't find anything. There are 18000 rows, and none of them return True for isnan.

This is what df['review'].head() looks like:

    0    book such life saver.  has been s...
    1    bought few times older son and...
    2    great basics, wish space...
    3    book perfect!  i'm first time new mo...
    4    during postpartum stay @ hospital th...
    Name: review, dtype: object
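One likely reason the commented-out check in the question found nothing: NaN never compares equal to anything, including itself, so `df['review'] == np.nan` is False for every row. Also, `dtype: object` only means the column holds arbitrary Python objects, so a float NaN can sit among the strings. The sketch below uses a small in-memory CSV as a hypothetical stand-in for train_new.csv and assumes the missing value comes from an empty review field.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for train_new.csv: the second row has an empty review.
csv_text = "1,great book\n-1,\n1,would buy again\n"
df = pd.read_csv(io.StringIO(csv_text), names=["score", "review"], sep=",")

# NaN is not equal to anything, so an equality test flags no rows at all.
eq_mask = df["review"] == np.nan

# isnull() is the check that actually marks missing values.
null_mask = df["review"].isnull()

print(eq_mask.any())    # whether the equality test found anything
print(null_mask.sum())  # number of missing reviews
print(df[null_mask])    # the rows that would break the vectorizer
```

Running `df[df['review'].isnull()]` on the real DataFrame should list the rows that trigger the ValueError, if missing values are indeed the cause.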

You need to convert the dtype object to a unicode string, as mentioned in the traceback.

    x = v.fit_transform(df['review'].values.astype('U'))  ## astype(str) would also work
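One caveat with the cast: `astype('U')` turns a missing value into the literal text 'nan', which the vectorizer then treats as an ordinary word. The following is a minimal sketch on hypothetical sample data (the `reviews` Series is made up for illustration) comparing the cast with filling missing values with an empty string first.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample data with one missing review.
reviews = pd.Series(
    ["great book", np.nan, "would buy this book again"], dtype=object
)

# Option 1: cast to unicode. The missing value becomes the string 'nan'.
v_cast = TfidfVectorizer(decode_error="replace", encoding="utf-8")
x_cast = v_cast.fit_transform(reviews.values.astype("U"))

# Option 2: replace missing values with an empty document before vectorizing.
v_fill = TfidfVectorizer(decode_error="replace", encoding="utf-8")
x_fill = v_fill.fit_transform(reviews.fillna(""))

print("nan" in v_cast.vocabulary_)  # does the cast add a 'nan' feature?
print("nan" in v_fill.vocabulary_)  # does fillna avoid it?
print(x_cast.shape, x_fill.shape)   # both keep one row per review
```

Both options keep the row count unchanged, so the scores stay aligned with the rows of the matrix. Which one is preferable depends on whether a spurious 'nan' token matters for the model.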

From the doc page of TfidfVectorizer:

fit_transform(raw_documents, y=None)

Parameters: raw_documents : iterable
An iterable which yields either str, unicode or file objects
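Since every item yielded by raw_documents has to be a string, another option is to drop the rows with a missing review before vectorizing. The sketch below is an assumption about what might suit this data set, not something the answer prescribes; the small DataFrame is hypothetical. Dropping whole rows keeps each score paired with its review.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same columns as in the question.
df = pd.DataFrame(
    {"score": [1, -1, 1], "review": ["great book", np.nan, "would buy again"]}
)

# Drop whole rows so that scores and reviews stay aligned.
clean = df.dropna(subset=["review"])
docs = clean["review"].tolist()
labels = clean["score"].tolist()

# Sanity check: every remaining document is a string.
all_strings = all(isinstance(doc, str) for doc in docs)
print(len(docs), labels, all_strings)
```

The resulting `docs` list can then be passed to `fit_transform` in place of `df['review']`.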

