I'm using TfidfVectorizer from scikit-learn to do feature extraction from text data. I have a CSV file with a score (can be +1 or -1) and a review (text). I pulled this data into a DataFrame so I can run the vectorizer.
This is my code:
    import pandas as pd
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    df = pd.read_csv("train_new.csv", names=['score', 'review'], sep=',')

    # x = df['review'] == np.nan
    #
    # print x.to_csv(path='findnan.csv', sep=',', na_rep = 'string', index=True)
    #
    # print df.isnull().values.any()

    v = TfidfVectorizer(decode_error='replace', encoding='utf-8')
    x = v.fit_transform(df['review'])
This is the traceback for the error I get:
    Traceback (most recent call last):
      File "/home/pycharmprojects/review/src/feature_extraction.py", line 16, in <module>
        x = v.fit_transform(df['review'])
      File "/home/b/hw1/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1305, in fit_transform
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
      File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
        self.fixed_vocabulary_)
      File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 752, in _count_vocab
        for feature in analyze(doc):
      File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
        tokenize(preprocess(self.decode(doc))), stop_words)
      File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 118, in decode
        raise ValueError("np.nan is an invalid document, expected byte or "
    ValueError: np.nan is an invalid document, expected byte or unicode string.
I checked the CSV file and the DataFrame for anything that's being read as NaN, but I can't find anything. There are 18000 rows, and none of them return True for isnan.
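One possible reason a check like this comes up empty: the commented-out test `df['review'] == np.nan` can never match, because NaN does not compare equal to anything, itself included. The sketch below uses a small made-up DataFrame (not the real train_new.csv) to show the difference between that comparison and `isnull()`; the sample rows are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data; the second review is missing.
df = pd.DataFrame({
    "score": [1, -1, 1],
    "review": ["great book", np.nan, "wish it had more space"],
})

# NaN never compares equal to anything, so this mask is all False
# even though a NaN is present in the column.
eq_mask = df["review"] == np.nan

# isnull() is the reliable test for missing values.
null_mask = df["review"].isnull()

print(eq_mask.any())
print(null_mask.any())

# Show the offending rows so they can be inspected in the CSV.
print(df[null_mask])
```

Running the same `isnull()` check on the `review` column alone, rather than eyeballing the file, should point at the rows the vectorizer is rejecting.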
This is what df['review'].head() looks like:
    0    book such life saver. has been s...
    1    bought few times older son and...
    2    great basics, wish space...
    3    book perfect! i'm first time new mo...
    4    during postpartum stay @ hospital th...
    Name: review, dtype: object
You need to convert the dtype object to a unicode string, as mentioned in the traceback.
    x = v.fit_transform(df['review'].values.astype('U'))  ## Even astype(str) would work
From the doc page of TfidfVectorizer:
    fit_transform(raw_documents, y=None)
    Parameters: raw_documents : iterable
        an iterable which yields either str, unicode or file objects
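One thing to keep in mind with the cast: a missing value turns into the literal string "nan", so the error goes away but the empty reviews are still in the data. A minimal sketch of the alternative, dropping those rows before vectorizing, is below. The sample DataFrame and the variable names (`v_cast`, `v_clean`, `clean`) are made up for illustration and are not from the original code.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for train_new.csv; the second review is missing.
df = pd.DataFrame({
    "score": [1, -1, 1],
    "review": ["great book", np.nan, "perfect for a new mom"],
})

# Option 1: cast to unicode. The NaN becomes the string "nan",
# which the vectorizer then treats as an ordinary token.
v_cast = TfidfVectorizer()
X_cast = v_cast.fit_transform(df["review"].values.astype("U"))
print("nan" in v_cast.vocabulary_)

# Option 2: drop rows with no review, keeping scores aligned with the text.
clean = df.dropna(subset=["review"])
v_clean = TfidfVectorizer()
X_clean = v_clean.fit_transform(clean["review"])
y_clean = clean["score"]

print(X_cast.shape[0], X_clean.shape[0])
```

Which option fits depends on whether a review-less row should still count as a training example; `df['review'].fillna('')` is a third choice if the rows must be kept without adding a "nan" token.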