python - Stemming Indonesian words with Sastrawi


I have a CSV dataset; a sample of the data was shown in an attached image.

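I want to preprocess this data. It is text data, for text mining. I'm confused about the stemming step. I have tried stemming the data, but the result is a count of words over all the news together. The code is based on a reference from a friend, and I want to change it to improve the result. I want the result to be a count of words for each individual news item, not one count over all the news. Please help me change the code.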

Here is the code:

    import os
    import pandas as pd
    from pandas import DataFrame, read_csv

    data = r'd:/skripsi/sample_200_data.csv'
    df = pd.read_csv(data)

    print "df", type(df['content']), "\n", df['content']
    isiberita = df['content'].tolist()
    print "df list isiberita ", isiberita, type(isiberita)
    df.head()

---------------------------------------------------------

    import nltk
    import string
    import os
    import pandas as pd

    from sklearn.feature_extraction.text import TfidfVectorizer
    from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
    from nltk.corpus import stopwords
    from collections import Counter

    path = 'd:/skripsi/sample_200_data.csv'
    token_dict = {}

    factory = StemmerFactory()
    stemmer = factory.create_stemmer()

    content_stemmed = map(lambda x: stemmer.stem(x), isiberita)
    content_no_punc = map(lambda x: x.lower().translate(None, string.punctuation), content_stemmed)
    content_final = []

    for news in content_no_punc:
        word_token = nltk.word_tokenize(news)  # tokenize each news item (split it into separate words)
        word_token = [word for word in word_token if not word in nltk.corpus.stopwords.words('indonesian') and not word[0].isdigit()]  # remove Indonesian stop words and tokens starting with a digit
        content_final.append(" ".join(word_token))

    counter = Counter()  # initialize the counter
    [counter.update(news.split()) for news in content_final]  # split every news item and count each word
    print(counter.most_common(100))

The result of the code is:

[('indonesia', 202), ('rp', 179), ('jakarta', 160), ('usaha', 149), ('investasi', 136), ('laku', 124), ('ekonomi', 100), ('negara', 86), ('harga', 86), ('industri', 84), ('izin', 84), ('menteri', 83), ('listrik', 79), ('juta', 76), ('pasar', 73), ('tani', 71), ('uang', 71), ('koperasi', 71), ('target', 66), ('perintah', 66), ('saham', 65), ('miliar', 64), ('kerja', 63), ('sektor', 62), ('investor', 61), ('bangun', 60), ('produk', 60), ('pajak', 60), ('capai', 60), ('layan', 58), ('bank', 57), ('produksi', 57), ('modal', 57), ('turun', 57), ('china', 56), ('milik', 55), ('tingkat', 54), ('us', 54), ('triliun', 53), ('tumbuh', 53), ('bkpm', 53), ('impor', 52), ('kembang', 51), ('pt', 49), ('jalan', 49), ('dana', 48), ('bandara', 48), ('negeri', 46), ('rencana', 45), ('nilai', 45), ('temu', 44), ('salah', 42), ('proyek', 41), ('masuk', 41), ('desember', 40), ('langsung', 40), ('hasil', 39), ('butuh', 39), ('rupa', 38), ('biaya', 37), ('kapal', 37), ('rusia', 37), ('franky', 37), ('hadap', 36), ('kredit', 35), ('utama', 35), ('carrefour', 35), ('bijak', 35), ('ikan', 35), ('tanam', 35), ('atur', 34), ('persero', 34), ('kait', 34), ('jam', 34), ('masyarakat', 32), ('gas', 32), ('pakai', 32), ('dagang', 31), ('kondisi', 31), ('transmart', 31), ('lihat', 31), ('bisnis', 31), ('nggak', 31), ('kawasan', 30), ('dorong', 30), ('tutup', 30), ('banding', 30), ('batas', 30), ('terima', 30), ('cepat', 30), ('jasa', 30), ('ton', 29), ('the', 29), ('pln', 29), ('ekspor', 29), ('barel', 29), ('as', 29), ('rumah', 29), ('orang', 28), ('pondok', 28)]

I hope someone can help me change the code so that the result is the count of words in each news item (content), not the count of words over all the news. Thank you.

If I understand correctly, your problem isn't directly related to PySastrawi.

The problem is that you call counter.update() on the same Counter while processing all the news data. In the end this returns the accumulated count of words over all the news. If you want to count the words of each news item separately, you need a separate instance of Counter for each news item. Something like the following (this prints the 100 most common words for each news item):

    for news in content_final:
        counter = Counter(news.split())  # a new Counter for this news item
        print(counter.most_common(100))

Complete demo example:

    >>> from collections import Counter
    >>> content_final = ['foo','foo foo bar','foo baz baz']
    >>> for news in content_final:
    ...     counter = Counter(news.split())
    ...     print(counter.most_common(1))
    ...
    [('foo', 1)]
    [('foo', 2)]
    [('baz', 2)]

See it live: https://eval.in/664688
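If you also want to keep the per-news counts for later use instead of only printing them, one possible approach is to collect one Counter per news item in a list. The sketch below is written for Python 3 and uses a toy `content_final` as a stand-in for your preprocessed data; the names `total` and `per_news` are only illustrative. It also builds the shared, accumulated Counter so the two behaviours can be compared side by side.

```python
from collections import Counter

# Toy stand-in for content_final (assumed already stemmed and filtered).
content_final = ['foo', 'foo foo bar', 'foo baz baz']

# One shared Counter: update() adds to the existing tallies, so the
# result is a single count over all the news (as in the question).
total = Counter()
for news in content_final:
    total.update(news.split())

# One Counter per news item, kept in a list so that per_news[i]
# holds the word counts of content_final[i].
per_news = [Counter(news.split()) for news in content_final]

print(total.most_common(2))
for i, counts in enumerate(per_news):
    print(i, counts.most_common(1))
```

Because the list is in the same order as `content_final`, it should be possible to line each Counter up with the corresponding row of your data afterwards.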

