I have a CSV dataset. A sample of the data is shown here: [image]
I want to preprocess this data. It is text data, for text mining, and I'm confused about stemming. I have tried stemming the data, and the result is a count of words across all of the news. The code is a reference from a friend, and I want to change it to improve the result. I hope to get the count of words for each individual news item, not for all of the news combined. Please help me change the code.
Here is the code:
    import os
    import pandas as pd
    from pandas import DataFrame, read_csv

    data = r'd:/skripsi/sample_200_data.csv'
    df = pd.read_csv(data)
    print "df", type(df['content']), "\n", df['content']
    isiberita = df['content'].tolist()
    print "df list isiberita ", isiberita, type(isiberita)
    df.head()

---------------------------------------------------------

    import nltk
    import string
    import os
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
    from nltk.corpus import stopwords
    from collections import Counter

    path = 'd:/skripsi/sample_200_data.csv'
    token_dict = {}
    factory = StemmerFactory()
    stemmer = factory.create_stemmer()
    content_stemmed = map(lambda x: stemmer.stem(x), isiberita)
    content_no_punc = map(lambda x: x.lower().translate(None, string.punctuation), content_stemmed)
    content_final = []
    for news in content_no_punc:
        # tokenize every news item (split each news item into separate words)
        word_token = nltk.word_tokenize(news)
        # remove Indonesian stop words and numbers
        word_token = [word for word in word_token
                      if not word in nltk.corpus.stopwords.words('indonesian')
                      and not word[0].isdigit()]
        content_final.append(" ".join(word_token))

    counter = Counter()  # initialize the counter
    # split every news item and count each of its words
    [counter.update(news.split()) for news in content_final]
    print(counter.most_common(100))
This is the result of the code:
[('indonesia', 202), ('rp', 179), ('jakarta', 160), ('usaha', 149), ('investasi', 136), ('laku', 124), ('ekonomi', 100), ('negara', 86), ('harga', 86), ('industri', 84), ('izin', 84), ('menteri', 83), ('listrik', 79), ('juta', 76), ('pasar', 73), ('tani', 71), ('uang', 71), ('koperasi', 71), ('target', 66), ('perintah', 66), ('saham', 65), ('miliar', 64), ('kerja', 63), ('sektor', 62), ('investor', 61), ('bangun', 60), ('produk', 60), ('pajak', 60), ('capai', 60), ('layan', 58), ('bank', 57), ('produksi', 57), ('modal', 57), ('turun', 57), ('china', 56), ('milik', 55), ('tingkat', 54), ('us', 54), ('triliun', 53), ('tumbuh', 53), ('bkpm', 53), ('impor', 52), ('kembang', 51), ('pt', 49), ('jalan', 49), ('dana', 48), ('bandara', 48), ('negeri', 46), ('rencana', 45), ('nilai', 45), ('temu', 44), ('salah', 42), ('proyek', 41), ('masuk', 41), ('desember', 40), ('langsung', 40), ('hasil', 39), ('butuh', 39), ('rupa', 38), ('biaya', 37), ('kapal', 37), ('rusia', 37), ('franky', 37), ('hadap', 36), ('kredit', 35), ('utama', 35), ('carrefour', 35), ('bijak', 35), ('ikan', 35), ('tanam', 35), ('atur', 34), ('persero', 34), ('kait', 34), ('jam', 34), ('masyarakat', 32), ('gas', 32), ('pakai', 32), ('dagang', 31), ('kondisi', 31), ('transmart', 31), ('lihat', 31), ('bisnis', 31), ('nggak', 31), ('kawasan', 30), ('dorong', 30), ('tutup', 30), ('banding', 30), ('batas', 30), ('terima', 30), ('cepat', 30), ('jasa', 30), ('ton', 29), ('the', 29), ('pln', 29), ('ekspor', 29), ('barel', 29), ('as', 29), ('rumah', 29), ('orang', 28), ('pondok', 28)]
I hope someone can help me change the code so that the result is the count of words in each individual news item (content), not the count of words across all of the news. Thank you.
If I understand correctly, the problem isn't directly related to PySastrawi.
The problem is that you use counter.update() while processing the news data. In the end, this returns the accumulated count of words across all of the news. If you want to count the words of each news item separately, you need a separate instance of Counter for each news item. Something like the following (this prints the 100 most common words for each news item):
    for news in content_final:
        counter = Counter(news.split())  # a fresh Counter for each news item
        print(counter.most_common(100))
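To see the difference between the two approaches side by side, here is a minimal sketch. The sample strings are made up and only stand in for `content_final`, and the names `accumulated` and `per_news` are hypothetical, not part of the original code.

```python
from collections import Counter

# Made-up sample standing in for content_final (already stemmed and cleaned).
content_final = ['foo', 'foo foo bar', 'foo baz baz']

# One shared Counter: update() adds to the counts it already holds,
# so the totals run across all news items.
accumulated = Counter()
for news in content_final:
    accumulated.update(news.split())

# One fresh Counter per news item: the counts stay separate.
per_news = [Counter(news.split()) for news in content_final]

print(accumulated)
print(per_news)
```

The shared counter ends up with a single total per word, while the list holds one independent count per news item.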
A complete demo example:
    >>> from collections import Counter
    >>> content_final = ['foo','foo foo bar','foo baz baz']
    >>> for news in content_final:
    ...     counter = Counter(news.split())
    ...     print(counter.most_common(1))
    ...
    [('foo', 1)]
    [('foo', 2)]
    [('baz', 2)]
See it live: https://eval.in/664688
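If you need the per-news counts later instead of only printing them, one option is to collect them in a list. This is only a sketch: the helper name `count_words_per_news` and its `top_n` parameter are hypothetical, and the sample strings again stand in for the real `content_final`.

```python
from collections import Counter


def count_words_per_news(contents, top_n=100):
    """Return one list of (word, count) pairs per news item."""
    # A new Counter is built for every item, so nothing is shared between items.
    return [Counter(news.split()).most_common(top_n) for news in contents]


# Made-up sample standing in for content_final.
content_final = ['foo', 'foo foo bar', 'foo baz baz']
for index, top_words in enumerate(count_words_per_news(content_final, top_n=2)):
    print(index, top_words)
```

The position in the returned list matches the position of the news item in the input, so each result can be matched back to its row in the CSV.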