python - How to use `.split()` for importing tab-delimited text from a large `gzip` file? Chunks?


I have a huge gzip file (several GB) of tab-delimited text that I want to parse into a pandas DataFrame.

If the contents of the file were ordinary text, I would use .split(), e.g.

file_text = """abc   123   cat   456   dog   678   bird   111   fish   ... moon   1969    revolution    1789   war   1927   reformation    1517    maxwell   ..."""  data = [line.split() line in file_text.split('\n')] 

and then put the data into a pandas DataFrame using

import pandas as pd
df = pd.DataFrame(data)

However, this isn't a plain text document; it's a tab-delimited file inside a gzip archive, with several GB of data. What is the most efficient way to parse this data into a DataFrame? Should I use .split()?

I guess the first step would be to use

import gzip

with gzip.open(filename, 'rt') as f:  # 'rt' reads the decompressed stream as text
    file_content = f.read()

and then use .split() on file_content, but saving several GB into a single variable and splitting it afterwards seems inefficient. Is it possible to do this in "chunks"?
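To make concrete what I mean by "chunks": here is a minimal sketch of the streaming approach I have in mind, relying on the fact that a gzip file object can be iterated line by line (the filename is a placeholder):

import gzip

# Iterating over the file object yields one decompressed line at a
# time, so the whole file never has to sit in memory at once.
with gzip.open('large_file.tsv.gz', 'rt') as f:  # placeholder filename
    for line in f:
        fields = line.rstrip('\n').split('\t')
        # ... process one row at a time ...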

read_csv() supports gzipped files, so you can do the following:

import pandas as pd

for chunk in pd.read_csv('/path/to/file.csv.gz', sep=r'\s+', chunksize=10**5):
    # process the chunk DataFrame here
    ...

If you are sure you have a TSV (tab-separated file), you can use sep='\t'.
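For instance, a minimal sketch that reads the gzipped TSV in chunks and reassembles the processed pieces into one DataFrame (the path and chunk size are placeholders; compression is inferred from the .gz extension):

import pandas as pd

chunks = []
for chunk in pd.read_csv('/path/to/file.tsv.gz', sep='\t', chunksize=10**5):
    # process each ~100k-row DataFrame here (filter, transform, ...)
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)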

