python - Analyzing a huge dataset from an 80GB file with only 32GB of RAM


I have a huge text file (around 80 GB). The file contains only numbers (integers and floats) and has 20 columns. I have to analyze each column. By analyze I mean doing some basic calculations on each column, such as finding the mean, plotting histograms, checking whether a condition is satisfied, etc. I am reading the file as follows:

    import numpy as np

    with open(filename) as original_file:
        all_rows = [[float(digit) for digit in line.split()] for line in original_file]
    all_rows = np.asarray(all_rows)  # 2-D array: one row per line, one column per field

After that I do the analysis on specific columns. I use a 'good' configuration server/workstation (with 32 GB of RAM) to execute my program. The problem is that I am not able to finish the job. I waited about a day for the program to finish, but it was still running after one day, and I had to kill it manually later on. I know my script is correct and free of errors, because I have tried the same script on smaller files (around 1 GB) and it worked nicely.

My initial guess is that it has a memory problem. Is there any way I can run such a job? Some different method, or some other way?
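The memory guess can be checked with a rough back-of-the-envelope estimate. The sketch below is only illustrative: the function name `estimate_memory_gb` and the figure of about 10 characters per number (including its separator) are assumptions, not something stated in the question.

```python
import struct
import sys


def estimate_memory_gb(file_bytes, chars_per_value):
    """Rough memory needed to hold every value of a text file in RAM.

    chars_per_value is an ASSUMED average width of one number in the
    text file, including the whitespace that separates it from the next.
    """
    n_values = file_bytes / chars_per_value

    float_object = sys.getsizeof(1.0)   # bytes per Python float object
    list_slot = struct.calcsize("P")    # bytes per reference stored in a list

    # List of lists of Python floats (ignores the small per-row list overhead).
    nested_list_gb = n_values * (float_object + list_slot) / 1e9
    # Contiguous float64 array: 8 bytes per value.
    float64_array_gb = n_values * 8 / 1e9
    return nested_list_gb, float64_array_gb


nested_list_gb, float64_array_gb = estimate_memory_gb(80e9, chars_per_value=10)
print(f"nested lists : ~{nested_list_gb:.0f} GB")
print(f"float64 array: ~{float64_array_gb:.0f} GB")
```

Under these assumptions the float64 array alone would need about 64 GB, and the intermediate list of lists several times that on a typical 64-bit CPython build. Both are well above 32 GB, so heavy swapping is a plausible reason the job never finishes.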

I tried splitting the file into smaller files and then analyzed them individually in a loop, as follows:

    import numpy as np
    import matplotlib.pyplot as plt

    pre_name = "split_file"
    for k in range(10):  # there are 10 files of 8 GB each
        filename = pre_name + str(k).zfill(3)  # files are named "split_file000", "split_file001", ...
        with open(filename) as original_file:
            all_rows = [[float(digit) for digit in line.split()] for line in original_file]
        all_rows = np.asarray(all_rows)
        # some analysis here
        plt.hist(all_rows[:, 8], 100)  # plot a histogram of the 9th column
        all_rows = None  # release the array before reading the next file

I have tested the above code on a bunch of smaller files and it works fine. But again, I hit the same problem when I used it on the big files. Any suggestions? Is there any other, cleaner way?
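One option that avoids holding any file in memory is to read line by line and keep only running totals. The following is a minimal sketch, not a drop-in replacement: `stream_column_stats` is a hypothetical helper, and it assumes the histogram range (`lo`, `hi`) is known in advance, for example from a first pass that records the column minimum and maximum.

```python
def stream_column_stats(filenames, column, n_bins, lo, hi):
    """Single pass over the files; memory use is independent of file size.

    Returns the mean of the chosen column and fixed-width histogram counts
    over [lo, hi]. Values outside that range count toward the mean but not
    toward the histogram.
    """
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    total = 0.0
    n = 0
    for name in filenames:
        with open(name) as f:
            for line in f:
                fields = line.split()
                if not fields:
                    continue  # skip blank lines
                x = float(fields[column])
                total += x
                n += 1
                if lo <= x <= hi:
                    # A value equal to hi is placed in the last bin.
                    idx = min(int((x - lo) / width), n_bins - 1)
                    counts[idx] += 1
    mean = total / n if n else float("nan")
    return mean, counts
```

For the files in the question, the call might look like `stream_column_stats(["split_file%03d" % k for k in range(10)], column=8, n_bins=100, lo=..., hi=...)`, with the range filled in from the data. The resulting counts can then be drawn with `plt.bar`. A pure-Python loop like this is slow per line, but it never needs more than a few kilobytes of RAM.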

For such lengthy operations (when the data don't fit in memory), it might be useful to use libraries like dask ( http://dask.pydata.org/en/latest/ ), particularly dask.dataframe.read_csv to read the data, and then perform the operations as you would in the pandas library (another useful package worth mentioning).

