python - Update list of unique items in a dictionary
I am creating a JSON file that holds the unique values of each column of a CSV. Right now I do this by generating a dictionary in which the unique values of each column are stored as a separate entry (the column name being the key).
I have to download a new version of the CSV regularly and update the meta-data JSON. My current plan is to download only the latest update of the CSV (we're using Elasticsearch), read off the unique values from that CSV, update the meta-data JSON, and then concatenate the new and old CSVs.
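The "update the meta-data JSON" step of that plan could look roughly like the sketch below. It assumes each column's entry has the shape used in the code further down (`{'type': 'factor', 'uniq': [...]}` or a `min`/`max` pair); the function names `merge_column_stats` and `merge_metadata` are hypothetical, and the choice to raise when a column changes type is an assumption, not part of the original plan.

```python
def merge_column_stats(old, new):
    """Merge the stats of one column from the new CSV into the existing entry."""
    if old['type'] != new['type']:
        # Assumption: a type change (e.g. a factor growing past the
        # unique-value threshold) is flagged rather than silently resolved.
        raise ValueError('column type changed from %s to %s' % (old['type'], new['type']))
    if old['type'] == 'factor':
        # Union of the old and new unique values.
        return {'type': 'factor', 'uniq': list(set(old['uniq']) | set(new['uniq']))}
    # 'date' and 'continuous' entries only track a range, so widen it.
    return {'type': old['type'],
            'min': min(old['min'], new['min']),
            'max': max(old['max'], new['max'])}


def merge_metadata(old_meta, new_meta):
    """Return a new dict combining the old metadata with stats from the update."""
    merged = dict(old_meta)
    for col, stats in new_meta.items():
        merged[col] = merge_column_stats(merged[col], stats) if col in merged else stats
    return merged
```

With this, only the newly downloaded CSV has to be scanned; the old 10 GB file is never re-read just to refresh the metadata.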
Questions:
- Is there a more efficient way to do this? The old CSV is ~10 GB, with 51M rows and 1400 columns; it takes a day to generate the JSON. Here's my current code:

    import sys
    import json
    import datetime

    import numpy as np
    import pandas as pd

    filename = sys.argv[1]
    json_file = sys.argv[2]


    def get_col_stats(colname, numrows=None):
        # Reads a single column from the CSV, so the file is scanned once per column.
        print('start reading ' + colname)
        df = pd.read_csv(filename, engine='c', usecols=[colname], nrows=numrows)
        print('finished reading ' + colname)
        df.columns = ['col']
        uniq = list(df.col.unique())
        count = len(uniq)
        print('unique count is', count, '\n')
        if colname in ['orderyear', 'faultdate', 'faultactivetime']:
            return {'type': 'date', 'min': df.col.dropna().min(), 'max': df.col.dropna().max()}
        elif count < 1000 or colname == 'faultcode':
            return {'type': 'factor', 'uniq': uniq}
        else:
            return {'type': 'continuous', 'min': df.col.dropna().min(), 'max': df.col.dropna().max()}


    def default(o):
        # json cannot serialise NumPy integers, so convert them to plain ints.
        if isinstance(o, np.integer):
            return int(o)
        raise TypeError


    col_list = list(pd.read_csv(filename, nrows=1).columns)
    print(col_list[1:50])

    d = {}
    for i in col_list:
        d[i] = get_col_stats(i, numrows=None)
        print('made ' + i)

    with open(json_file, 'w') as fp:
        json.dump(d, fp, default=default)

- Is there a better way to update a dictionary of unique values than this:

    dic = {'a': [1,2,3], 'b': [3,4,5]}
    dic['a'].extend([2,3,4])
    dic['a'] = list(set(dic['a']))
    dic

I'm not sure about the first question, since I'm not familiar with pandas. For question 2, it's easier to do:

    dic = {'a': [1,2,3], 'b': [3,4,5]}
    dic['a'] = list(set(dic['a'] + [2,3,4]))
    dic

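If the dictionary is updated often, another option is to keep the values as sets the whole time and convert to lists only when writing the JSON. The following is a minimal sketch of that idea; `add_unique` and `encode_sets` are hypothetical helper names, and sorting on output assumes the values within one key are mutually comparable.

```python
import json


def add_unique(meta, key, values):
    """Add values to the set stored under key, creating the set if needed."""
    meta.setdefault(key, set()).update(values)


def encode_sets(o):
    # json cannot serialise sets, so turn them into lists on the way out.
    # Sorting is only for stable output and assumes comparable values.
    if isinstance(o, set):
        return sorted(o)
    raise TypeError('cannot serialise %r' % type(o))


dic = {'a': {1, 2, 3}, 'b': {3, 4, 5}}
add_unique(dic, 'a', [2, 3, 4])   # duplicates are dropped by the set
add_unique(dic, 'c', [9])         # a missing key gets a new set
text = json.dumps(dic, default=encode_sets)
```

This avoids rebuilding a list and a set on every update, at the cost of needing a `default` hook (like the one in the question) when dumping to JSON.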