Generic scoring module using Python pandas
Hi, I'm trying to develop a generic scoring module for grading students based on a variety of attributes. I'd like to build it as a generic method using Python pandas.
Input: an input data frame with the student ID, UG major, and the attributes to score (I call it df_input), plus a reference data frame that contains the scoring parameters (df_ref).
Process: based on the variable type, run a process that calculates a score for each attribute.
Output: the input data frame with added columns that capture each attribute's score. Example:
df_input

    +------------+-----------+----+------------+-----+------+
    | student_id | ug_major  | c1 | c2         | c3  | c4   |
    +------------+-----------+----+------------+-----+------+
    | 123        | math      |    | 8000-10000 | 12% | 9000 |
    | 234        | all_other | b  | 1500-2000  | 10% | 1500 |
    | 345        | all_other |    | 2800-3000  | 8%  | 2300 |
    | 456        | all_other |    | 8000-10000 | 12% | 3200 |
    | 980        | all_other | c  | 1000-2500  | 15% | 2700 |
    +------------+-----------+----+------------+-----+------+
df_ref

    +---------+---------+---------+
    | ref_col | ref_val | ref_scr |
    +---------+---------+---------+
    | c1      |         | 10      |
    | c1      | b       | 20      |
    | c1      | c       | 30      |
    | c1      | null    | 0       |
    | c1      | missing | 0       |
    | c1      |         | 20      |
    | c1      | b       | 30      |
    | c1      | c       | 40      |
    | c1      | null    | 10      |
    | c1      | missing | 10      |
    | c2      | <1000   | 0       |
    | c2      | >1000   | 20      |
    | c2      | >7000   | 30      |
    | c2      | >9500   | 40      |
    | c2      | missing | 0       |
    | c2      | null    | 0       |
    | c3      | <3%     | 5       |
    | c3      | >3%     | 10      |
    | c3      | >5%     | 100     |
    | c3      | >7%     | 200     |
    | c3      | >10%    | 300     |
    | c3      | null    | 0       |
    | c3      | missing | 0       |
    | c4      | <5000   | 10      |
    | c4      | >5000   | 20      |
    | c4      | >10000  | 30      |
    | c4      | >15000  | 40      |
    +---------+---------+---------+

Required output

    +------------+-----------+----+------------+-----+------+--------+--------+--------+---------+
    | student_id | ug_major  | c1 | c2         | c3  | c4   | c1_scr | c2_scr | c3_scr | tot_scr |
    +------------+-----------+----+------------+-----+------+--------+--------+--------+---------+
    | 123        | math      |    | 8000-10000 | 12% | 9000 |        |        |        |         |
    | 234        | all_other | b  | 1500-2000  | 10% | 1500 |        |        |        |         |
    | 345        | all_other |    | 2800-3000  | 8%  | 2300 |        |        |        |         |
    | 456        | all_other |    | 8000-10000 | 12% | 3200 |        |        |        |         |
    | 980        | all_other | c  | 1000-2500  | 15% | 2700 |        |        |        |         |
    +------------+-----------+----+------------+-----+------+--------+--------+--------+---------+
I'd like to know whether there is an existing function, or anything similar, that has been developed to accomplish this.
Thanks, Pari
If I understand the question correctly, you are trying to store a collection of rules in df_ref that will be applied to df_input to generate scores. While this can be done, you should first make sure the rules are well defined. That will also guide you in writing the corresponding scoring function.
For instance, suppose one of the students gets a value of 10000 in column c2. 10000 is larger than 1000, 7000, and 9500, so three rules match at once and the score is ambiguous. Suppose you want to choose the highest of the scores for a particular column. Then you need a table specifying the choice rule for each column when multiple scores are selected.
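The ambiguity can be made concrete with a small sketch. It assumes the ">" thresholds from the c2 rows of df_ref, a choice rule passed in as a plain function, and a fallback score of 0 when nothing matches. The names `matching_scores` and `resolve` are hypothetical, not part of pandas.

```python
# ">" thresholds copied from the c2 rows of df_ref: (cutoff, score).
C2_THRESHOLDS = [(1000, 20), (7000, 30), (9500, 40)]


def matching_scores(value, thresholds):
    """Return every score whose cutoff the value exceeds."""
    return [score for cutoff, score in thresholds if value > cutoff]


def resolve(value, thresholds, choose=max, default=0):
    """Pick a single score using an explicit choice rule (max by default).

    The default of 0 for "no rule matched" is an assumption.
    """
    matches = matching_scores(value, thresholds)
    return choose(matches) if matches else default


print(matching_scores(10000, C2_THRESHOLDS))  # [20, 30, 40]
print(resolve(10000, C2_THRESHOLDS))          # 40
```

Passing `choose=min` instead would return 20 for the same input, which is why the choice rule needs to be written down per column.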
Second, you should think about what type of Python variable is stored in the 'ref_val' column. If >7000 is a string, you will have to do some work to determine the score. Consider storing 7000 instead, and specifying the comparison operator elsewhere.
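If the rules already exist as strings, one possible way to split them is sketched below. It assumes every threshold rule is a single `<` or `>` followed by a number and an optional percent sign; the helper name `parse_rule` is hypothetical.

```python
import operator
import re

# Map the comparison symbols used in df_ref to functions.
OPS = {"<": operator.lt, ">": operator.gt}


def parse_rule(ref_val):
    """Split a string such as '>7000' or '<3%' into (comparison, number)."""
    match = re.fullmatch(r"\s*([<>])\s*(\d+(?:\.\d+)?)\s*%?\s*", ref_val)
    if match is None:
        raise ValueError(f"not a threshold rule: {ref_val!r}")
    symbol, number = match.groups()
    return OPS[symbol], float(number)


compare, cutoff = parse_rule(">7000")
print(compare(8000, cutoff))  # True
```

Values such as `missing` or `null` raise `ValueError` here, so they would need to be handled before the thresholds are parsed.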
Finally, looking at your current rules, there seems to be a pattern. Each score is associated with null, missing, or a range cutoff. This can be captured as follows:
    import pandas as pd
    import numpy as np
    from itertools import dropwhile

    # Stores the scores for the special values and for the cutoff values.
    sample_range_rule = {
        'missing': 0,
        'null': 0,
        'vals': [(0, 0), (10, 50), (70, 75), (90, 100), (100, 100)],
    }

    # Takes a dict of rules and produces a scoring function.
    def getscoringfunction(range_rule):
        def score(val):
            if val == 'missing':
                return range_rule['missing']
            elif val == 'null':
                return range_rule['null']
            else:
                # Skip cutoffs below val and return the score of the first
                # cutoff >= val; raises StopIteration if val exceeds them all.
                remaining = dropwhile(lambda pair: pair[0] < val, range_rule['vals'])
                return next(remaining)[1]
        return score

    sample_scoring_function = getscoringfunction(sample_range_rule)
    for test_value in ['missing', 'null', 0, 12, 55, 66, 99]:
        print('input', test_value, 'output', sample_scoring_function(test_value))
After you have a dict specifying the rule for every column, you can do the following:

    df['ck_scr'] = df['ck'].apply(getscoringfunction(ck_dct))

Converting a pandas DataFrame with two columns into a dict of this form should not be difficult.
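A minimal sketch of that conversion, assuming the rule frame for one column has a `ref_val` column holding `missing`, `null`, or a plain numeric cutoff, and a `ref_scr` column holding the score. The function name `rule_frame_to_dict` and the sample frame `ck_rules` are hypothetical.

```python
import pandas as pd


def rule_frame_to_dict(rule_df):
    """Convert a two-column rule frame (ref_val, ref_scr) into a rule dict."""
    rule = {"missing": 0, "null": 0, "vals": []}
    for ref_val, ref_scr in zip(rule_df["ref_val"], rule_df["ref_scr"]):
        if ref_val in ("missing", "null"):
            # Special values get their own keys.
            rule[ref_val] = int(ref_scr)
        else:
            # Everything else is assumed to be a numeric cutoff.
            rule["vals"].append((float(ref_val), int(ref_scr)))
    rule["vals"].sort()  # the cutoff scan expects ascending order
    return rule


# Hypothetical rule frame for a single column.
ck_rules = pd.DataFrame({
    "ref_val": ["missing", "null", "0", "10", "70", "90", "100"],
    "ref_scr": [0, 0, 0, 50, 75, 100, 100],
})
ck_dct = rule_frame_to_dict(ck_rules)
print(ck_dct)
```

With a frame that holds rules for several columns, the same function could be called once per group after grouping on the column-name field.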