Generic scoring module using Python pandas


Hi, I'm trying to develop a generic scoring module for grading students based on a variety of attributes. I'm trying to develop a generic method using Python pandas.

Input: an input data frame with the student ID, UG major, and the attributes to score (I called it df_input), plus an input reference data frame that contains the scoring parameters.

Process: based on the variable type, I am developing a process to calculate scores for each attribute.

Output: the input data frame with added columns that capture each attribute score. Example:

df_input

+------------+-----------+----+------------+-----+------+
| student_id | ug_major  | c1 |     c2     | c3  |  c4  |
+------------+-----------+----+------------+-----+------+
|        123 | math      |    | 8000-10000 | 12% | 9000 |
|        234 | all_other | b  | 1500-2000  | 10% | 1500 |
|        345 | all_other |    | 2800-3000  | 8%  | 2300 |
|        456 | all_other |    | 8000-10000 | 12% | 3200 |
|        980 | all_other | c  | 1000-2500  | 15% | 2700 |
+------------+-----------+----+------------+-----+------+

df_ref

+---------+---------+---------+
| ref_col | ref_val | ref_scr |
+---------+---------+---------+
| c1      |         |      10 |
| c1      | b       |      20 |
| c1      | c       |      30 |
| c1      | null    |       0 |
| c1      | missing |       0 |
| c1      |         |      20 |
| c1      | b       |      30 |
| c1      | c       |      40 |
| c1      | null    |      10 |
| c1      | missing |      10 |
| c2      | <1000   |       0 |
| c2      | >1000   |      20 |
| c2      | >7000   |      30 |
| c2      | >9500   |      40 |
| c2      | missing |       0 |
| c2      | null    |       0 |
| c3      | <3%     |       5 |
| c3      | >3%     |      10 |
| c3      | >5%     |     100 |
| c3      | >7%     |     200 |
| c3      | >10%    |     300 |
| c3      | null    |       0 |
| c3      | missing |       0 |
| c4      | <5000   |      10 |
| c4      | >5000   |      20 |
| c4      | >10000  |      30 |
| c4      | >15000  |      40 |
+---------+---------+---------+

req.output

+------------+-----------+----+------------+-----+------+--------+--------+--------+---------+
| student_id | ug_major  | c1 | c2         | c3  | c4   | c1_scr | c2_scr | c3_scr | tot_scr |
+------------+-----------+----+------------+-----+------+--------+--------+--------+---------+
| 123        | math      |    | 8000-10000 | 12% | 9000 |        |        |        |         |
| 234        | all_other | b  | 1500-2000  | 10% | 1500 |        |        |        |         |
| 345        | all_other |    | 2800-3000  | 8%  | 2300 |        |        |        |         |
| 456        | all_other |    | 8000-10000 | 12% | 3200 |        |        |        |         |
| 980        | all_other | c  | 1000-2500  | 15% | 2700 |        |        |        |         |
+------------+-----------+----+------------+-----+------+--------+--------+--------+---------+

I want to see if there is something like a function that could be developed to accomplish this.

Thank you,
Pari

If I understand the question correctly, you are trying to store a collection of rules in df_ref that are applied to df_input to generate scores. While this can be done, you should make sure the rules are well defined. That will guide you in writing the corresponding scoring function.

For instance, suppose one of the students gets a value of 10000 in column c2. 10000 is larger than 1000, 7000, and 9500, so three rules match and the score is ambiguous. Suppose you want to choose the highest of the scores for a particular column. Then you need a table specifying the choice rule for each column when multiple scores are selected.
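A minimal sketch of such a choice rule, assuming each rule is stored as a (comparison, cutoff, score) tuple. The tuple layout and the names `C2_RULES` and `score_with_choice` are made up for illustration; only the cutoffs and scores come from the c2 rows of df_ref.

```python
import operator

# Hypothetical rule table for one column: (comparison, cutoff, score).
# Cutoffs and scores mirror the c2 rows of df_ref; the layout is an assumption.
C2_RULES = [
    (operator.lt, 1000, 0),
    (operator.gt, 1000, 20),
    (operator.gt, 7000, 30),
    (operator.gt, 9500, 40),
]

def score_with_choice(value, rules, choose=max, default=0):
    """Collect the score of every matching rule, then pick one with `choose`."""
    matched = [score for compare, cutoff, score in rules if compare(value, cutoff)]
    # `default` covers values that no rule matches (for example exactly 1000).
    return choose(matched) if matched else default

print(score_with_choice(10000, C2_RULES))       # picks the highest matching score
print(score_with_choice(10000, C2_RULES, min))  # picks the lowest matching score
```

Passing `choose` per column is one way to make the choice rule explicit. Note that a value of exactly 1000 matches neither `<1000` nor `>1000`, which is another gap the rules would need to close.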

Second, you should think about the type of Python variable stored in the 'ref_val' column. If >7000 is a string, you will have to do extra work to determine the score. Consider storing 7000 instead and specifying the comparison operator elsewhere.
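If the rules do stay as strings, the extra work might look roughly like this. It is a sketch that assumes every range rule is a single `<` or `>` followed by a number and an optional percent sign; `parse_rule`, `matches`, and `OPERATORS` are hypothetical names.

```python
import operator
import re

# Maps the operator symbol found in the string to a comparison function.
OPERATORS = {'<': operator.lt, '>': operator.gt}

def parse_rule(ref_val):
    """Split a string such as '>7000' or '>5%' into (symbol, numeric cutoff)."""
    match = re.fullmatch(r'\s*([<>])\s*(\d+(?:\.\d+)?)\s*%?\s*', ref_val)
    if match is None:
        return None  # 'null', 'missing', or a category label such as 'b'
    symbol, number = match.groups()
    return symbol, float(number)

def matches(value, ref_val):
    """Return True when a numeric value satisfies the rule string."""
    parsed = parse_rule(ref_val)
    if parsed is None:
        return False
    symbol, cutoff = parsed
    return OPERATORS[symbol](value, cutoff)

print(parse_rule('>7000'))
print(matches(8000, '>7000'))
```

Storing the cutoff as a number and the operator in its own column would remove the need for the regular expression altogether.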

Finally, looking at the current rules, there seems to be a pattern. Each score is associated with null, missing, or a range cutoff. This can be captured as follows:

    import pandas as pd
    import numpy as np
    from itertools import dropwhile

    # Stores the scores for the special values and for the cutoff values
    sample_range_rule = {
        'missing': 0,
        'null': 0,
        'vals': [
            (0, 0),
            (10, 50),
            (70, 75),
            (90, 100),
            (100, 100),
        ],
    }

    # Takes a dict of rules and produces a scoring function
    def getscoringfunction(range_rule):
        def score(val):
            if val == 'missing':
                return range_rule['missing']
            elif val == 'null':
                return range_rule['null']
            else:
                # Skip cutoffs below val and use the first cutoff >= val.
                # Raises StopIteration if val exceeds the largest cutoff.
                remaining = dropwhile(lambda pair: pair[0] < val,
                                      range_rule['vals'])
                return next(remaining)[1]
        return score

    sample_scoring_function = getscoringfunction(sample_range_rule)

    for test_value in ['missing', 'null', 0, 12, 55, 66, 99]:
        print('input', test_value, end=' ')
        print('output', sample_scoring_function(test_value))

After you have a dict specifying the rule for every column, you can do the following:

    df['ck_scr'] = df['ck'].apply(getscoringfunction(ck_dct))
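Extending that one-liner to every scored column and a total could look like the sketch below. The rule values and the two-row frame are made up for illustration, and `make_scorer` is a compact hypothetical stand-in for the scoring-function factory, redefined here so the example runs on its own.

```python
import pandas as pd

def make_scorer(rule):
    """Hypothetical stand-in scorer: the first cutoff >= value wins."""
    def score(val):
        if val in ('missing', 'null'):
            return rule[val]
        # Assumes the last cutoff is large enough to cover every value.
        return next(s for cutoff, s in rule['vals'] if cutoff >= val)
    return score

# Hypothetical rules, one dict per column to score.
rules = {
    'c3': {'missing': 0, 'null': 0, 'vals': [(3, 5), (5, 10), (100, 100)]},
    'c4': {'missing': 0, 'null': 0,
           'vals': [(5000, 10), (10000, 20), (float('inf'), 30)]},
}

# Hypothetical input with values already converted to numbers.
df = pd.DataFrame({
    'student_id': [123, 234],
    'c3': [12, 'missing'],
    'c4': [9000, 1500],
})

score_cols = []
for col, rule in rules.items():
    df[col + '_scr'] = df[col].apply(make_scorer(rule))
    score_cols.append(col + '_scr')

# Total score is the row-wise sum of the per-column scores.
df['tot_scr'] = df[score_cols].sum(axis=1)
print(df)
```

Strings such as `12%` or `8000-10000` would need to be converted to numbers before this step. How to reduce a range to a single number is a decision the rules have to make.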

Converting a pandas DataFrame with two columns into a dict of this form should not be difficult.
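One possible conversion, assuming the two columns are named `ref_val` and `ref_scr` as in df_ref and that the cutoffs are already stored as numbers rather than strings like `>7000`. The helper name `frame_to_range_rule` and the sample rows are hypothetical.

```python
import pandas as pd

def frame_to_range_rule(rule_frame):
    """Build a {'missing', 'null', 'vals'} dict from ref_val / ref_scr columns."""
    rule = {'missing': 0, 'null': 0, 'vals': []}
    for ref_val, ref_scr in zip(rule_frame['ref_val'], rule_frame['ref_scr']):
        if ref_val in ('missing', 'null'):
            rule[ref_val] = int(ref_scr)
        else:
            rule['vals'].append((float(ref_val), int(ref_scr)))
    # Keep cutoffs in ascending order so a scan finds the first cutoff >= value.
    rule['vals'].sort()
    return rule

# Hypothetical reference rows for a single column.
ck_ref = pd.DataFrame({
    'ref_val': ['missing', 'null', 0, 10, 70, 90, 100],
    'ref_scr': [0, 0, 0, 50, 75, 100, 100],
})

ck_dct = frame_to_range_rule(ck_ref)
print(ck_dct)
```

For a reference table that covers several columns, the same helper could be applied to each group of a `groupby('ref_col')`.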

