Generic scoring module using Python pandas

May 15, 2013

Hi, I'm trying to develop a generic scoring module for grading students based on a variety of attributes. I'd like to build it as a generic method using Python pandas.

Input: an input data frame with the student id, the UG major, and the attributes to be scored (I called it df_input), plus a reference data frame that contains the scoring parameters (df_ref).

Process: based on the variable type, I am developing a process to calculate the score for each attribute.

Output: the input data frame with added columns that capture each attribute score. Example:

df_input

+------------+-----------+----+------------+-----+------+
| student_id | ug_major  | c1 |     c2     | c3  |  c4  |
+------------+-----------+----+------------+-----+------+
|        123 | math      |    | 8000-10000 | 12% | 9000 |
|        234 | all_other | b  | 1500-2000  | 10% | 1500 |
|        345 | all_other |    | 2800-3000  | 8%  | 2300 |
|        456 | all_other |    | 8000-10000 | 12% | 3200 |
|        980 | all_other | c  | 1000-2500  | 15% | 2700 |
+------------+-----------+----+------------+-----+------+

df_ref

+---------+---------+---------+
| ref_col | ref_val | ref_scr |
+---------+---------+---------+
| c1      |         |      10 |
| c1      | b       |      20 |
| c1      | c       |      30 |
| c1      | null    |       0 |
| c1      | missing |       0 |
| c1      |         |      20 |
| c1      | b       |      30 |
| c1      | c       |      40 |
| c1      | null    |      10 |
| c1      | missing |      10 |
| c2      | <1000   |       0 |
| c2      | >1000   |      20 |
| c2      | >7000   |      30 |
| c2      | >9500   |      40 |
| c2      | missing |       0 |
| c2      | null    |       0 |
| c3      | <3%     |       5 |
| c3      | >3%     |      10 |
| c3      | >5%     |     100 |
| c3      | >7%     |     200 |
| c3      | >10%    |     300 |
| c3      | null    |       0 |
| c3      | missing |       0 |
| c4      | <5000   |      10 |
| c4      | >5000   |      20 |
| c4      | >10000  |      30 |
| c4      | >15000  |      40 |
+---------+---------+---------+

req. output

+------------+-----------+----+------------+-----+------+--------+--------+--------+---------+
| student_id | ug_major  | c1 | c2         | c3  | c4   | c1_scr | c2_scr | c3_scr | tot_scr |
+------------+-----------+----+------------+-----+------+--------+--------+--------+---------+
| 123        | math      |    | 8000-10000 | 12% | 9000 |        |        |        |         |
| 234        | all_other | b  | 1500-2000  | 10% | 1500 |        |        |        |         |
| 345        | all_other |    | 2800-3000  | 8%  | 2300 |        |        |        |         |
| 456        | all_other |    | 8000-10000 | 12% | 3200 |        |        |        |         |
| 980        | all_other | c  | 1000-2500  | 15% | 2700 |        |        |        |         |
+------------+-----------+----+------------+-----+------+--------+--------+--------+---------+

I'd like to know whether something like a function has already been developed to accomplish this.

Thanks, Pari

If I understand your question correctly, you are trying to store a collection of rules in df_ref that are applied to df_input to generate scores. While this can be done, you should first make sure that your rules are well defined. That will also guide you in writing the corresponding scoring function.

For instance, suppose one of your students gets a value of 10000 in column c2. 10000 is larger than 1000, 7000 and 9500, so three rules match and the score is ambiguous. Suppose you want to choose the highest of the matching scores for that particular column. Then you need another table specifying, for each column, the choice rule to use when multiple scores are selected.
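One way to make such a choice rule explicit is sketched below. It assumes every rule is a "greater than" cutoff and that the highest matching score wins. The names `c2_cutoffs` and `score_highest_match` are hypothetical and only illustrate the idea; the cutoffs are the ">" rows listed for c2 in df_ref.

```python
# Hypothetical (cutoff, score) pairs taken from the ">" rules of column c2.
c2_cutoffs = [(1000, 20), (7000, 30), (9500, 40)]

def score_highest_match(value, cutoffs, default=0):
    """Return the highest score among all cutoffs that value exceeds."""
    matching = [score for cutoff, score in cutoffs if value > cutoff]
    # max() falls back to `default` when no cutoff matched.
    return max(matching, default=default)

print(score_highest_match(10000, c2_cutoffs))  # 40
print(score_highest_match(500, c2_cutoffs))    # 0
```

A different choice rule (for example, the first match or the lowest score) would only change the final `max` call.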

Second, you should think about the type of Python variable stored in the 'ref_val' column. If >7000 is a string, you will have to do extra work to determine the score. Consider storing the number 7000 instead and specifying the comparison operator elsewhere.
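If the rules do arrive as strings, a small parser can separate the operator from the number. This is a minimal sketch that assumes every rule string starts with a single `<` or `>` and may end in `%`; `COMPARATORS` and `parse_rule` are hypothetical names.

```python
import operator

# Assumed mapping from an operator symbol to the comparison it stands for.
COMPARATORS = {'<': operator.lt, '>': operator.gt}

def parse_rule(ref_val):
    """Split a string such as '>7000' or '<3%' into (comparison, number)."""
    symbol, number = ref_val[0], ref_val[1:].rstrip('%')
    return COMPARATORS[symbol], float(number)

compare, cutoff = parse_rule('>7000')
print(compare(8000, cutoff))  # True
print(compare(6500, cutoff))  # False
```

Storing the operator and the numeric cutoff in separate columns of df_ref would make this parsing step unnecessary.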

Finally, looking at your current rules, there seems to be a pattern. Each score is associated with null, missing, or a range cutoff. This can be captured as follows:

    import pandas as pd
    import numpy as np
    from itertools import dropwhile

    # Stores the scores for the special values and for the cutoff values.
    # 'vals' holds (cutoff, score) pairs sorted by ascending cutoff.
    sample_range_rule = {
        'missing': 0,
        'null': 0,
        'vals': [
            (0, 0),
            (10, 50),
            (70, 75),
            (90, 100),
            (100, 100),
        ],
    }

    # Takes a dict of rules and produces a scoring function.
    def getscoringfunction(range_rule):
        def score(val):
            if val == 'missing':
                return range_rule['missing']
            elif val == 'null':
                return range_rule['null']
            else:
                # Skip cutoffs below val and score the first cutoff >= val.
                # Raises StopIteration if val exceeds the largest cutoff.
                return next(dropwhile(lambda pair: pair[0] < val,
                                      range_rule['vals']))[1]
        return score

    sample_scoring_function = getscoringfunction(sample_range_rule)

    for test_value in ['missing', 'null', 0, 12, 55, 66, 99]:
        print('input', test_value, 'output', sample_scoring_function(test_value))

After you have a dict specifying the rule for every column, you can do the following:

df['ck_scr'] = df['ck'].apply(getscoringfunction(ck_dct))
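The same pattern can be repeated for every attribute column and the results summed into a total. The sketch below is only an illustration: the small frame and the two scoring functions in `scorers` are hypothetical stand-ins, not the rules from df_ref.

```python
import pandas as pd

# Small stand-in for df_input; only two attribute columns are shown.
df = pd.DataFrame({
    'student_id': [123, 234],
    'c1': ['b', 'c'],
    'c4': [9000, 1500],
})

# Hypothetical scoring functions, one per attribute column.
scorers = {
    'c1': lambda val: {'b': 20, 'c': 30}.get(val, 0),
    'c4': lambda val: 20 if val > 5000 else 10,
}

# Add one <column>_scr column per attribute, then sum them into tot_scr.
for col, scorer in scorers.items():
    df[col + '_scr'] = df[col].apply(scorer)
df['tot_scr'] = df[[col + '_scr' for col in scorers]].sum(axis=1)

print(df)
```

In practice each entry of `scorers` would be built from the rule dict for that column.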

Converting a pandas DataFrame with two columns into a dict of this form should not be difficult.
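A possible conversion is sketched here, assuming the reference rows for one column have already been reduced to numeric cutoffs plus the two special values. The frame `df_rule` and the helper `frame_to_rule` are hypothetical names, and the values mirror the sample rule dict rather than df_ref.

```python
import pandas as pd

# Hypothetical reference rows for one column; cutoffs are assumed to be
# numbers already, not strings such as '>7000'.
df_rule = pd.DataFrame({
    'ref_val': ['missing', 'null', 0, 10, 70, 90, 100],
    'ref_scr': [0, 0, 0, 50, 75, 100, 100],
})

def frame_to_rule(frame):
    """Build a rule dict with 'missing', 'null' and sorted 'vals' entries."""
    special = ('missing', 'null')
    pairs = list(zip(frame['ref_val'], frame['ref_scr']))
    rule = {key: int(scr) for key, scr in pairs if key in special}
    # Sort the numeric cutoffs so a scoring function can scan them in order.
    rule['vals'] = sorted(
        (key, int(scr)) for key, scr in pairs if key not in special
    )
    return rule

print(frame_to_rule(df_rule))
```

For the full df_ref, the same helper could be applied to each group of rows sharing a ref_col value.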
