hadoop - Sampling issue with Hive
I have an "all_members" table in Hive with 10M rows and one column, "membership_nbr". I want to sample 3000 rows. Here is what I have done:
hive> create table sample_members as select * from all_members limit 1;
hive> insert overwrite table sample_members select membership_nbr from all_members tablesample(3000 rows);
hive> select count(*) from sample_members;
OK
45000
The result won't change if I replace 3000 rows with 300 rows. What am I doing wrong?
Table sampling with tablesample(3000 rows) does not fetch 3000 rows from the entire table; instead, it fetches 3000 rows from each input split.
So your query might have run with 15 mappers, each fetching 3000 rows, for a total of 3000 * 15 = 45000 rows. Likewise, if you change 3000 rows to 300 rows, you should get 4500 rows in the output after sampling.
So, for your requirement, you have to give tablesample(200 rows). As a result, each mapper will fetch 200 rows, and the 15 mappers together will fetch 3000 sampled rows.
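Since the number of input splits can change between runs, the 200-rows figure only works while the job really uses 15 mappers. Here is a rough sketch of both options, using the table and column names from the question. The 15-split count is an assumption inferred from 45000 / 3000, and the ORDER BY rand() variant is a different technique (a full random sort), not block sampling:

```sql
-- Option 1: per-split sampling.
-- Assumes the job runs with 15 input splits, so 200 rows per split gives 3000 rows.
INSERT OVERWRITE TABLE sample_members
SELECT membership_nbr
FROM all_members TABLESAMPLE(200 ROWS);

-- Option 2: random sort with a hard cap, independent of the split count.
-- This sorts the whole table, so it is slower on 10M rows.
INSERT OVERWRITE TABLE sample_members
SELECT membership_nbr
FROM all_members
ORDER BY rand()
LIMIT 3000;
```

Option 1 is cheap but its row count depends on how many splits the job gets. Option 2 should return 3000 rows as long as the table has at least that many.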
Refer to the link below for the various types of sampling: https://cwiki.apache.org/confluence/display/hive/languagemanual+sampling