R Programming: Identifying Composite Keys
Original question: My job often requires me to profile data on large data sets: cardinality, relationships, uniqueness, and so on. My intent is to use R for profiling the data and creating a report in R Markdown.
My problem is: 1. The data is loaded into R data frames. 2. How do I identify what is called a composite primary key in database lingo?
For example, I have one table with 75,000 records. I use rapply to get unique counts for each variable. However, unless one of the variables has a count of 75,000, there is no single primary key. In other words, there is not one variable that can be used to uniquely identify a single observation.
The objective then becomes looking for a combination of variables (columns) that uniquely identifies each observation (row). This could be two, three, or four variables/columns out of a 160-variable data frame/table. And, of course, there could be duplicates, in which case no combination of keys uniquely identifies each row, or observation.
I have done a 'for' loop (ugly), but I am thinking there must be a more elegant, more efficient way of doing this.
How do I find the variables that constitute the composite primary key?
Modified question:
    ############### data1 - 2 columns - 1 pk
    data1 <- data.frame(rep_len(1, length.out = 10))
    data1$pk <- rep_len(1:10, length.out = 10)
    names(data1) <- c('dupdata', 'pk')
    rownames(data1) <- NULL   # NULL must be upper case in R
    # Count the distinct values in each column
    rapply(data1, function(x) length(unique(x)), how = 'unlist')
    # dupdata      pk
    #       1      10
    length(unique(data1$pk))
    # [1] 10
The next data frame has 3 columns, and 2 columns are required to make a unique observation:
    ############### data2 - 3 columns - 2 column composite pk
    data2 <- data1
    data2$pk <- rep_len(1:2, length.out = 10)
    data2$pk2 <- rep_len(2:6, length.out = 10)
    rapply(data2, function(x) length(unique(x)), how = 'unlist')
    # dupdata      pk     pk2
    #       1       2       5
    length(unique(data2$dupdata))
    # [1] 1
    length(unique(data2$pk))
    # [1] 2
    length(unique(data2$pk2))
    # [1] 5
    nrow(unique(data2[,c(1,2)], nmax = 3))
    # [1] 2
    nrow(unique(data2[,c(1,3)], nmax = 3))
    # [1] 5
    nrow(unique(data2[,c(2,3)], nmax = 3))
    # [1] 10
Lastly, there is one data frame with 4 columns/variables that requires 3 columns to make a unique observation:
    ############### data3 - 4 columns - 3 column composite pk
    data3 <- data1
    data3$pk <- c(0,0,0,0,0,0,0,0,1,1)
    data3$pk2 <- c(0,0,1,1,1,2,2,2,0,0)
    data3$pk3 <- c(1,2,0,1,2,0,1,2,0,1)
    rapply(data3, function(x) length(unique(x)), how = 'unlist')
    # dupdata      pk     pk2     pk3
    #       1       2       3       3
    length(unique(data3$dupdata))
    # [1] 1
    length(unique(data3$pk))
    # [1] 2
    length(unique(data3$pk2))
    # [1] 3
    length(unique(data3$pk3))
    # [1] 3
    nrow(unique(data3[,c(1,2)], nmax = 4))
    # [1] 2
    nrow(unique(data3[,c(1,3)], nmax = 4))
    # [1] 3
    nrow(unique(data3[,c(1,4)], nmax = 4))
    # [1] 3
    nrow(unique(data3[,c(1:4)], nmax = 4))
    # [1] 10
    nrow(unique(data3[,c(2,3)], nmax = 4))
    # [1] 4
    nrow(unique(data3[,c(2,4)], nmax = 4))
    # [1] 5
    nrow(unique(data3[,c(3,4)], nmax = 4))
    # [1] 9
    nrow(unique(data3[,c(2:4)], nmax = 4))
    # [1] 10
The question is: is there a way of determining which columns, combined, constitute a unique instance of a record in a simple, elegant way, without writing an endless loop?
If there is not, what is the best way to write a loop in R that will tell me every combination of columns that, combined, has a unique count equal to the row count of the entire data frame?
Hopefully, this is clearer than mud, and is a simple problem for someone.
Thanks for the help!
Answer:
Unfortunately, no. There are several ways to identify a primary key, whether it is a single or a composite key. However, if there are 10 columns, then in theory all 10 columns could be required to make a unique key. That means you have to check the uniqueness of the first through the 10th column individually, followed by column 1 and column 2, followed by column 1 and column 3, followed by column 1 and column 4, and so on. I think it comes down to checking every non-empty subset of the n columns, which is 2^n - 1 combinations (not n!, because the order of the columns does not affect uniqueness). If you have 10 columns, you potentially have to check the uniqueness of 1,023 combinations. In reality, 2-4 columns is usually the max number of columns in a composite key. However, with a wide table that can still be a large number of checks to verify. In my opinion, it boils down to the modeler knowing the data and verifying assumptions. If you find a better answer, please let me know.
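For what it is worth, the brute-force check can be written fairly compactly. The following is only a minimal sketch, not a definitive solution: find_composite_keys and its max_size argument are hypothetical names that do not come from the original post, and the sketch assumes the data frame fits in memory. It uses base R's combn() to enumerate each column subset once, smallest subsets first, and anyDuplicated() to test uniqueness. Supersets of a key that has already been found are skipped, since they are trivially unique.

```r
# Hypothetical helper (name and arguments are illustrative assumptions):
# returns every minimal column combination, up to max_size columns,
# in which no row of df is repeated.
find_composite_keys <- function(df, max_size = 4) {
  keys <- list()
  for (k in seq_len(min(max_size, ncol(df)))) {
    # combn() yields each k-column subset once; simplify = FALSE returns a list
    for (cols in combn(names(df), k, simplify = FALSE)) {
      # Skip supersets of a key already found (they are unique by definition)
      is_superset <- any(vapply(keys, function(key) all(key %in% cols),
                                logical(1)))
      if (is_superset) next
      # anyDuplicated() returns 0 when no row is repeated
      if (anyDuplicated(df[, cols, drop = FALSE]) == 0) {
        keys[[length(keys) + 1]] <- cols
      }
    }
  }
  keys
}

# Example data with the same values as data3 in the question
data3 <- data.frame(
  dupdata = rep_len(1, length.out = 10),
  pk      = c(0,0,0,0,0,0,0,0,1,1),
  pk2     = c(0,0,1,1,1,2,2,2,0,0),
  pk3     = c(1,2,0,1,2,0,1,2,0,1)
)
keys <- find_composite_keys(data3, max_size = 4)
print(keys)   # the only minimal key here is pk, pk2, pk3

# Number of candidate subsets: all sizes for 10 columns,
# and sizes 1 to 4 only for a 160-column table
n_all_10  <- sum(choose(10, 1:10))
n_upto4   <- sum(choose(160, 1:4))
print(c(n_all_10, n_upto4))
```

Capping the subset size matters on wide tables: for 160 columns, subsets of 1 to 4 columns already number roughly 27 million, so in practice it probably helps to drop obviously unsuitable columns (constants, free text, measures) before searching.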