aggregate - get max value per id, then only value per id R -
i make df smaller taking 1 observation per person per date, based on persons biggest quantity per date.
here's df:
names dates quantity 1 tom 2010-02-01 28 3 tom 2010-03-01 7 2 mary 2010-05-01 30 6 tom 2010-06-01 21 4 john 2010-07-01 45 5 mary 2010-07-01 30 8 mary 2010-07-01 28 11 tom 2010-08-01 28 7 john 2010-09-01 28 10 john 2010-09-01 30 9 john 2010-07-01 45 12 mary 2010-11-01 28 13 john 2010-12-01 7 14 john 2010-12-01 14
i first finding max quantity per person per date. works ok, can see, if person has equal quantities retain same amount of obs per date.
merge(df, aggregate(quantity ~ names+dates, df, max)) names dates quantity 1 john 2010-07-01 45 2 john 2010-07-01 45 3 john 2010-09-01 30 4 john 2010-12-01 14 5 mary 2010-05-01 30 6 mary 2010-07-01 30 7 mary 2010-11-01 28 8 tom 2010-02-01 28 9 tom 2010-03-01 7 10 tom 2010-06-01 21 11 tom 2010-08-01 28
so, next step take first obs per date (given have selected biggest quantity). can't code right this. have tried:
merge(l, aggregate(names ~ dates, l, fun=function(z) z[1]))->m ##doesn't rid of 1 obs john
and data.table option
l[, .sd[1], by=c(names,dates)] ##doesn't work @
i aggregate , data.table options fast , df has ~100k rows.
thank in advance this!
solution
i posted fast - apologies!! easy solution problem find duplicates , remove those. e.g.,;
merge(df, aggregate(quantity ~ names+dates, df, max))->toy toy$dup<-duplicated(toy) toy<-toy[toy$dup!=true,]
here system times
system.time(dt2[, max(new_quan), = list(hai_dispense_number, date_of_claim)]->method1) user system elapsed 20.04 0.04 20.07 system.time(aggregate(new_quan ~ hai_dispense_number+date_of_claim, dt2, max)->rpp) user system elapsed 19.129 0.028 19.148
here's data.table
solution:
dt[, max(quantity), = list(names, dates)]
bench:
n = 1e6 dt = data.table(names = sample(letters, n, t), dates = sample(letters, n, t), quantity = rnorm(n)) df = data.frame(dt) op = function(df) aggregate(quantity ~ names+dates, df, max) eddi = function(dt) dt[, max(quantity), = list(names, dates)] microbenchmark(op(df), eddi(dt), times = 10) #unit: milliseconds # expr min lq median uq max neval # op(df) 2535.241 3025.1485 3195.078 3398.4404 3533.209 10 # eddi(dt) 148.088 162.8073 198.222 220.1217 286.058 10
Comments
Post a Comment