aggregate - get max value per id, then only value per id R -

- September 15, 2011

i make df smaller taking 1 observation per person per date, based on persons biggest quantity per date.

here's df:

names      dates quantity 1    tom 2010-02-01       28 3    tom 2010-03-01        7 2   mary 2010-05-01       30 6    tom 2010-06-01       21 4   john 2010-07-01       45 5   mary 2010-07-01       30 8   mary 2010-07-01       28 11   tom 2010-08-01       28 7   john 2010-09-01       28 10  john 2010-09-01       30 9   john 2010-07-01       45 12  mary 2010-11-01       28 13  john 2010-12-01        7 14  john 2010-12-01       14

i first finding max quantity per person per date. works ok, can see, if person has equal quantities retain same amount of obs per date.

merge(df, aggregate(quantity ~ names+dates, df, max))     names      dates quantity 1   john 2010-07-01       45 2   john 2010-07-01       45 3   john 2010-09-01       30 4   john 2010-12-01       14 5   mary 2010-05-01       30 6   mary 2010-07-01       30 7   mary 2010-11-01       28 8    tom 2010-02-01       28 9    tom 2010-03-01        7 10   tom 2010-06-01       21 11   tom 2010-08-01       28

so, next step take first obs per date (given have selected biggest quantity). can't code right this. have tried:

merge(l, aggregate(names ~ dates, l, fun=function(z) z[1]))->m  ##doesn't rid of 1 obs john

and data.table option

l[, .sd[1], by=c(names,dates)]  ##doesn't work @

i aggregate , data.table options fast , df has ~100k rows.

thank in advance this!

solution

i posted fast - apologies!! easy solution problem find duplicates , remove those. e.g.,;

merge(df, aggregate(quantity ~ names+dates, df, max))->toy toy$dup<-duplicated(toy) toy<-toy[toy$dup!=true,]

here system times

 system.time(dt2[, max(new_quan), = list(hai_dispense_number, date_of_claim)]->method1)    user  system elapsed    20.04    0.04   20.07      system.time(aggregate(new_quan ~ hai_dispense_number+date_of_claim, dt2, max)->rpp)    user  system elapsed   19.129   0.028  19.148

here's data.table solution:

dt[, max(quantity), = list(names, dates)]

bench:

n = 1e6  dt = data.table(names = sample(letters, n, t), dates = sample(letters, n, t), quantity = rnorm(n)) df = data.frame(dt)  op = function(df) aggregate(quantity ~ names+dates, df, max)  eddi = function(dt) dt[, max(quantity), = list(names, dates)]  microbenchmark(op(df), eddi(dt), times = 10) #unit: milliseconds #     expr      min        lq   median        uq      max neval #   op(df) 2535.241 3025.1485 3195.078 3398.4404 3533.209    10 # eddi(dt)  148.088  162.8073  198.222  220.1217  286.058    10

Search This Blog

Sher

aggregate - get max value per id, then only value per id R -

Comments

Post a Comment

Popular posts from this blog

java - How to Configure JAXRS and Spring With Annotations -

visual studio - TFS will not accept changes I've made to a Java project -

php - Create image in codeigniter on the fly -