I have genetic data. It is quite big: 17,000 genetic markers (SNPs) for 700 individuals. Each SNP can be assigned to a founder. I want to calculate the average probability per 'founder segment'. A segment is defined as a part of a chromosome that is assigned to one founder without interruption.
In the example below there are 3 segments.
In the end I want to know the average probability over the SNPs within each segment.
```
chromosome snp founder probability
         1   1       7         0.6
         1   2       7         0.5
         1   3       7         0.7
         1   4       2         0.5
         1   5       2         0.8
         1   6       7         0.6
         1   7       7         0.5
```
I can group with dplyr, but I don't want the first segment of founder 7 to be combined with the other segment of founder 7.
So I want:
```
chromosome snp founder probability average
         1   1       7         0.6    0.60
         1   2       7         0.5    0.60
         1   3       7         0.7    0.60
         1   4       2         0.5    0.65
         1   5       2         0.8    0.65
         1   6       7         0.6    0.55
         1   7       7         0.5    0.55
```
How can I calculate a group mean when the same grouping factor appears several times?
With dplyr, we can compare adjacent elements of 'founder' to create a grouping variable, group by it along with 'chromosome', and get the mean of 'probability'.
```r
library(dplyr)

df1 %>%
  # grp1 goes up by 1 each time 'founder' differs from the previous row
  group_by(chromosome,
           grp1 = cumsum(founder != lag(founder, default = founder[n()]))) %>%
  mutate(average = mean(probability))
#   chromosome   snp founder probability  grp1 average
#        <int> <int>   <int>       <dbl> <int>   <dbl>
# 1          1     1       7         0.6     0    0.60
# 2          1     2       7         0.5     0    0.60
# 3          1     3       7         0.7     0    0.60
# 4          1     4       2         0.5     1    0.65
# 5          1     5       2         0.8     1    0.65
# 6          1     6       7         0.6     2    0.55
# 7          1     7       7         0.5     2    0.55
```
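To make the run-id idea easier to see, here is a minimal base R sketch of the same logic on the example data from the question. The names `new_segment` and `segment` are made up for illustration, and the sketch assumes the rows are already sorted by chromosome and SNP position.

```r
# Example data from the question (one chromosome, three founder segments).
df1 <- data.frame(
  chromosome  = rep(1L, 7),
  snp         = 1:7,
  founder     = c(7L, 7L, 7L, 2L, 2L, 7L, 7L),
  probability = c(0.6, 0.5, 0.7, 0.5, 0.8, 0.6, 0.5)
)

# TRUE wherever a new segment starts: the first row, a change of founder,
# or a change of chromosome (assumes rows are sorted by chromosome and snp).
n <- nrow(df1)
new_segment <- c(TRUE,
                 df1$founder[-1] != df1$founder[-n] |
                   df1$chromosome[-1] != df1$chromosome[-n])

# A running count of segment starts gives each uninterrupted run its own id,
# so the two separate runs of founder 7 end up with different ids.
df1$segment <- cumsum(new_segment)

# ave() returns the group mean repeated for every row of its group.
df1$average <- ave(df1$probability, df1$segment)

print(df1)
```

Because the id is built from change points rather than from the founder value itself, a founder that reappears later on the chromosome starts a new group instead of being merged with its earlier run.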
Or, using data.table, convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'chromosome' and the run-length-type id (rleid) of 'founder', assign (:=) the mean of 'probability' to the 'average' column.
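```r
library(data.table)

# setDT() converts df1 to a data.table by reference;
# rleid(founder) gives each uninterrupted run of 'founder' its own id
setDT(df1)[, average := mean(probability),
           by = .(chromosome, grp1 = rleid(founder))]
```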
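If what you need in the end is one row per segment rather than the mean repeated on every SNP, the same grouping can be used to summarise instead of assign. This is only a sketch on the example data; the column names `segment`, `first_snp`, `last_snp` and `n_snps` are invented for illustration.

```r
library(data.table)

# Example data from the question, built directly as a data.table.
df1 <- data.table(
  chromosome  = rep(1L, 7),
  snp         = 1:7,
  founder     = c(7L, 7L, 7L, 2L, 2L, 7L, 7L),
  probability = c(0.6, 0.5, 0.7, 0.5, 0.8, 0.6, 0.5)
)

# One row per uninterrupted founder run within each chromosome.
segments <- df1[, .(
  founder   = founder[1],        # founder is constant within a run
  first_snp = min(snp),
  last_snp  = max(snp),
  n_snps    = .N,                # number of SNPs in the segment
  average   = mean(probability)
), by = .(chromosome, segment = rleid(founder))]

print(segments)
```

With 17,000 SNPs and 700 individuals you would presumably add an individual id column to `by` as well, but that column is not part of the example, so it is left out here.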