Hello everyone,
I need some help/ideas in calculating average depth per SNP across samples. I have already generated depth files for each sample (around 18000 samples in total) using samtools depth input.bam > output.txt
and I have merged all the output.txt's
into 1 txt.gz. The input file is sorted by position and it looks like this.
sample1 17 123456 1000
sample2 17 123456 500
sample3 17 123456 10
sampleN 17 123456 50
sample1 17 123457 1100
sampleN 17 123457 50
so first column contains sample names second column is chromosome number third column is the snp position and the last column is depth. I am just wondering if there is quick way to get average depth per SNP across samples as my input file is huge (around 38gb)? or if there is an alternate way?
Looking forward to your suggestions and ideas.
Regards,
38gb is really huge to read by any language (unless you employ some database techniques). The best option I see is using bedtools groupby function. Get the script ready with sample data and run it on your original file and take some rest.
yes I have already sorted the file based on chromosome/position. do you recommend splitting per chromosome? even if I do it I think it will still be quite large as I have 18k samples in there.
if its already sorted groupBy will help, except it might take long time. Try it.