average depth across samples

Hello everyone,

I need some help/ideas on calculating the average depth per SNP across samples. I have already generated a depth file for each sample (around 18,000 samples in total) using samtools depth input.bam > output.txt, and I have merged all the output.txt files into a single txt.gz. The merged file is sorted by position and looks like this:

sample1   17   123456   1000
sample2   17   123456   500
sample3   17   123456   10
sampleN   17   123456   50
sample1   17   123457   1100
sampleN   17   123457   50

So the first column contains the sample name, the second column the chromosome, the third column the SNP position, and the last column the depth. Is there a quick way to get the average depth per SNP across samples, given that my input file is huge (around 38 GB)? Or is there an alternative approach?
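In case it helps, the merge was done roughly along these lines (only a sketch of the idea; the loop and the sample-name handling shown here are illustrative, not my literal commands):

for bam in *.bam; do
    sample=$(basename "$bam" .bam)
    # prefix every samtools depth line (chrom, pos, depth) with the sample name
    samtools depth "$bam" | awk -v s="$sample" 'BEGIN{OFS="\t"} {print s, $0}'
done | sort -k2,2 -k3,3n | gzip > merged_depth.txt.gz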

Looking forward to your suggestions and ideas.

Regards,

38 GB is a lot to read in any language (unless you employ some database techniques). The best option I see is the bedtools groupBy function. Get the script working on a small sample of the data, then run it on your original file and take some rest while it runs.
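Something along these lines should do it (a sketch only; merged_depth.txt.gz is a placeholder for your merged file, and groupBy relies on the chromosome/position columns being on consecutive lines, i.e. already sorted):

# decompress first; groupBy groups consecutive lines, so the input must stay sorted by chr/pos
zcat merged_depth.txt.gz > merged_depth.txt

# group on columns 2 (chromosome) and 3 (position), take the mean of column 4 (depth)
bedtools groupby -i merged_depth.txt -g 2,3 -c 4 -o mean > mean_depth_per_snp.txt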

Yes, I have already sorted the file by chromosome/position. Do you recommend splitting it per chromosome? Even if I do, I think each chunk will still be quite large, as I have 18k samples in there.

If it's already sorted, groupBy will help, though it might take a long time. Try it.

You can try with datamash:

$ datamash -g 2,3 mean 4 <test.txt                                                                                             
17  123456  390
17  123457  575

input:

$ cat test.txt                                                                                                                 
sample1 17  123456  1000
sample2 17  123456  500
sample3 17  123456  10
sampleN 17  123456  50
sample1 17  123457  1100
sampleN 17  123457  50

But a 38 GB file is a lot to read.
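That said, datamash streams its input line by line, so you can pipe straight from zcat and never decompress the 38 GB file to disk. A sketch (the file name is a placeholder; -W makes datamash split on any whitespace, drop it if the file is strictly tab-separated; the input must already be sorted by the grouping columns, which yours is):

$ zcat merged_depth.txt.gz | datamash -W -g 2,3 mean 4 > mean_depth_per_snp.txt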

If you want to experiment with gnu-parallel and tsv-utils from eBay, please go through the 'GNU parallel and tsv-summarize' section of https://github.com/eBay/tsv-utils/blob/master/docs/TipsAndTricks.md. The appropriate function is:

parallel tsv-summarize  --group-by 2,3  --mean 4 ::: test.txt
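For the full 38 GB file, the pattern from that page is to chunk the data first and let parallel run one tsv-summarize job per chunk. A sketch, assuming the merged file is sorted by chromosome (file names are illustrative):

# one pass: split the merged file into one file per chromosome (column 2)
zcat merged_depth.txt.gz | awk '{print > ($2 ".depth.txt")}'

# one tsv-summarize job per chromosome file; tsv-summarize holds the group keys in memory,
# so per-chromosome chunks keep that manageable
parallel 'tsv-summarize --group-by 2,3 --mean 4 {} > {.}.mean.txt' ::: *.depth.txt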

First try with a smaller set, and keep a backup (even if the file is huge).

I would do it per chromosome instead of merging ALL the data and then trying to do the calculations. That will bring your 38 GB file down to less than 2 GB per chromosome on average.

If you want to work on your current file: if it's already sorted by chromosome/position, that might ease things with groupBy; otherwise I am not sure you will be able to sort a file of that size in one piece.

Ideally, this should be a merged VCF file instead of a TXT file for making any reliable per-SNP calculations. You may not want to hold information for each and every base pair, especially if those bases are not reliable variants.
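If you do end up with a merged VCF, here is a sketch of getting the mean per-sample depth per site with bcftools query plus awk (it assumes every sample has a FORMAT/DP field; the file name is a placeholder):

# print CHROM, POS, then one DP value per sample, and average the non-missing DPs
bcftools query -f '%CHROM\t%POS[\t%DP]\n' merged.vcf.gz \
  | awk 'BEGIN{OFS="\t"} {s=0; n=0; for(i=3;i<=NF;i++) if($i != ".") {s+=$i; n++} print $1, $2, (n ? s/n : "NA")}'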

"calculating average depth per SNP across samples"

Looks like the file is already sorted appropriately, based on the example above.

sample1   17   123456   1000
sample2   17   123456   500
sample3   17   123456   10
sampleN   17   123456   50

In that case groupBy will help, though it might take a long time.
