Question

Subsampling a taxa abundance matrix

8.9 years ago

bioinfo ▴ 840

Hi

I have a taxonomic abundance matrix from a large number of shotgun metagenomes whose sequencing depth ranges from 5 million to 99 million sequences. Here are test raw abundance data for 4 samples.

Sample_ID total_sequences Escherichia Pseudomonas Bacillus Salmonella   Yersinia  Klebsiella
sample1   13,000,000 8    13   6    13   32    0     28
sample2   60,000,000 31  25   0      0   25   19      0
sample3    5,000,000 0    0   9     51    0     0    40
sample4   99,000,000 27   19  0     0    22   32      0

I want to subsample this raw abundance matrix to 5 million reads per sample and get a new subsampled abundance matrix. I thought of taking the first 5 million reads, or 5 million randomly selected reads (using Heng Li's seqtk), and then rerunning the taxonomic abundance analysis on those reads. But rerunning so many metagenomes is a time-consuming process, so I would like to avoid it. Can I instead calculate a revised taxonomic abundance at 5 million reads for each sample directly from the matrix I already have, using this simple calculation?

revised count = raw count/total sequences * 5,000,000
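
To make the calculation concrete, here is a minimal Python sketch of it. The function name `rescale_count` and the `target_depth` parameter are my own for illustration, and the example values are the first count of sample1 (8) with its total of 13,000,000 sequences. The multiplication is done before the division only to limit floating-point rounding; the result is mathematically the same as the formula above.

```python
def rescale_count(raw_count, total_sequences, target_depth=5_000_000):
    """Proportionally rescale a raw taxon count to a common depth.

    Implements: revised count = raw count / total sequences * target depth.
    (Hypothetical helper name; not taken from any library.)
    """
    if total_sequences <= 0:
        raise ValueError("total_sequences must be positive")
    # Multiply first, then divide, to reduce floating-point rounding error.
    return raw_count * target_depth / total_sequences


# sample1: first count is 8 out of 13,000,000 total sequences
print(round(rescale_count(8, 13_000_000), 2))  # 3.08

# sample3 already has 5,000,000 sequences, so its counts are unchanged
print(rescale_count(9, 5_000_000))  # 9.0
```

Note that this is a proportional rescaling: it gives fractional values rather than integer counts, so it is not the same thing as actually drawing 5 million reads at random.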

subsampling taxa • 1.7k views

updated 2.5 years ago by Ram 44k • written 8.9 years ago by bioinfo ▴ 840