Hi
I have a taxonomic abundance matrix built from a large number of shotgun metagenomes whose sequencing depths range from 5 million to 99 million reads. Here are the raw abundance data for 4 test samples.
Sample_ID total_sequences Escherichia Pseudomonas Bacillus Salmonella Yersinia Klebsiella
sample1 13,000,000 8 13 6 13 32 0 28
sample2 60,000,000 31 25 0 0 25 19 0
sample3 5,000,000 0 0 9 51 0 0 40
sample4 99,000,000 27 19 0 0 22 32 0
I want to subsample this raw abundance matrix to 5 million reads per sample and get a new, subsampled abundance matrix. I thought about taking either the first 5 million reads or 5 million randomly selected reads from each sample using Heng Li's seqtk, and then rerunning the taxonomic profiling on those reads. But rerunning so many metagenomes is a time-consuming process, so I would rather not do that. Can I instead calculate a revised taxonomic abundance at 5 million reads for each sample directly from the matrix I already have, using this simple calculation?
revised count = (raw count / total sequences) * 5,000,000
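
For illustration, here is a minimal Python sketch of the calculation above, applied to sample1 and sample3 from the test data. The function name `scale_counts` and the constant `TARGET_DEPTH` are made up for this example, and the counts are passed as plain lists in the same order as the data rows. Note that this only rescales the counts proportionally; it gives fractional values and is not the same thing as actually drawing 5 million reads at random.

```python
TARGET_DEPTH = 5_000_000  # depth to scale every sample to (hypothetical name)


def scale_counts(raw_counts, total_sequences, target_depth=TARGET_DEPTH):
    """Apply: revised count = raw count / total sequences * target depth."""
    if total_sequences <= 0:
        raise ValueError("total_sequences must be positive")
    factor = target_depth / total_sequences
    # Results are floats; round them only if integer counts are required.
    return [count * factor for count in raw_counts]


# Counts copied from the sample1 and sample3 rows of the test data.
sample1 = scale_counts([8, 13, 6, 13, 32, 0, 28], 13_000_000)
sample3 = scale_counts([0, 0, 9, 51, 0, 0, 40], 5_000_000)

print([round(value, 2) for value in sample1])
print(sample3)
```

A sample that already has 5 million reads (sample3) comes out unchanged, while deeper samples are scaled down, so small raw counts turn into values below 1.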