Hi guys,
I have a rather complex, or maybe to some of you naive, question. I got Illumina paired-end 100 bp sequencing data which targeted human DNA. A notable portion of the reads do not map to the human genome (and this was intended). Anyway, I mapped those onto a custom in-house db of human pathogens, viruses, archaea, etc. This was done for 3 groups (control, mild disease, severe disease), each with several replicates. After looking at the reads, I'm thinking about investigating the differences in species abundance between those three groups. Unfortunately the sequencing depth is not really equal across samples, therefore I need to somehow normalize the data, which is in fact the number of reads assigned to a certain species per sample. I tried the 'usual' way of tpmt (transcripts per million transcripts), but this does not work very well since many organisms of high interest only map with around 100-200 reads. Trying to infer anything statistical is not possible since every count gets so low. I tried lowering the tpmt scaling value to, say, 1000, but I am not sure whether this is actually valid.
Do you guys have any recommendations on how to normalize such (count) data?
Thanks,
phil
Yep, it is RNA-Seq. It's just some data that was sitting around, and I wanted to try things out since we are going to sequence 16S this week. @cwarden45 counts-per-thousand seems pretty good.
edit: thank you!
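For anyone finding this later, here is a minimal sketch of what I mean by counts-per-thousand. It assumes the denominator is the total number of reads assigned to any species in that sample (using total sequenced reads instead would be an equally reasonable choice). The function name `counts_per_k`, the sample names, and the counts are all made up for illustration.

```python
def counts_per_k(counts, scale=1000):
    """Scale raw per-species read counts to counts per `scale` assigned reads.

    `counts` maps species name -> raw read count for ONE sample.
    Assumption: the denominator is the sum of all assigned reads in the sample.
    """
    total = sum(counts.values())
    if total == 0:
        # No assigned reads at all: return zeros instead of dividing by zero.
        return {species: 0.0 for species in counts}
    return {species: n * scale / total for species, n in counts.items()}


# Hypothetical toy data: two samples with a 4x difference in depth.
samples = {
    "control_1": {"species_A": 150, "species_B": 50, "species_C": 800},
    "severe_1": {"species_A": 600, "species_B": 400, "species_C": 3000},
}

normalized = {name: counts_per_k(c) for name, c in samples.items()}

for name, values in normalized.items():
    print(name, values)
```

Note that switching the scale from a million to a thousand only multiplies every value by the same constant, so the relative differences between samples stay the same. It makes the numbers easier to read, but it does not add any statistical power for species with only 100-200 reads.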