I tried searching and did not find a relevant question.
The problem is simple: what are the unmapped reads, and how can they be quantified? Unmapped reads could be contamination, polyA, viral or bacterial sequence, or something else!
I have usually seen around 5% of reads unmapped for DNA and up to 40% for RNA-seq. The numbers are high for ChIP-seq and miRNA-seq as well. Some of this could be due to inefficient mapping or low-quality data. BLASTing all the unmapped reads against NR is one approach, but BLAST is terribly slow.
Either way, I am looking for any resources or papers on this.
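To put a number on the unmapped fraction, here is a minimal sketch that tallies records from SAM text by the FLAG field. It relies on the SAM convention that bit 0x4 marks an unmapped segment and that bits 0x100 and 0x800 mark secondary and supplementary alignments. The function names are made up for illustration, and it assumes the aligner writes unmapped reads into the same SAM stream (some pipelines put them in a separate file, in which case the two files would need to be combined first).

```python
from collections import Counter

# SAM FLAG bits (from the SAM specification)
UNMAPPED = 0x4         # segment is unmapped
SECONDARY = 0x100      # secondary alignment
SUPPLEMENTARY = 0x800  # supplementary alignment


def count_mapping_status(sam_lines):
    """Tally mapped vs. unmapped primary records from SAM text lines.

    Header lines (starting with '@') and blank lines are skipped.
    Secondary and supplementary records are ignored so that a
    multi-mapping read is not counted more than once.
    """
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # column 2 of a SAM record is FLAG
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        counts["unmapped" if flag & UNMAPPED else "mapped"] += 1
    return counts


def unmapped_fraction(counts):
    """Fraction of primary records that are unmapped (0.0 if no records)."""
    total = counts["mapped"] + counts["unmapped"]
    return counts["unmapped"] / total if total else 0.0
```

For paired-end data each mate is its own record, so this counts mates rather than fragments.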
Thanks!
More details
Organism: Human
Data type: DNA, RNA, ChIP-seq (I understand RNA will have more unmapped reads due to splice-junction mapping, etc.)
Reference: hg19, all chromosomes (using TopHat for RNA data)
Preprocessing: none
I guess most of you are listing one step or another, but I was hoping for a comprehensive solution that can be implemented, so that essentially all the sequenced reads are accounted for.
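As a rough illustration of what "accounting for every read" could look like, the sketch below puts each unmapped read into exactly one bucket before anything is sent to BLAST, so the bucket counts always sum to the number of input reads. Everything here is an assumption for illustration: the bucket names, the function names, the thresholds (mean Phred 20, 10% N, 80% A or T, entropy 1.5 bits), and Phred+33 quality encoding. It is not a published method, and only the leftover "needs_alignment" bucket would go on to a slower search against contaminant, viral, or bacterial databases.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (bits) of the base composition of seq."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def classify_read(seq, qual, min_mean_q=20, max_n_frac=0.1,
                  homopolymer_frac=0.8, min_entropy=1.5):
    """Assign one read to exactly one bucket (thresholds are illustrative)."""
    seq = seq.upper()
    if not seq:
        return "empty"
    # Assumes Phred+33 encoded quality strings, as in standard FASTQ.
    mean_q = sum(ord(ch) - 33 for ch in qual) / len(qual)
    if mean_q < min_mean_q:
        return "low_quality"
    if seq.count("N") / len(seq) > max_n_frac:
        return "ambiguous_N"
    if max(seq.count("A"), seq.count("T")) / len(seq) >= homopolymer_frac:
        return "polyA_or_polyT"
    if shannon_entropy(seq) < min_entropy:
        return "low_complexity"
    # Whatever is left is a candidate for BLAST or a contaminant screen.
    return "needs_alignment"


def tally_reads(reads):
    """reads: iterable of (sequence, quality_string) pairs."""
    return Counter(classify_read(seq, qual) for seq, qual in reads)
```

Because every read lands in one bucket, `sum(tally_reads(reads).values())` equals the number of reads, and the expensive step only runs on the remainder.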
What kind of data do you have? I am assuming RNA-seq since you mentioned polyA. What are you aligning against? What you align against will be a factor in how many unmapped reads you get. Do you use hg19 chr1-22, X, Y, M, or do you also include supercontigs? Do you do any preprocessing to remove artifacts before aligning? What is your organism? Your question is missing a lot of important details, and you should edit it to make these points clearer.