Hi,
We have generated a set of RNA-seq samples from blood that were not globin-depleted. I want to align the reads against a set of highly abundant globin genes (HBA1, HBA2, and HBB), identify the percentage of reads mapping to these genes, and exclude those reads from the analysis. After this, I want to extract the unaligned reads, map them to the hg19 genome, and perform quantification to identify the number of genes or transcripts. Please let me know the most appropriate tools and methods for this.
Why not just do the alignment / pseudo-alignment as standard and then check the expression of these genes, as I did here (using transformed, normalised counts):
These are different studies. We knew that the third study had a mixture of globin-depleted and non-globin-depleted samples. The ones in red were those suspected of being non-globin-depleted.
I feel that, by actually aligning, filtering out reads, and then re-aligning, you will be introducing bias into your data.
Edit: to complete my comment: after you do this, you can selectively exclude the raw count data from the globin genes prior to normalisation. There are likely many ways of dealing with this issue, though.
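As a rough illustration of that last step, here is a minimal Python sketch that reports the percentage of raw counts assigned to the globin genes in each sample and then drops those genes before normalisation. The function name `globin_summary`, the nested-dict layout of the count table, and the toy numbers are all assumptions made for the example. In practice the counts would come from whatever quantification tool you used.

```python
# Gene symbols taken from the question; extend the set if other globin
# genes (not mentioned in this thread) matter for your data.
GLOBIN_GENES = {"HBA1", "HBA2", "HBB"}


def globin_summary(counts, globin_genes=GLOBIN_GENES):
    """Summarise and remove globin counts (hypothetical helper).

    counts: dict mapping gene symbol -> dict mapping sample name -> raw count.
    Returns (percent, filtered), where percent maps each sample to the
    percentage of its raw counts assigned to globin genes, and filtered is
    a copy of the count table without the globin genes.
    """
    samples = sorted({s for per_gene in counts.values() for s in per_gene})
    total = {s: 0 for s in samples}
    globin = {s: 0 for s in samples}

    for gene, per_sample in counts.items():
        for sample, n in per_sample.items():
            total[sample] += n
            if gene in globin_genes:
                globin[sample] += n

    # Guard against samples with zero total counts.
    percent = {
        s: (100.0 * globin[s] / total[s] if total[s] else 0.0) for s in samples
    }
    # Drop globin genes from the raw counts, prior to normalisation.
    filtered = {g: dict(v) for g, v in counts.items() if g not in globin_genes}
    return percent, filtered


# Toy raw counts, invented purely for illustration (not real data).
raw_counts = {
    "HBA1": {"S1": 300, "S2": 10},
    "HBA2": {"S1": 200, "S2": 5},
    "HBB": {"S1": 400, "S2": 5},
    "ACTB": {"S1": 50, "S2": 90},
    "GAPDH": {"S1": 50, "S2": 90},
}

percent_globin, filtered_counts = globin_summary(raw_counts)
for sample, pct in percent_globin.items():
    print(f"{sample}: {pct:.1f}% of raw counts in globin genes")
```

The filtered table can then go into your usual normalisation and differential expression workflow, and the per-sample percentages give a quick check of which samples look non-globin-depleted.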
Thank you Kevin. This was helpful.
From the comments provided, I understand that I first need to align my non-globin-depleted samples against the whole hg19 genome build, then perform quantification and obtain the expression of these genes. Could you please point me to any material or publication on this? Thank you.
A good overview of all the steps involved in a typical RNA-seq analysis can be found at Bioconductor, for example Michael Love's tutorial.
Thank you all for the comments. They were helpful.
mohammedtoufiq91: If a specific comment in this thread was helpful in solving your question, let us know and we can move it to an answer so you can accept it.
If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.