Hello everyone,
One of the main analyses in bioinformatics today is RNA-seq data processing, and one of the first steps is to align or map (I will talk about alignment here) reads against a reference genome or transcriptome.
I work on mouse, but note that my question applies to other well-known species too. I retrieve my genome from the GRC.
In this file, I have listed all the entries, which I classified as follows (in mouse) with the help of this documentation (a small classification sketch follows the list):
- "Conventional chromosomes" (chr1-19, chrX, chrY)
- Primary assembly (chr1-19, chrX, chrY + unlocalized sequences (JH584293.1), unplaced sequences (GL456394.1))
- Genome Patches (Fixed patch (KV575232.1), Novel patch (KK082441.1))
- "Unknown from NCBI" (WSB_EIJ_MMCHR11_CTG1)
From your experience, just before alignment, in which cases do you filter out patch sequences, unlocalized sequences, unplaced sequences, or unknown sequences (let's call all of these "non-conventional chromosomes"), and in which cases do you not?
At the end of my RNA-seq data processing, I want to keep as many reads as possible on the "conventional chromosomes" to create a Circos plot.
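
For context, the kind of post-alignment filtering I have in mind is roughly the following, a pysam sketch assuming a coordinate-sorted, indexed BAM; the file names are hypothetical, and the chromosome names should be adjusted to match your BAM header:

```python
# Minimal sketch: after alignment, keep only reads on the conventional
# chromosomes (e.g. as input for a Circos plot). Assumes a coordinate-sorted,
# indexed BAM; file names are hypothetical.
import pysam

# drop the "chr" prefix if your reference uses Ensembl-style names ("1", "X", ...)
conventional = [f"chr{i}" for i in range(1, 20)] + ["chrX", "chrY"]

with pysam.AlignmentFile("sample.sorted.bam", "rb") as bam, \
     pysam.AlignmentFile("sample.conventional.bam", "wb", template=bam) as out:
    for chrom in conventional:
        for read in bam.fetch(chrom):
            out.write(read)
```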
See also this blog post by Heng Li: Which human reference genome to use?
Until now I have used Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz for mapping my mRNA-Seq data for alternative splicing analysis, a file the blog post comments on directly.
Would switching the reference genome in my case lead to higher mapping fidelity? Would these differences affect my analysis?
Thanks for this link; it helps me see that filtering out chromosomes is dangerous for variant calling because of false positives, but it does not help me much with RNA-seq data. Does everyone have their own way of doing it?
Unless you are interested in haplotype-specific expression and/or those other regions, you could just keep the main chromosomes.
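
For instance, here is a minimal sketch of subsetting the reference FASTA to just the main chromosomes before building the aligner index, assuming Ensembl-style headers where the first whitespace-separated token on each ">" line is the sequence name (file names are hypothetical):

```python
# Minimal sketch: keep only the main chromosomes from a reference FASTA
# before building the aligner index. Assumes the sequence name is the first
# whitespace-separated token on each ">" header line (Ensembl style); add the
# "chr" prefix if your FASTA uses UCSC-style names. File names are hypothetical.

keep = {str(i) for i in range(1, 20)} | {"X", "Y", "MT"}   # mouse: 1-19, X, Y, MT

writing = False
with open("Mus_musculus.GRCm38.dna.primary_assembly.fa") as fin, \
     open("Mus_musculus.GRCm38.dna.main_chromosomes.fa", "w") as fout:
    for line in fin:
        if line.startswith(">"):
            name = line[1:].split()[0]
            writing = name in keep
        if writing:
            fout.write(line)
```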
Have you checked to see what fraction of reads map to other categories of sequence? How do you handle multi-mappers in your current protocol?

No, I'm not interested in haplotype-specific expression.
The fraction of reads mapped to these chromosomes is about 70-100 reads per 1,000,000. I know it is not a lot, and I could process my data without them, but I want something very clean.
Multi-mapped reads are allowed in the alignment but will be discarded downstream.
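
Something along these lines can be used to check that fraction, a minimal sketch assuming the four-column output of `samtools idxstats` (reference name, length, mapped read segments, unmapped read segments) saved to a file; the file name is hypothetical:

```python
# Minimal sketch: estimate the fraction of mapped reads landing on
# non-conventional contigs from `samtools idxstats` output
# (name, length, mapped, unmapped per line). File name is hypothetical.

conventional = {str(i) for i in range(1, 20)} | {"X", "Y", "MT"}

mapped_conventional = 0
mapped_other = 0

with open("sample.idxstats.txt") as stats:
    for line in stats:
        name, _length, mapped, _unmapped = line.rstrip("\n").split("\t")
        if name == "*":                   # bucket for unmapped reads, skip
            continue
        if name.startswith("chr"):        # tolerate UCSC-style names
            name = name[3:]
        if name in conventional:
            mapped_conventional += int(mapped)
        else:
            mapped_other += int(mapped)

total = mapped_conventional + mapped_other
print(f"non-conventional: {mapped_other} / {total} mapped reads "
      f"({1e6 * mapped_other / total:.0f} per million)")
```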
For example, I have this read, which also multi-maps on the same chromosome. Either way, I will lose it further along in my analysis because I filter on "conventional" chromosomes. My question was: is it correct to remove these "non-conventional" chromosomes so that this read gets a chance to map to a "conventional" one?
At the very least, if it doesn't align, it will be discarded by the aligner.
My main question was a general one: in what cases is it acceptable to filter out some chromosomes before doing an alignment?
That is an interesting philosophical question. In the grand scheme of things, having ~100 reads (out of a million) map to locations where they should not have mapped is not going to make or break the experiment. There are many assumptions that go into these experiments. With mice, even the strain you think you are working with has been subject to some ambiguity (there have been interesting results with recent NeoGen genotyping chips). Thus even the reference you are using may not be the most appropriate in the first place, but let us not go there.
As long as you are not discarding entire autosomes, sex chromosomes, or the MT, it should be OK to filter out the other unassigned DNA to simplify your life. I believe that is what iGenomes does with the bundles it provides.
I got the main idea, thank you all!