Hi,
I downloaded reference genomes from Ensembl (FASTA format).
But they contain lots of sequences labelled "dna:scaffold": https://github.com/CTLife/TEMP/tree/master/RefGenomes
For example, in Mouse_GRCm38 (mm10), should everything except chromosomes 1-19, MT, X and Y be removed before mapping? https://github.com/CTLife/TEMP/blob/master/RefGenomes/Mouse_GRCm38.p4.txt
Likewise, Human_GRCh38.p5 (hg38) contains 516 sequences: https://github.com/CTLife/TEMP/blob/master/RefGenomes/Human_GRCh38.p5.txt. Apart from chromosomes 1-22, MT, X and Y, should the others (such as CHR_HG2241_PATCH and KI270728.1) be removed before mapping?
OK, thank you. I am using BWA, Bowtie2 and Subread to map ChIP-seq reads. But for RNA-seq reads, do the alternative contigs have to be removed?
What do you think about https://sequencing.qcfail.com/articles/genomic-sequence-not-in-the-genome-assembly-creates-mapping-artefacts/ ? It is a nice explanation of why we might not want to remove those extra sequences until after mapping.
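To make "remove after mapping" concrete, here is a minimal Python sketch that filters a plain-text SAM stream down to a chosen set of reference names. The function name `filter_sam` and the `keep` set are hypothetical, and the approach is only one possible way to do it: it assumes uncompressed SAM text, drops unmapped reads (RNAME `*`) unless `*` is in `keep`, and does not adjust mate fields that point to a removed contig.

```python
def filter_sam(lines, keep):
    """Yield SAM lines restricted to the reference names in `keep`.

    Assumption: `lines` is an iterable of uncompressed SAM text lines.
    """
    for line in lines:
        if line.startswith("@"):
            if line.startswith("@SQ"):
                # @SQ header lines carry the reference name in the SN: tag.
                fields = line.rstrip("\n").split("\t")
                sn = next((f[3:] for f in fields[1:] if f.startswith("SN:")), None)
                if sn not in keep:
                    continue  # drop header entries for removed sequences
            yield line
        else:
            # Column 3 (index 2) of an alignment line is RNAME.
            rname = line.split("\t")[2]
            if rname in keep:
                yield line


# Hypothetical keep-list for mouse: chromosomes 1-19, X, Y and MT.
MOUSE_KEEP = {str(i) for i in range(1, 20)} | {"X", "Y", "MT"}
```

With this, reads are first aligned against the full assembly, and only the alignments on the listed chromosomes are kept for downstream analysis.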
If I understood correctly, that link is about repetitive sequences, not about alternative contigs.
For RNA-seq... it depends. For example, if you want to analyse HLA genes, which are highly diverse, you would need the alternative contigs. I guess most people just ignore alternative contigs because of the increase in complexity.
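If you do decide to drop the extra sequences before mapping, a minimal Python sketch could look like the one below. It assumes Ensembl-style headers where the sequence name is the first token after `>` (for example `>1 dna:chromosome ...` or `>JH584299.1 dna:scaffold ...`) and that the mitochondrial sequence is named `MT`. The names `filter_fasta`, `MOUSE_PRIMARY` and `HUMAN_PRIMARY` are made up for illustration. It filters by name rather than by the `dna:` label, on the assumption that patch sequences may not be labelled as scaffolds.

```python
def filter_fasta(lines, keep):
    """Yield only the FASTA lines of records whose name is in `keep`."""
    keeping = False
    for line in lines:
        if line.startswith(">"):
            # The record name is the first whitespace-delimited token after '>'.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            keeping = name in keep
        if keeping:
            yield line


# Hypothetical keep-lists, following the chromosome sets named in the question.
MOUSE_PRIMARY = {str(i) for i in range(1, 20)} | {"X", "Y", "MT"}
HUMAN_PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}
```

Usage would be along the lines of opening the downloaded FASTA and writing `filter_fasta(infile, MOUSE_PRIMARY)` to a new file with `outfile.writelines(...)`, then building the aligner index from that file.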