Expanding reference genome with alt loci and decoy sequences
1
1
Entering edit mode
2.3 years ago
appropiate ▴ 80

If I wanted to expand a reference genome fasta file with alternate loci and decoy fasta files in order to get better read alignments, would it be enough with merging the files (with, say, cat *.fa > expanded_genonme.fa) before starting the typical mapping workflow (samtools, picard, bwa ...), or there are additional steps to take care of?

read alt reference decoy genome alignment • 1.5k views
ADD COMMENT
1
Entering edit mode
2.3 years ago

Actually, your alignments will be worse when you expand a reference genome with alternate loci, see e.g. this recent comment by ATpoint and this blog post by Heng Li. The latter also highlights some issues with commonly used reference genomes, but the lack of alternate loci is a feature, not a bug ;-)

ADD COMMENT
0
Entering edit mode

Thanks! I have replied to that recent comment since it has more participants. Please feel free to comment there as well!

ADD REPLY
1
Entering edit mode

Neither what ATpoint nor I wrote, contradicts this tutorial in any way. Furthermore, I trust, it is a good one when written by the GATK team.

But in your original post, you stated that you are going to start the typical mapping workflow (samtools, picard, bwa ...), afterwards and alt-ware mapping is generally not considered to be the typical mapping workflow. Most aligners are by default not alternate contig aware (alt-aware) and will thus assign poor mapping quality scores to many reads due to erroneously considering them being multi-mappers.

Nothing wrong with alt contigs, as long as you are using the right tools and it's a deliberate choice.

ADD REPLY
0
Entering edit mode

Thanks! So regarding my original question, could I just merge the fasta files and follow that tutorial?

ADD REPLY
0
Entering edit mode

Yes, technically you can append any sequence to the reference followed by building a new index. If that makes sense or is appropriate is a different question.

ADD REPLY
0
Entering edit mode

My criteria for whether it makes sense would be comparing MAPQ values for alignments when using reference genome vs expanded genome. If there are fewer alignments with MAPQ = 0 in the latter, I would think it makes sense. I am way off here?

ADD REPLY

Login before adding your answer.

Traffic: 2072 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6