SNP-masking before short-read alignment to human genome
2
1
Entering edit mode
9.5 years ago

Hello

I was wondering if somebody has experience with aligning short reads to the reference human genome, and whether masking SNPs improves alignment and decreases reference bias. If masking is advised, how do you choose which SNPs to mask? Would dbSNP with MAF > 1% be a reasonable choice? Do you write your own custom script for this, or are there resources or tools already available?

Thank you for your suggestions

Roberto

RNA-Seq SNP Reference-bias • 3.2k views
ADD COMMENT
0
Entering edit mode
9.5 years ago
h.mon 35k

Instead of masking, this paper suggests creating "individualized" reference genomes and transcriptomes. They provide a software to do so, but I never tested. From their abstract:

We demonstrate in simulated and real data sets that alignment to individualized transcriptomes increases read mapping accuracy, improves estimation of transcript abundance, and enables the direct estimation of allele-specific expression.

edit: reading the instructions on the GitHub page is not very inspiring, the software consists of five scripts and also needs two vcf files beforehand. While it may be easy to run it, I am under the impression it is not.

ADD COMMENT
0
Entering edit mode

Personal Genome constructor can also be used for the same purpose

http://alleleseq.gersteinlab.org/tools.html

ADD REPLY
0
Entering edit mode

Hi Ashutosh, what is your experience with Personal Genome Constructor? Do you see significant improvements over the standard human genome reference?

ADD REPLY
1
Entering edit mode

Hi Roberto, I don't work with human data but I have seen a considerable improvement in the RNA-seq read mapping when aligning reads from genetically divergent mouse strain against the pseudo reference genome (substituting SNPs and Indels, I use my own scripts) in the original reference genome. I worked on allele-specific expression analysis so I had to be pretty careful with the allelic bias in read mapping that could have produced false-positive allelic imbalances. I think the question whether you should generate a pseudo reference genome (substituting SNPs and Indels) depends on how divergent the two genomes are? If I have to align reads from a sub-strain of the reference genome that only differs at say 10-50k variants, then I simply allow for a few more errors (1-2) but I make sure that I am getting unique alignments for reads (My goal is to be able to get unique alignments for reads by allowing an extra or two mismatch) and this has worked for me perfectly so far. I think Brian's input should be more valuable than mine at this point because of his expertise and experience.

ADD REPLY
0
Entering edit mode

Thank you Ashutosh, that is very clear and helpful!

ADD REPLY
0
Entering edit mode
9.5 years ago

Ref bias is primarily caused by reads that are too short and aligners with low sensitivity. Rather than mutating your reference, which can cause a lot of confusion down the road, I recommend that you simply use a highly sensitive aligner to minimize ref bias. Specifically, I recommend BBMap; one of my main goals when developing it was to minimize the ref bias inherent to aligning very short reads with very high error rates (Solid 4 reads), which is the worst-case scenario for ref bias.

ADD COMMENT
0
Entering edit mode

Thanks Brian! How does BBMap perform with spliced/split reads? Also, will higher sensitivity result in more multimappers that will be eventually discarded, or will precision remain high? Upon trimming reads, what min length do you suggest to retain to avoid reference bias?

ADD REPLY
1
Entering edit mode

BBMap is splice-aware, and performs excellently on RNA-seq data. If you are mapping RNAseq data, though, you will need to set the "maxindel" flag appropriately to the intron length of the organism in question; for human/mouse I recommend 200000. If you are mapping across strains/species, BBMap will definitely boost the mapping rate substantially and reduce ref bias compared to an aligner like Tophat, which is not very tolerant of errors.

High sensitivity does not increase the rate of multimapping within a specified set of reads. Whether BBMap classifies a read as uniquely mapped or multimapped depends on the difference in score between the highest-scoring mapping location and next-highest-scoring mapping location. Increasing the sensitivity will increase the number of reads that map; some of the additional reads that map at a higher sensitivity will multimap with low scores to multiple locations, but the reads that had already mapped uniquely at a lower sensitivity will not suddenly become multimappers once the sensitivity is increased.

I do not recommend quality-trimming reads prior to mapping (unless you trim at a very low threshold, like Q=6), as that will reduce sensitivity. I do recommend adapter-trimming, though. As for minimum length - that depends on your initial read length! Normally I discard things that are under 50bp or so. But when specificity is very important (such as when quantifying cross-contamination rates, something I do a lot) I discard anything shorter than about 80% of the initial length; we generally use 150bp reads, so 120bp. Paired reads help a lot with both sensitivity and sensitivity, of course.

BBMap has an ambig=toss flag which will reclassify all reads that multimap as unmapped; I prefer that setting when resequencing to call variations, as it reduces false-positives.

ADD REPLY
0
Entering edit mode

Thanks Brian, pretty cool, I will give BBMap a try then! Thanks also for all your other suggestions, that is pretty useful!

ADD REPLY

Login before adding your answer.

Traffic: 2248 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6