I have two different questions which may sound similar, but I think they are different:
How can I do genome-guided de novo assembly? And how can I align the de novo assembled bam to a particular reference genome?
Explanation of the problem in detail: I have aligned the fastq reads (100 bp, paired-end) from several samples, from different individuals of different populations, to a reference genome using BWA-MEM.
The initial alignment rate was good (about 80-90% of reads aligned). I then did hard filtering on the aligned reads, removing the following (a rough sketch of such commands follows the list):
- reads from regions with coverage above the 97.5th percentile of the coverage distribution
- reads whose mates mapped to a different chromosome
- reads that contain het sites across all the samples
- reads with low mapping quality (mapQ below 30)
- and several other stringent filters
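Roughly, the filtering I have in mind looks like the sketch below (samtools/bedtools/awk are just the tools I reached for; the exact commands and thresholds are illustrative, not a record of what I ran):

```bash
# Sketch only: commands and thresholds are illustrative.
# 1) Per-base depth, then the 97.5th-percentile cutoff of the coverage distribution
samtools depth -a sample.bam > depth.txt
cut -f3 depth.txt | sort -n > cov.sorted.txt
CUT=$(sed -n "$(( $(wc -l < cov.sorted.txt) * 975 / 1000 ))p" cov.sorted.txt)

# 2) BED of positions above the cutoff, merged into intervals
awk -v c="$CUT" '$3 > c {print $1"\t"$2-1"\t"$2}' depth.txt \
  | sort -k1,1 -k2,2n | bedtools merge > high_cov.bed

# 3) Drop reads overlapping those intervals, reads below mapQ 30, and reads
#    whose mate maps to a different chromosome (SAM column 7, RNEXT != "=";
#    note this also drops reads whose mate is unmapped, RNEXT == "*")
bedtools intersect -abam sample.bam -b high_cov.bed -v \
  | samtools view -h -q 30 - \
  | awk '$1 ~ /^@/ || $7 == "="' \
  | samtools view -b -o sample.filtered.bam -
```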
This filtered out about 30-40% of the aligned reads from each sample. Now I am wondering whether I can:
- take these filtered reads (as bam) and convert them back to fastq
- run a de novo assembly on those fastq files (genome-guided or unguided??)
- and merge/align the de novo assembly back to a reference
I think this should be able to take care of paralogous alignments, and identify big InDels and chromosomal changes to a certain extent (roughly the pipeline sketched below). Has anyone tried this? With which software might this be possible?
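For concreteness, here is the pipeline I'm imagining; SPAdes and minimap2 are just example tools I'm considering, not a settled choice:

```bash
# Sketch; tool choices are examples only.
# 1) Name-sort so pairs stay together, then convert back to fastq
samtools sort -n sample.filtered.bam -o filtered.nsort.bam
samtools fastq -1 reads_1.fq -2 reads_2.fq -s singletons.fq filtered.nsort.bam

# 2) De novo assembly; "genome-guided" assemblers mostly target RNA-seq
#    (e.g. Trinity), so an unguided run is shown here
spades.py -1 reads_1.fq -2 reads_2.fq -o spades_out

# 3) Align the assembled contigs (fasta, not bam) back to the reference
minimap2 -ax asm5 reference.fa spades_out/contigs.fasta \
  | samtools sort -o contigs_vs_ref.bam -
samtools index contigs_vs_ref.bam
```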
Thanks,
Why did you do these operations?
Hi @Brian:
In case the reasoning behind the filtering isn't clear, this is why I did it:
These kinds of alignment regions are generally contaminated with paralogous alignments, so I had to filter them. But I don't want to throw the data away. Rather, I want to run a de novo assembly (using all the filtered data), create a long read (a consensus sequence) out of it, and then BLAST that sequence against several different close species to see what's going on.
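Something like this is what I mean (file names are placeholders):

```bash
# Build a BLAST database from a close relative and search the contigs against it
makeblastdb -in close_species.fa -dbtype nucl
blastn -query spades_out/contigs.fasta -db close_species.fa \
       -outfmt 6 -evalue 1e-20 -out contigs_vs_close_species.tsv
```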
I'm not so sure about that. I'd suggest assembling with the raw data, after adapter-trimming, quality-trimming, and error-correction. Though I'd be interested in seeing any evidence you have regarding that point.
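For instance, with BBTools (one option among several; parameters are illustrative and should be tuned):

```bash
# Adapter- and quality-trim, then error-correct, before assembly
bbduk.sh in1=r1.fq in2=r2.fq out1=trimmed_1.fq out2=trimmed_2.fq \
  ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 qtrim=rl trimq=10
tadpole.sh in=trimmed_1.fq in2=trimmed_2.fq out=ec_1.fq out2=ec_2.fq mode=correct
```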
Doesn't extremely high coverage in a particular region indicate misalignment of paralogous reads?
Not necessarily... high coverage might indicate a misassembly, or a structural variation, or simply normal Illumina bias. If you filter out regions above the 97.5th percentile of coverage, you are guaranteed to lose 2.5% of your genome, whether it's correct or not, and whether the coverage is super-high or not. That does not seem to me to be a good strategy. Also, mapQ 30 is an odd filter to apply. Variant callers already take mapQ into account, and 30 is pretty high. I would expect filtering on mapQ 30 to yield extreme reference bias, which makes variant calling much less accurate.
So, what is your strategy for dealing with paralogous alignments? What is your choice for the best filtering strategy after the alignment step?
Also, I am not talking about the human genome here, which is mostly homozygous. Since you have more experience with alignment: how should the filtering strategy change when working with human-like data (primates) vs. other animals vs. plants?
Any suggestions appreciated.
Thanks,
What organism are you working with?
Arabidopsis lyrata.
@Brian: So, any suggestions?
Generally, I would just map and call variants, and use the variant quality score as the primary filter criterion. You may want to apply a mapQ filter, but not that high. And I'm not convinced at all that a blanket high-coverage filter is a good idea. Or at least, not for 2.5% of the genome.
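For example, one standard route (bcftools shown; GATK or similar works too):

```bash
# Call variants and let the caller's QUAL score do the heavy lifting
bcftools mpileup -f reference.fa sample.bam | bcftools call -mv -Oz -o calls.vcf.gz
bcftools index calls.vcf.gz
```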
I find that it's helpful to look at the variants with IGV and Excel to see if there are regions with dense mismatches with low-quality-score variants, or other suspicious features (and yes, super-high coverage is suspicious, particularly when that region contains SNPs at strange allele ratios, like 1:5 var:ref in a diploid). When you spot them you can manually remove everything from that region, or look at the components of the variant quality (such as the strand ratio, average quality score, position of the variants in the read, etc) to see if there is some specific feature you can use to get rid of a lot of false positives automatically. This kind of depends on the data; the universal rules that work well are probably already baked into variant callers as defaults.
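As a sketch of the automatic route (the tag names and cutoffs depend on your caller and data, so treat these as placeholders):

```bash
# Exclude calls with low quality, suspiciously high depth, or low mapping quality
bcftools filter -e 'QUAL<30 || INFO/DP>500 || INFO/MQ<30' calls.vcf.gz \
  -Oz -o calls.filt.vcf.gz
```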
Thanks for the info, Brian. Actually, I have also looked at the data with IGV, and saw that regions with high coverage have densely clustered SNPs, which made me suspect those regions might be ill-aligned. I realize that using the BED file for coverage will remove all the reads that touch those regions (thus losing more than just the high-coverage sites).
I might change my cutoff to something like the top 1% and use allele ratio as another guide.
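For reference, this is how I'd check how much data the region-based removal actually costs (any read merely touching a high-coverage interval is dropped):

```bash
# Total reads vs. reads touching the high-coverage intervals
samtools view -c sample.bam
bedtools intersect -abam sample.bam -b high_cov.bed -u | samtools view -c -
```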
Most of my samples have a similar pattern of coverage, as shown in this question/tutorial. There is no alignment at/around the centromere, so I am wondering whether the aligner forces those reads to align to other, similar regions.
But, thanks for the suggestions!
This seems like an XY problem. What do you really want to do? What is the question you are interested in?
A bam is most commonly the output of mapping, not assembly. And you don't usually (ever?) align a bam to a fasta.
I am actually thinking about assembling the reads that were filtered out, then collapsing the assembled reads into a single read (a consensus sequence), which could be longer; then I'd be able to BLAST that sequence across different species (sketched below).
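Concretely, something like this (seqkit is just one convenient option here; an assembler's contigs already are consensus sequences):

```bash
# Take the longest contig as the query and BLAST it against a close relative
seqkit sort --by-length --reverse spades_out/contigs.fasta \
  | seqkit head -n 1 > longest_contig.fa
blastn -query longest_contig.fa -db close_species.fa -outfmt 6 \
       -out longest_vs_species.tsv
```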