Hi all,
I have a handful of genes of interest that I am trying to identify in a plant transcriptome dataset.
I used bwa mem to map the paired-end transcriptome reads against a reference dataset generated from the genes of interest.
The gene sequences I used as leads are not the same organism as the transcriptome, so I am expecting mismatches. I got pretty good coverage for most of the gene sequences, but the reads seem to cluster in chunks, and there is a lot of soft clipping going on in between the chunks.
I am wondering how to start playing with bwa mem parameters to try and retrieve more of the pieces in between the chunks.
Which parameters are good to modify? Matching score, mismatch penalty, or gap penalties?
I appreciate any suggestions!
Are you mapping the reads to the genome? Because in that case you definitely need a splice-aware aligner such as STAR or HISAT2.
As a rule of thumb: if you don't know what you are doing, stick to the default parameters. Definitely for a tool such as bwa mem the defaults are sensible and most optimal under common conditions. If there is a reason to map to the transcriptome you could also have a look at the "newer" generation alignment-free quantification tools such as kallisto and salmon.
Thanks for the suggestion! I mapped with STAR, and it got rid of the sharp cutoffs and peaks which were probably splice sites:
@WouterDeCoster, my understanding is that the pseudoalignment methods like Kallisto do not work well when the mismatch rate is high (such as this example, where the OP is using reference from a related species). Am I incorrect?
Right, I was probably too quick. It will definitely depend on how different/how many mismatches. Although, from the kallisto website:
And for salmon:
That said, I'm not aware of a study comparing pseudo-alignment to a different species, but can definitely understand it wouldn't be optimal. Nevertheless, even with mismatches you would get matching kmers and therefore correct assignment. For sure your transcriptome database needs to be complete! But since I'm one of those ignorant bastards who only sequences human samples I'm not really familiar with the struggles of 'plant people' as we would call them...
Thanks for the feedback. I was hoping you had a reference for comparative benchmarking, but it looks like I'll have to do the analysis myself.
You could try adjusting the mismatch penalty (-B) and clipping penalty (-L). That said, in this case, I'd suggest trying BBMap, which is a global aligner and does not do any clipping by default and allows low-identity alignments. With the -slow flag, it will allow even more mismatches. Also, if you are mapping to genomic sequence (containing introns) then you need a splice-aware aligner such as BBMap rather than bwa; in that case, the gaps would be explained by introns.