I have short reads that end in a non-genomic sequence. Clipping the reads before mapping is possible but not ideal. The optimal thing for me is to use bowtie2 or STAR and allow soft clipping. This works, but the mapping is worse compared to mapping of pre-clipped reads.
I can't seem to find any option to allow "more" soft clipping. Is there any way I can make STAR/bowtie2 soft clip up to half the read (from the 3' or 5' end)? Or do the mappers already do that? Where can I read about how soft clipping works?
I think it is worth investigating what "the mapping is worse" means in your context. It is suspicious when cutting off the ends of reads leads to "worse" mapping, although, as I said, what the word "worse" means is essential.
The default expectation would be that after clipping, more reads map overall but the number of uniquely mapped reads decreases.
Exactly. Simply trimming the reads would increase the mapped %, because shorter reads are easier to align - however, more of them would be of low mapping quality, or just wrong.
Soft clipping is more sensitive than adapter removal; I think that has been shown pretty reliably.
With STAR version 2.3.0 you can trim, for example, 10 bases from the 3' or 5' end with the options --clip3pNbases 10 and --clip5pNbases 10. Have a look at the manual; it's all there.
I don't know if bowtie2 also has a built-in read trimmer, but if not, there are plenty of other tools available. One example is the FASTX-Toolkit.
-5/--trim5 <int>
Trim <int> bases from high-quality (left) end of each read before alignment (default: 0).
-3/--trim3 <int>
Trim <int> bases from low-quality (right) end of each read before alignment (default: 0).
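To make clear what these fixed-length options do, here is a minimal Python sketch of the same operation on a single read. The function name hard_trim and its arguments are made up for illustration and are not part of bowtie or STAR; the point is only that the number of removed bases is decided before alignment and does not depend on the read's content.

```python
def hard_trim(seq, qual, trim5=0, trim3=0):
    """Remove a fixed number of bases from each end of a read.

    Hypothetical helper mirroring the idea behind -5/--trim5 and
    -3/--trim3: the cut is positional, not alignment-driven.
    """
    if trim5 < 0 or trim3 < 0:
        raise ValueError("trim lengths must be non-negative")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq) - trim3
    # Slicing yields empty strings if the trims cover the whole read.
    return seq[trim5:end], qual[trim5:end]


# Drop the last 4 bases of a 10-base read.
print(hard_trim("ACGTACGTAA", "IIIIIIIIII", trim3=4))
# -> ('ACGTAC', 'IIIIII')
```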
I would like to point out that this is hard clipping you're referring to.
By definition, soft clipping is not done at some defined length - rather, it's simply a modification of the scoring scheme that does not punish for mismatches at the ends of the read.
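To illustrate that point, here is a toy Smith-Waterman local alignment in Python. It is only a sketch of the general idea, not how bowtie2 or STAR are implemented internally, and the scoring values (match=2, mismatch=-3, gap=-5) are arbitrary assumptions. Because the score is never allowed to drop below zero, read ends that would only add mismatches are simply left out of the best-scoring alignment and show up as S in the CIGAR string. For reference, bowtie2 uses this kind of scoring only in its --local mode; its default end-to-end mode does not soft clip.

```python
from itertools import groupby


def local_align(read, ref, match=2, mismatch=-3, gap=-5):
    """Toy local alignment; returns (score, 0-based ref start, CIGAR)."""
    n, m = len(read), len(ref)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if read[i - 1] == ref[j - 1] else mismatch
            # The floor at 0 is what makes the alignment local.
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, bi, bj = H[i][j], i, j

    # Trace back from the best cell until the score reaches zero.
    ops = []
    i, j = bi, bj
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if read[i - 1] == ref[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            ops.append("M")
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            ops.append("I")
            i -= 1
        else:
            ops.append("D")
            j -= 1
    ops.reverse()

    # Read bases outside the aligned segment become soft clips.
    left, right = i, n - bi
    parts = []
    if left:
        parts.append(f"{left}S")
    parts.append("".join(f"{len(list(g))}{op}" for op, g in groupby(ops)))
    if right:
        parts.append(f"{right}S")
    return best, j, "".join(parts)


ref = "GGGGACGTTCAGGCTAACCCC"
read = "ACGTTCAGGCTA" + "TTTTTTTTTT"  # 12 genomic bases + 10 non-genomic
print(local_align(read, ref))
# -> (24, 4, '12M10S')
```

Note that nothing in the code sets a clip length: the 10S falls out of the scoring alone. How much a real aligner will clip is therefore governed by its scoring parameters and its minimum-score or minimum-matched-length thresholds rather than by a dedicated "clip up to N bases" option.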
Thanks for all the answers. If I clip reads before mapping (I remove a certain sequence from the 3' end), more reads map (I only consider uniquely mapped reads) compared to when I don't clip the reads before mapping. I presume this is because sometimes I have to remove even half the read from the 3' end (>40 nucleotides) and soft clipping doesn't consider more than a few nucleotides? I still can't find any documentation on how soft clipping is performed, for either bowtie2 or STAR.
The main problem is that only when aligning the reads to the genome do I see how much clipping is necessary :) so I want to use soft clipping to align the reads. But it seems that soft clipping works only for a few nucleotides? Of course I can pre-clip the unmapped reads nucleotide by nucleotide and then re-map, but I was just thinking there might be a more elegant way...
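One way to check how much the aligner actually soft clipped is to read it off the CIGAR strings in the SAM output. The following is a minimal Python sketch using only the standard library; the function names are made up for illustration. It relies on two facts from the SAM format: soft clips appear as S operations at the ends of the CIGAR (inside any H hard clips), and the CIGAR is written in reference orientation, so for reads mapped to the reverse strand (FLAG bit 0x10) the leftmost clip corresponds to the 3' end of the original read.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar):
    """Return (left, right) soft-clipped lengths in CIGAR order."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    # Hard clips sit outside soft clips; ignore them here.
    core = [o for o in ops if o[1] != "H"]
    left = core[0][0] if core and core[0][1] == "S" else 0
    right = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return left, right


def clip_by_read_end(cigar, flag):
    """Map CIGAR-order clips to the 5' and 3' ends of the original read."""
    left, right = soft_clip_lengths(cigar)
    if flag & 0x10:  # read aligned to the reverse strand
        return {"5p": right, "3p": left}
    return {"5p": left, "3p": right}


print(soft_clip_lengths("45M40S"))       # -> (0, 40)
print(clip_by_read_end("40S45M", 16))    # -> {'5p': 0, '3p': 40}
```

Tabulating the 3' clip lengths over all mapped reads would show whether clips of 40 or more bases ever occur, or whether such reads end up unmapped instead.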