Hi all,
I was wondering what the consensus is on optimal trimming of paired-end RNA-seq reads with bbduk.sh? It has been a great tool in other applications, where it outperformed its competitors; for example, bacterial genome assembly from Nextera short reads clearly worked better with it. However, I now mostly work with various RNA-seq experiments and was wondering if someone has an opinion about the best approach.
So far, I've been using the command below, which I put together from the readme at some point by taking the settings for genomic PE reads and adding the "trimpolya" option:
bbduk.sh in1=${TAG}_1.fastq.gz in2=${TAG}_2.fastq.gz out1=${TAG}_bbduk_1.fastq.gz out2=${TAG}_bbduk_2.fastq.gz ref=$ADAPTERS trimpolya=10 ktrim=r k=23 mink=11 hdist=1 tpe tbo &> $TAG.bbduk.log
Does this look reasonable? Is there anything else to consider here? Maybe Brian could comment?
Thank you in advance, as always.
Thank you, this gives me confidence. The threads thing is very interesting - I thought Java just used all the cores you gave it, at least that was my rookie impression.
Does that mean that if my reads are 2x150 bp, I want to set it to a much higher value, like 70?
Yikes. I meant to say smaller than 1/2 the length of adapter (or any other sequence you are looking to find). Will edit my answer above.
That is possible, but to make sure java does not misbehave I find it safer to explicitly add memory and core allocations. One needs to be careful when running via a job scheduler on a cluster.
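As a rough illustration of explicit allocations, here is a minimal sketch based on the command from the original post. The values (8 threads, 4g of heap), the sample name and the adapter file name are assumptions for the example only, not recommendations; on a cluster they should match whatever you requested from the scheduler. -Xmx and threads= are the usual BBTools options for Java heap size and thread count.

```shell
#!/usr/bin/env bash
# Hypothetical values for illustration; match them to your scheduler request.
TAG=sample              # assumed sample prefix
ADAPTERS=adapters.fa    # assumed adapter FASTA
THREADS=8               # cores requested from the scheduler (assumption)
MEM=4g                  # Java heap size passed as -Xmx (assumption)

# Same trimming options as in the original post, plus explicit
# memory (-Xmx) and thread (threads=) settings.
cmd=(
    bbduk.sh "-Xmx${MEM}" "threads=${THREADS}"
    "in1=${TAG}_1.fastq.gz" "in2=${TAG}_2.fastq.gz"
    "out1=${TAG}_bbduk_1.fastq.gz" "out2=${TAG}_bbduk_2.fastq.gz"
    "ref=${ADAPTERS}" trimpolya=10 ktrim=r k=23 mink=11 hdist=1 tpe tbo
)

# Print the command so it can be checked before anything runs.
printf '%s ' "${cmd[@]}"
printf '\n'

# Run only if bbduk.sh is on PATH and the first input file exists.
if command -v bbduk.sh >/dev/null 2>&1 && [ -f "${TAG}_1.fastq.gz" ]; then
    "${cmd[@]}" &> "${TAG}.bbduk.log"
fi
```

Building the command as an array keeps each option a separate argument, which makes it easy to check what will be submitted before the job goes to the scheduler.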