BBDuk trimming concern: large discarded, too aggressive (?)
1
0
Entering edit mode
8.7 years ago
umn_bist ▴ 390

I am in piping PE RNA-Seq files for pre-processing QC (for somatic variant calling downstream).

I use exclusively BBDuk for adapter/quality trimming and prinseq for poly-A/T trimming.

My current configuration for BBDuk is:

#${BBDuk} in1="${file1}" in2="${file2}" out1="${file3}" out2="${file4}" ref="${adapter}" trimq=10 qtrim=r ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=40 outm=discarded.fastq

My discarded.fastq are close to 600-700 Mb in size and was curious if this is to be expected. My files are fairly high quality RNAseq files from TCGA.

BBDuk RNA-Seq • 4.6k views
ADD COMMENT
0
Entering edit mode

What was your input size?

Try out instead of outm.

ADD REPLY
0
Entering edit mode

@apelin20 My input size has been around 8-15 Gb.

ADD REPLY
1
Entering edit mode
8.7 years ago
GenoMax 147k

Are you sure these reads are ones that are discarded? The outm option on Brian's program catches anything that matched so it is possible that these are reads that were trimmed to some extent but may still be in the trimmed file (in trimmed form). Can you check?
UPDATE: I think I am right. See the original bbduk.sh post by Brian.

The "outm" stream will catch reads that matched a reference kmers.

ADD COMMENT
0
Entering edit mode

Could you explain further

these are reads that were trimmed to some extent but may still be in the trimmed file (in trimmed form)

So these are essentially Kmer contaminants (of min length of 11 bp) that are being catched but remains in the trimmed output file? I've seen on Biostar that Kmers - atleast 7mers seen on FastQC - shouldn't be removed for RNAseq somatic variant calling. Very curious to hear your take on this issue.

If these discarded.fastq files are basically adapters/Kmers/lowquality reads then my command is working as expected.

Thank you very much for your help.

ADD REPLY
2
Entering edit mode

Your command is working correctly. But let me clarify the meaning of the output streams:

"out" is the normal output for things that pass your criteria; the "outm" stream catches anything that will not get sent to "out" because it fails the criteria.

When filtering, "out" gets anything without matching kmers, and "outm" gets anything with matching kmers.

When trimming, "out" gets anything at least minlen in length, and "outm" gets anything shorter than minlen. For adapter trimming, being shorter than minlen implies an adapter kmer match, though it could be caused by quality-trimming as well, since you are doing both.

In other words, "outm" is normally for junk reads that fail your quality-control procedure, but in some cases it's useful. For example: "bbduk.sh in=reads.fq ref=ecoli.fa out=non-ecoli-reads.fq outm=ecoli-reads.fq" would allow you to analyze those read sets separately. But normally, for adapter-trimming, the outm would just be thrown away.

I don't think 700MB of 15GB is particularly bad, though it's hard to say because those reads were trimmed (700MB of very short reads are more reads than 700MB of untrimmed reads), so what's more important is how much good data remained. RNA-seq is not easy to size-select because doing so will bias your results against short isoforms, and so forth. It's nice to lose under 3% or so during adapter-trimming, but that doesn't always happen. You can look at the insert-size distribution with BBMerge:

bbmerge.sh in1=r1.fq in2=r2.fq mininsert=15 ihist=ihist.txt

Reads with insert size shorter than read length have adapters. Reads with insert size under 40 would be discarded according to your settings, after adapter-trimming. It's possible that you had a lot of adapter-dimers (which will not show up in BBMerge since there is no genomic sequence to merge) or that you had a lot of short inserts, in which case you may want to revise your sequencing protocol slightly.

The warning about removing over-represented kmers seen in FastQC is slightly tangential to this. They may be caused by adapters, and may not. Your BBDuk command uses 23-mers for matching in the main portion of the read, which is safe. It only uses 11-mers for the last 11 bases of the read; trimming those would not cause the read to be discarded anyway because it would still be longer than 40bp.

ADD REPLY
0
Entering edit mode

This was really insightful, thank you. I'll have to fastQC my trimmed outputs - I am cautious that I lost >5% of my reads (after adapter and quality trimming).

I definitely will check my insert-size distribution. Is there a means to have BBMerge auto-relay avg insert-size for each of my trim commands?

Thank you again, Brian.

ADD REPLY
0
Entering edit mode

BBDuk internally uses BBMerge to trim adapters based on read insert size if you use the "tbo" flag, which you did. But it doesn't actually report the insert sizes (though the read lengths of the adapter-trimmed reads will be equal to their insert sizes). The insert sizes will not change as a result of these commands.

I'm not quite sure what you mean by auto-relay, but JGI runs BBMerge on all pair libraries to quantify the insert-size distribution and make sure the library meets QC parameters.

ADD REPLY

Login before adding your answer.

Traffic: 1953 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6