Question

How to choose Trimmomatic's 'MINLEN' parameter?

1

7.5 years ago

Hughie ▴ 80

Hello! I recently got a dataset to practice RNA-seq analysis. After running a quality check with FastQC, I found that some of my data have poor quality at the tail. So I want to trim the tails using Trimmomatic, but I am having trouble with these parameters:

"MINLEN": All of my 52 samples have read lengths of about 43~57 bases, so I don't know how to choose a suitable minimum length (maybe 1/3 of the original length, I guess) to keep my downstream mapping rate at a reasonable level. I would appreciate any advice!
"SLIDINGWINDOW:4:15": I have tried quality thresholds from 15 to 30. What is your strategy for choosing this parameter?
"LEADING": Is it necessary to trim the first few leading bases every time?

I am a beginner in RNA-seq, so thanks a ton for your suggestions!

RNA-Seq next-gen sequence • 8.7k views

ADD COMMENT • link updated 7.5 years ago by Ian 6.1k • written 7.5 years ago by Hughie ▴ 80

score 4 · Answer 1 · 2017-06-13

4

7.5 years ago

Ian 6.1k

The lowest I select is MINLEN:35, as this was the read length of Illumina sequencing for a long time. The important thing is to ensure the base quality is good, especially at the 3' end. I usually use SLIDINGWINDOW:4:20. I personally do not use LEADING, as leading-base quality has never been a problem for me. Also make sure you run FastQC, which will highlight any major issues, in particular the quality at either end of the reads.
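To make the interaction between SLIDINGWINDOW:4:20 and MINLEN:35 concrete, here is a minimal Python sketch of the idea: scan from the 5' end, cut the read where the mean quality of a 4-base window first drops below 20, then discard the read if fewer than 35 bases remain. This is a simplified illustration, not Trimmomatic's actual implementation, and the function names `sliding_window_trim` and `passes_minlen` are made up for this example.

```python
def sliding_window_trim(quals, window=4, min_avg=20):
    """Return the number of leading bases to keep.

    Simplified model (an assumption, not Trimmomatic's real code):
    scan 5' -> 3' and cut at the start of the first window whose
    mean Phred quality is below min_avg.
    """
    for start in range(0, max(len(quals) - window + 1, 0)):
        win = quals[start:start + window]
        if sum(win) / window < min_avg:
            return start
    # No failing window found: keep the whole read.
    return len(quals)


def passes_minlen(kept_length, minlen=35):
    """Mimic a MINLEN-style filter: keep reads with at least minlen bases."""
    return kept_length >= minlen


# Hypothetical 50-base read: 40 good bases, then a poor-quality tail.
read_a = [38] * 40 + [12] * 10
# Hypothetical 50-base read whose quality collapses earlier.
read_b = [38] * 30 + [12] * 20

for name, quals in (("read_a", read_a), ("read_b", read_b)):
    kept = sliding_window_trim(quals)
    print(name, "kept bases:", kept, "passes MINLEN:35:", passes_minlen(kept))
```

In this toy model the first read keeps 39 bases and survives MINLEN:35, while the second keeps 29 bases and is dropped. A stricter quality threshold shortens more reads, so it pushes more of them under whatever MINLEN you pick.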

ADD COMMENT • link 7.5 years ago by Ian 6.1k

1

Ian, thank you very much! But I'm afraid that if I set MINLEN:35, quite a lot of reads will be lost. Below are the read-length counts collected from one of my trimmed .fastq files:

Read_length  Read_num
10           2026
11           2334
12           1709
13           1421
14           1251
15           1362
16           1622
17           1812
18           2106
19           2318
20           2909
21           3959
22           5490
23           6103
24           7793
25           8843
26           9223
27           9572
28           9729
29           9917
30           10000
31           10046
32           10357
33           10222
34           10395
35           10451
36           10573
37           10762
38           11541
39           13131
40           26277
41           94748
42           156007
43           334605
44           330133
45           295795
46           337757
47           245022
48           132765
49           1915847

As you can see, the maximum length is 49, and there are many reads with lengths beyond 35. So, would it be OK if I chose 10 as the threshold value? Thanks again for your reply!

ADD REPLY • link 7.5 years ago by Hughie ▴ 80

1

I would not go down to 10bp. Remember, the shorter the read, the greater the chance of a false positive match. You could go to 25bp if you are desperate. However, looking at the number of reads per length, I think the majority of reads are >35bp anyway. Below 35bp, the counts per length are only in the thousands.
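To put numbers on that, the short Python sketch below takes the read-length counts posted earlier in the thread and computes the fraction of reads that would survive a given minimum-length cutoff. It assumes the posted counts are complete for that one file; the helper name `fraction_kept` is made up for this example.

```python
# Read-length counts copied from the post above (one trimmed FASTQ file).
length_counts = {
    10: 2026, 11: 2334, 12: 1709, 13: 1421, 14: 1251,
    15: 1362, 16: 1622, 17: 1812, 18: 2106, 19: 2318,
    20: 2909, 21: 3959, 22: 5490, 23: 6103, 24: 7793,
    25: 8843, 26: 9223, 27: 9572, 28: 9729, 29: 9917,
    30: 10000, 31: 10046, 32: 10357, 33: 10222, 34: 10395,
    35: 10451, 36: 10573, 37: 10762, 38: 11541, 39: 13131,
    40: 26277, 41: 94748, 42: 156007, 43: 334605, 44: 330133,
    45: 295795, 46: 337757, 47: 245022, 48: 132765, 49: 1915847,
}


def fraction_kept(counts, minlen):
    """Fraction of reads whose length is at least minlen."""
    total = sum(counts.values())
    kept = sum(n for length, n in counts.items() if length >= minlen)
    return kept / total


for minlen in (10, 25, 35):
    share = fraction_kept(length_counts, minlen)
    print(f"MINLEN:{minlen} keeps {share:.1%} of reads")
```

For this file, a cutoff of 35 keeps roughly 96.5% of the reads and a cutoff of 25 keeps roughly 98.9%, so the stricter cutoff costs only a few percent of the data.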