What minimum read length should be retained in a standard RNA-seq dataset?
8.5 years ago
firestar ★ 1.6k

I have a standard RNA-seq dataset (125 bp PE Illumina) from a model organism. I am doing only adapter trimming and no quality trimming, since the quality is excellent all the way through. The trimming software has an option to set the minimum read length to keep. I was wondering what a good length would be, and why.

My thoughts are along these lines.

Set min length around 10-12: Would it help to keep short non-coding RNAs, if there are any? I use ribosome depletion and not polyA capture.

Set min length around 60: Might reduce mapping time and potentially reduce multi-mapping of very short reads.

Set min length close to max length, i.e. around 100 to 120: Depending on the read length distribution after trimming, I could potentially lose a lot of reads. Would it help downstream DGE analysis to keep the read length distribution in a tighter range?

I could be wrong about all of these, so feel free to correct me. Suggestions for a good min length are also welcome.
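To make the trade-off between these cutoffs concrete, here is a minimal sketch that tabulates read lengths from a trimmed FASTQ and reports the fraction of reads each candidate cutoff would keep. The function names and the toy reads are made up for illustration; they are not from the dataset or from any trimming tool.

```python
from collections import Counter

def length_histogram(fastq_lines):
    """Count read lengths from an iterable of FASTQ lines (4 lines per record)."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the sequence is the second line of each record
            counts[len(line.strip())] += 1
    return counts

def fraction_retained(counts, min_len):
    """Fraction of reads whose length is at least min_len."""
    total = sum(counts.values())
    kept = sum(n for length, n in counts.items() if length >= min_len)
    return kept / total if total else 0.0

# Toy trimmed reads (hypothetical data, only to show the calculation)
toy_fastq = [
    "@r1", "A" * 125, "+", "I" * 125,
    "@r2", "A" * 80, "+", "I" * 80,
    "@r3", "A" * 30, "+", "I" * 30,
    "@r4", "A" * 11, "+", "I" * 11,
]

hist = length_histogram(toy_fastq)
for cutoff in (10, 60, 100):
    print(cutoff, fraction_retained(hist, cutoff))
# 10 1.0
# 60 0.5
# 100 0.25
```

On real data, the same table shows directly how many reads a cutoff of 60 or 100 would cost, which is probably more useful than guessing from the plot.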

[Figure: FastQC plots]

RNA-Seq • 7.0k views

Shouldn't 125bp PE Illumina give you exactly 125-base reads every time? Anything less is a technical artifact or chemical problem. When my fragments are shorter than the read length, the read continues into adapter sequence, which must be trimmed, ultimately resulting in shorter reads, but that has nothing to do with filtering by size up front. EDIT: your first figure shows this phenomenon, where some reads have adapter sequence at the end. Trimming everything to 80 would take care of that, at the cost of losing good data from the majority of the reads. Depending on your usage, this may or may not be acceptable.


Actually, all my raw reads are 126bp. The read length distribution I mentioned was after trimming. Trimming off the 3' adapter results in reads of varying lengths. I can then choose to discard reads below a certain length. My question is what this length should be.


Looking at the first figure, it would seem that the adapters start at base 78, so after trimming I should have reads of at least 78 bases. But in reality I get all sorts of lengths. The plot probably samples a small subset of reads. Also, the y-axis is in %, and 1-2% is hard to see; even 1% of 30 million reads is quite a few. Perhaps I should also mention that I am not doing a hard clip at any position. The software compares the adapter sequences I provide against each read and removes only the part that matches, which is why the resulting lengths are variable.
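To illustrate why match-based trimming gives variable lengths, here is a rough sketch. It is not the algorithm of the trimming software used here: it does exact matching only, whereas real trimmers also tolerate mismatches. The function name, the `min_overlap` parameter, the example adapter string and the toy reads are all assumptions for illustration.

```python
def trim_3prime_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter: a full occurrence anywhere in the read, or a
    partial occurrence (a prefix of the adapter) at the very end of the read.
    Exact matching only; this is a simplification."""
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Check whether the read ends with a prefix of the adapter, longest first
    for k in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read

adapter = "AGATCGGAAGAGC"  # example adapter sequence, used only for illustration
reads = [
    "ACGTACGTAC" + adapter + "TTTT",  # short insert: full adapter inside the read
    "ACGTACGTACGGTT" + "AGATC",       # read runs into the first 5 adapter bases
    "ACGTACGTACGT",                   # no adapter at all
]

trimmed = [trim_3prime_adapter(r, adapter) for r in reads]
print([len(t) for t in trimmed])  # [10, 14, 12]

# The minimum-length option is then just a filter applied after trimming
min_len = 12
kept = [t for t in trimmed if len(t) >= min_len]
print(len(kept))  # 2
```

Each read is cut wherever its own adapter match starts, so the length after trimming depends on the insert size of that fragment and not on a fixed position.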

8.5 years ago
igor 13k

Forget trimming. Many aligners can do their own soft clipping. Just align and then you can use mapping quality to filter out low confidence reads.

On a side note, setting min length as low as 10 will not have much effect. You should not have such short fragments with Illumina sequencing.
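As a rough sketch of the mapping-quality filter, the snippet below reads SAM text lines and keeps only mapped reads at or above a MAPQ threshold. It relies only on the SAM column layout (FLAG is column 2, MAPQ is column 5, flag bit 0x4 means unmapped). The function name, the threshold of 30 and the toy records are assumptions; MAPQ scales differ between aligners, so the cutoff has to be chosen for the aligner in use.

```python
def filter_by_mapq(sam_lines, min_mapq=30):
    """Yield header lines and mapped alignments with MAPQ >= min_mapq."""
    for line in sam_lines:
        if line.startswith("@"):  # header lines pass through unchanged
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & 0x4:  # read is unmapped
            continue
        if mapq >= min_mapq:
            yield line

# Toy SAM records (hypothetical, 11 mandatory columns each)
seq, qual = "A" * 50, "I" * 50
toy_sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    f"r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t{seq}\t{qual}",     # confident hit
    f"r2\t0\tchr1\t200\t3\t40M10S\t*\t0\t0\t{seq}\t{qual}",   # soft-clipped, low MAPQ
    f"r3\t4\t*\t0\t0\t*\t*\t0\t0\t{seq}\t{qual}",             # unmapped
]

kept = list(filter_by_mapq(toy_sam))
print([l.split("\t")[0] for l in kept if not l.startswith("@")])  # ['r1']
```

In practice the same filter is usually applied with an existing tool on the BAM file; the sketch only shows the logic.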

8.5 years ago
ATRX ★ 1.1k

Although you mentioned that the quality of the reads looks excellent, I would like to know the following:

  1. What is the sequence duplication level? You can get it from the FastQC report.
  2. What are the overrepresented sequences, and what are their lengths?
  3. What is the level of adapter contamination in your samples, and which adapters are overrepresented?
  4. What information does the k-mer content plot give you?

The answers to the questions above will give you the answer to your question.

Update: I second karl.stamm's idea. Going by the first plot, it should be ~78. However, I would recommend running the pipeline with and without trimming and checking whether the count/RPKM values of the genes change. Some aligners, such as STAR, do soft clipping by default (I think TopHat does not do soft clipping).
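A minimal sketch of that with/without comparison, assuming you already have per-gene counts from both runs as simple gene-to-count mappings. The function name, the pseudocount, the threshold of one log2 unit and the toy counts are all made up for illustration.

```python
import math

def log2_ratio(counts_a, counts_b, pseudocount=1.0):
    """Per-gene log2 ratio of counts_a over counts_b.
    A pseudocount is added so that zero counts do not break the ratio."""
    genes = sorted(set(counts_a) | set(counts_b))
    return {
        g: math.log2((counts_a.get(g, 0) + pseudocount)
                     / (counts_b.get(g, 0) + pseudocount))
        for g in genes
    }

# Toy counts (hypothetical): same sample, with and without trimming
trimmed = {"geneA": 99, "geneB": 7, "geneC": 0}
untrimmed = {"geneA": 99, "geneB": 3, "geneC": 0}

ratios = log2_ratio(trimmed, untrimmed)
# Genes whose counts change by at least 2-fold between the two runs
flagged = [g for g, r in ratios.items() if abs(r) >= 1.0]
print(flagged)  # ['geneB']
```

If almost no genes are flagged on real data, the trimming settings, including the minimum length, probably do not matter much for the downstream analysis.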
