Question

Trimming RNA-Seq Data

6.7 years ago

williamsbrian5064 ▴ 530

Hi,

I have a couple of questions about some RNA-Seq data. A colleague recently received some RNA-Seq data and we are attempting to trim it. We are having some issues with the "Per base sequence content" and the "Sequence Length Distribution" modules. I attached some pictures of the FastQC output. I think the lengths of the reads are kind of strange: they range from 20-68 bp long. Does that seem a bit strange?

Would trimming the data to the same size help with the sequence length distribution? Or does it not matter that we have reads of different sizes?

Would trimming maybe 7 bases from the start of the reads and maybe 3 bases from the end help with the per base sequence content?

I just want to make sure we have our data trimmed like we should before we proceed. Do these flags matter for RNA-Seq data?

[Figures: FastQC "Sequence Length Distribution" and "Per base sequence content" plots]

RNA-Seq Assembly next-gen sequencing • 3.6k views

updated 6.7 years ago by GenoMax 148k • written 6.7 years ago by williamsbrian5064 ▴ 530

score 3 · Accepted Answer · 2018-04-20

6.7 years ago

GenoMax 148k

They range from 20-68 bp long. Does that seem a bit strange?

No. The data is likely already trimmed (check the file name). You could scan it again, since there may be some additional extraneous sequence left that could be removed.

Would trimming maybe 7 bases from the start of the reads and maybe 3 bases from the end help with the per base sequence content?

While that distribution does look a bit more extreme than normal RNA-Seq data, I would recommend aligning before doing any additional manipulations. If the percentage of reads that align turns out to be a problem, then further trimming would be the thing to look at. See this blog for more information.

I did remove some universal Illumina adapters. Some of the samples had a crazy amount of universal adapters, something like 10% of the data. So it won't matter that the sequences range from 20-68 bp? Won't the 20 bp reads be more likely to map to the incorrect region?

6.7 years ago by williamsbrian5064 ▴ 530

Won't the 20 bp reads be more likely to map to the incorrect region?

Likely. If you want to enforce a minimum length on the reads, you can filter them using reformat.sh from the BBMap suite like this (replace NN with the number you want).

reformat.sh in=your.fq.gz out=longer_than_NN.fq.gz minlen=NN

If you have paired-end data then do

reformat.sh in1=your_R1.fq.gz in2=your_R2.fq.gz out1=longer_than_NN_R1.fq.gz out2=longer_than_NN_R2.fq.gz minlen=NN
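For readers who want to see what a minimum-length filter does, here is a minimal Python sketch for single-end data. It is not how reformat.sh is implemented; it is just an illustration. It assumes a gzipped FASTQ file with exactly four lines per record (no wrapped sequences) and keeps reads that are at least `min_len` bases long, which is my understanding of how `minlen` behaves, so check the BBTools documentation before relying on that. The function names `read_fastq` and `filter_min_length` are made up for this example.

```python
import gzip
from typing import Iterator, TextIO, Tuple

# One FASTQ record: header, sequence, separator line, quality string.
Record = Tuple[str, str, str, str]


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield 4-line FASTQ records. Assumes sequences are not line-wrapped."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def filter_min_length(in_path: str, out_path: str, min_len: int) -> Tuple[int, int]:
    """Copy reads with len(sequence) >= min_len; return (kept, dropped)."""
    kept = dropped = 0
    with gzip.open(in_path, "rt") as fin, gzip.open(out_path, "wt") as fout:
        for header, seq, plus, qual in read_fastq(fin):
            if len(seq) >= min_len:
                fout.write(f"{header}\n{seq}\n{plus}\n{qual}\n")
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

A call such as `filter_min_length("your.fq.gz", "longer_than_NN.fq.gz", NN)` would then return the number of reads kept and dropped. For paired-end data this sketch is not enough, because both mates of a pair have to be kept or dropped together; the paired reformat.sh command above is the better option there.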

6.7 years ago by GenoMax 148k

Do you have a suggested length? I think the majority of the reads will be around 64-67 bp long. Should I remove the remaining reads?

6.7 years ago by williamsbrian5064 ▴ 530
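The thread does not settle on a specific cutoff. One way to inform the choice is to tabulate the read lengths and see what fraction of reads would survive a few candidate cutoffs. Below is a rough Python sketch of that idea. The helper names `length_histogram` and `fraction_retained` are made up for illustration, and the reads are toy sequences, not real data.

```python
from collections import Counter
from typing import Iterable


def length_histogram(sequences: Iterable[str]) -> Counter:
    """Count how many reads have each length."""
    return Counter(len(seq) for seq in sequences)


def fraction_retained(hist: Counter, min_len: int) -> float:
    """Fraction of reads whose length is at least min_len."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    kept = sum(n for length, n in hist.items() if length >= min_len)
    return kept / total


# Toy reads, invented for this example (lengths 20, 35, 64, 66, 67).
reads = ["A" * 20, "A" * 35, "A" * 64, "A" * 66, "A" * 67]
hist = length_histogram(reads)
for cutoff in (20, 30, 50):
    print(cutoff, fraction_retained(hist, cutoff))
```

If most reads really are 64-67 bp long, a table like this would show how little data is lost at a given cutoff, which makes the trade-off against mis-mapped short reads easier to judge.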

Would trimming maybe 7 bases from the start of the reads and maybe 3 bases from the end help with the per base sequence content?

I just wanted to add that, as GenoMax said, although the distribution looks a little more extreme than usual RNA-Seq data, it is characteristic of Illumina RNA-Seq data, for example, to show such bias in the first 12 nucleotides of the reads. This bias is introduced during library preparation, and trimming the first few bases will therefore not eliminate it, since (as far as I know) it comes from site selection preferences during that process.
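To look at this bias directly rather than through the FastQC plot, one could compute the base composition at each read position and compare the first 12 or so positions with the rest of the read. The following is a minimal Python sketch of that calculation; it is not FastQC's implementation, and the function name `per_base_composition` is made up for this example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional


def per_base_composition(
    sequences: Iterable[str], max_pos: Optional[int] = None
) -> List[Dict[str, float]]:
    """Return, for each read position, the fraction of A, C, G, T and N.

    Reads may have different lengths; each position is normalised by the
    number of reads that actually reach that position.
    """
    counts: List[Counter] = []
    for seq in sequences:
        for i, base in enumerate(seq[:max_pos]):
            if i == len(counts):  # first read to reach this position
                counts.append(Counter())
            counts[i][base.upper()] += 1
    composition = []
    for position_counts in counts:
        total = sum(position_counts.values())
        composition.append({b: position_counts[b] / total for b in "ACGTN"})
    return composition


# Toy reads, invented for this example.
toy_reads = ["ACGT", "AAGT", "AC"]
for pos, fractions in enumerate(per_base_composition(toy_reads), start=1):
    print(pos, fractions)
```

On real data, roughly flat fractions after the first dozen positions combined with uneven fractions at the start would be consistent with the library-preparation bias described above, and not with something that trimming a few bases would fix.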