Question

Trimming RNA Seq data - Invalid nucleotide sequence

8.7 years ago

curious ▴ 50

Hi,

I downloaded lots of SRA files (ChIP-seq, RNA-seq, DNase-seq, etc.) from the Roadmap project. I'm converting them to FASTQ format (fastq-dump with the --split-files option) and then doing some preprocessing to keep them consistent.

Since the read lengths coming out of these experiments differ, I'm trimming the reads to 36 bp using fastx_trimmer. This works fine for FASTQs from ChIP-seq SRAs. However, the FASTQ from RNA-seq (ABI SOLiD platform) has this format (first 8 lines):

@SRR179594.1 mendel_20110320_FRAG_BC_Ryan_RNA_Seq_2_58_404_F3 length=50
T.11.0223.0120.1020110202.0.0010.0.20.0201.2.021021
+SRR179594.1 mendel_20110320_FRAG_BC_Ryan_RNA_Seq_2_58_404_F3 length=50
!!@B!@A;B!BB:B!BB=A/%>(/%!A!.6%A!/!%'!%5.%!)!/()%-%
@SRR179594.2 mendel_20110320_FRAG_BC_Ryan_RNA_Seq_2_58_408_F3 length=50
T.20.3101.000021200002230.2.0312.0.13.0313.0.220003
+SRR179594.2 mendel_20110320_FRAG_BC_Ryan_RNA_Seq_2_58_408_F3 length=50
!!>B!<B:>!@@*?3-;%A9?A%'+!B!51,A!=!<'!:'.:!(!)-'*>5

Using fastx_trimmer on this to keep the first 36bp is throwing an error:

fastx_trimmer: found invalid nucleotide sequence (T.11.0223.0120.1020110202.0.0010.0.20.0201.2.021021) on line 2

Understandably, this is because the reads are not in the usual ACGTN alphabet. How should I go about trimming these RNA-seq reads?

RNA-Seq fastq sequence • 2.9k views

updated 8.7 years ago by mastal511 ★ 2.1k • written 8.7 years ago by curious ▴ 50
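One way to get 36-color reads without leaving colorspace is to slice the records directly. The following is a minimal sketch, assuming the layout of the records shown above: the sequence line is one primer base followed by the colors, and the quality line has the same length. `trim_csfastq` is a hypothetical helper, not part of the FASTX-Toolkit or the SRA Toolkit.

```python
from itertools import islice

def trim_csfastq(lines, n_colors=36):
    """Yield csfastq lines with each read cut to the primer base plus n_colors colors."""
    keep = 1 + n_colors  # the leading primer base (e.g. "T") is not a color
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            break  # end of input (or a truncated final record)
        header, seq, plus, qual = record
        # Assumption: the quality string carries one character per sequence
        # character, including a placeholder for the primer base.
        yield from (header, seq[:keep], plus, qual[:keep])

# First record from the question
record = [
    "@SRR179594.1 mendel_20110320_FRAG_BC_Ryan_RNA_Seq_2_58_404_F3 length=50",
    "T.11.0223.0120.1020110202.0.0010.0.20.0201.2.021021",
    "+SRR179594.1 mendel_20110320_FRAG_BC_Ryan_RNA_Seq_2_58_404_F3 length=50",
    "!!@B!@A;B!BB:B!BB=A/%>(/%!A!.6%A!/!%'!%5.%!)!/()%-%",
]
trimmed = list(trim_csfastq(record))
print("\n".join(trimmed))
```

Note that the sketch leaves the `length=50` field in the header untouched and keeps the dots as they are; whether a downstream tool accepts this layout has to be checked for that tool.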

I guess you need to use abi-dump instead of fastq-dump if the data is from ABI. I have never used it myself, so this is just a thought.

8.7 years ago by GouthamAtla 12k

It is likely that you need to run the trimmer with the -Q33 option.

8.7 years ago by Antonio R. Franco ★ 5.2k

Thanks for that, I didn't realize the SRA Toolkit had support for ABI-specific files.

8.7 years ago by curious ▴ 50

This is not actually useful. abi-dump will extract your sequences into separate FASTA and quality files, but these eventually have to be joined again into a single FASTQ file for use in many applications. The dots, which mean that the quality of the base call was too poor, will remain the same.

8.7 years ago by Antonio R. Franco ★ 5.2k
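The re-joining step mentioned above could be sketched as follows. This is only an illustration under stated assumptions: the quality file is assumed to hold whitespace-separated integers, one per color, with -1 for a missing call, and the output mirrors the fastq-dump layout in the question (Phred+33, with a placeholder quality for the primer base). The function `to_csfastq`, the read name, and the quality values are hypothetical.

```python
def to_csfastq(name, colors, quals, offset=33):
    """Build one csfastq record from a csfasta read and its quality values.

    colors: primer base plus color calls, e.g. "T.11.0"
    quals:  whitespace-separated integers, one per color call (assumed format)
    """
    # Assumption: missing calls are stored as -1; clamp them to quality 0.
    values = [max(int(q), 0) for q in quals.split()]
    if len(values) != len(colors) - 1:
        raise ValueError("expected one quality value per color call")
    # Placeholder quality for the primer base, then one character per color.
    qual_string = chr(offset) + "".join(chr(v + offset) for v in values)
    return f"@{name}\n{colors}\n+\n{qual_string}"

# Hypothetical read: primer base T followed by five color calls
example = to_csfastq("read_1", "T.11.0", "-1 31 33 -1 31")
print(example)
```

As the reply points out, the dots survive this step unchanged; the sketch only restores the single-file layout.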

score 0 · Answer 1 · 2016-03-14

8.7 years ago

mastal511 ★ 2.1k

You need to convert the reads from SOLiD colorspace to basespace (ACGTN).

Actually, I think a better idea is to use the SRA Toolkit

http://www.ncbi.nlm.nih.gov/books/NBK158900/

to convert the file into csfasta and qual files, and then use software that supports SOLiD data.

A previous post discussing software for SOLiD is here:

Which Programs Are You Relying On For Solid Data Analysis?


It is a very, very bad idea to convert ABI colorspace sequences to basespace sequences.

If you investigate why, you will discover that an error in a colorspace sequence changes only the color at that particular position, whereas the rest of the colorspace sequence (before and after the error) does not change at all.

However, after converting the sequence to basespace, all the bases after the color error change.

That means you can still compare sequences in colorspace if one or several errors are present, whereas it is impossible to do so after conversion.

8.7 years ago by Antonio R. Franco ★ 5.2k
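The error propagation described above can be illustrated with a short Python sketch. It assumes the standard SOLiD di-base encoding, in which (with A, C, G, T numbered 0 to 3) each color is the XOR of two adjacent bases. The two reads are made up for the example, and dots (missing calls) are deliberately not handled.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3, so color = base_i XOR base_(i+1)

def decode_colorspace(read):
    """Translate a colorspace read (primer base + color digits) into bases."""
    prev = BASES.index(read[0])  # primer base; not part of the decoded read
    out = []
    for color in read[1:]:
        prev ^= int(color)       # each base depends on all previous colors
        out.append(BASES[prev])
    return "".join(out)

good = "T0120312"  # hypothetical read
bad = "T0130312"   # same read with a single wrong color call (2 -> 3)

decoded_good = decode_colorspace(good)
decoded_bad = decode_colorspace(bad)

color_mismatches = sum(a != b for a, b in zip(good, bad))
base_mismatches = sum(a != b for a, b in zip(decoded_good, decoded_bad))

print(decoded_good, decoded_bad)          # TGAATGA TGCCGTC
print(color_mismatches, base_mismatches)  # 1 5
```

One wrong color is a single mismatch in colorspace, but after decoding every base from that position onward differs.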

Thanks, Antonio. How do I go about processing the colorspace sequences if conversion is a bad idea? I appreciate the response; I'm new to this area.

8.7 years ago by curious ▴ 50

That is a serious problem with SOLiD data. For example, if you are going to map those sequences using TopHat, you need to use the "old" bowtie1 version, and not the newer bowtie2, because colorspace mapping is deprecated in bowtie2.

The same happens with many other programs.

In addition, notice the many dots in your sequences. This is typical of SOLiD data, and it is hard to manage. You cannot use a trimmer program without erasing too much data.

I was working a year ago with SOLiD data, and I eventually quit working with them.