Hi,
I downloaded a lot of SRA files (ChIP-seq, RNA-seq, DNase, etc.) from the Roadmap project. I'm converting them to FASTQ format (fastq-dump with the --split-files option) and then doing some preprocessing to keep them consistent.
Since the read lengths from these experiments differ, I'm trimming the reads to 36 bp with fastx_trimmer. This works fine for the FASTQs from the ChIP-seq SRAs. However, the FASTQs from RNA-seq (ABI SOLiD platform) have this format (first 8 lines):
@SRR179594.1 mendel_20110320_FRAG_BC_Ryan_RNA_Seq_2_58_404_F3 length=50
T.11.0223.0120.1020110202.0.0010.0.20.0201.2.021021
+SRR179594.1 mendel_20110320_FRAG_BC_Ryan_RNA_Seq_2_58_404_F3 length=50
!!@B!@A;B!BB:B!BB=A/%>(/%!A!.6%A!/!%'!%5.%!)!/()%-%
@SRR179594.2 mendel_20110320_FRAG_BC_Ryan_RNA_Seq_2_58_408_F3 length=50
T.20.3101.000021200002230.2.0312.0.13.0313.0.220003
+SRR179594.2 mendel_20110320_FRAG_BC_Ryan_RNA_Seq_2_58_408_F3 length=50
!!>B!<B:>!@@*?3-;%A9?A%'+!B!51,A!=!<'!:'.:!(!)-'*>5
Running fastx_trimmer on this to keep the first 36 bp throws an error:
fastx_trimmer: found invalid nucleotide sequence (T.11.0223.0120.1020110202.0.0010.0.20.0201.2.021021) on line 2
That is understandable, since the format differs from plain ACTGN nucleotides. How should I go about trimming these RNA-seq reads?
I guess you need to use abi-dump instead of fastq-dump if the data is from ABI. I have never used it myself, so this is just a thought.
It is likely you need to run the trimmer with the -Q33 option.
Thanks for that, I didn't realize the SRA Toolkit had support for ABI-specific files.
This is not actually useful. abi-dump will extract your sequences into separate FASTA and quality files, but they eventually have to be joined again into a single FASTQ file for use in many applications. The dots, which mean that the quality of the base call was too poor, will remain the same.
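If the goal is only to shorten the reads while leaving them in the format shown in the question, one option is to trim the records yourself instead of using fastx_trimmer. Below is a minimal Python sketch, not a tested pipeline. It assumes that each record is exactly four lines, that the sequence line is a leading primer base (the T above) followed by one character per call (0-3 or a dot), and that the quality line has the same number of characters as the sequence line, as in the sample records. The function names and file names are made up for illustration.

```python
from itertools import islice


def trim_record(header, seq, plus, qual, n_colors=36):
    """Trim one record to the leading primer base plus n_colors calls."""
    keep = n_colors + 1  # +1 for the leading primer base (e.g. 'T')
    # Assumption: quality string is as long as the sequence string,
    # as in the sample records from the question.
    if len(seq) != len(qual):
        raise ValueError(f"sequence/quality length mismatch in {header!r}")
    return header, seq[:keep], plus, qual[:keep]


def trim_csfastq(lines, n_colors=36):
    """Yield trimmed FASTQ lines (without newlines), 4 lines per record."""
    stripped = (line.rstrip("\r\n") for line in lines)
    while True:
        record = list(islice(stripped, 4))
        if not record:
            return  # end of input
        if len(record) != 4:
            raise ValueError("truncated FASTQ record at end of input")
        yield from trim_record(*record, n_colors=n_colors)


def trim_csfastq_file(in_path, out_path, n_colors=36):
    """Read in_path, write trimmed records to out_path (hypothetical paths)."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in trim_csfastq(src, n_colors):
            dst.write(line + "\n")
```

Calling something like trim_csfastq_file("SRR179594_1.fastq", "SRR179594_1.trim36.fastq") would then write the trimmed file. Two caveats: the length=50 tag in the header lines is left as it is, and the dots are not touched, so whatever tool reads the result still has to accept this format.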