fastq headers problem with Trinity
4.7 years ago
agata ▴ 10

I downloaded the data as an SRA file and ran fastq-dump according to the Trinity recommendations.

fastq-dump --skip-technical  --readids --read-filter pass --dumpbase --defline-seq '@$sn[_$rn]/$ri' --split-files ./SRR.sra

Then I ran quality control with FastQC and trimmed the adapters with Trimmomatic.

My headers look like this:

head -n 8 SRR5874687.1_pass_1_trim.fastq 
@/1
GACCGTAGCCGTGGTATTTACTTCACTCAAGACTGGTGTTCAATGCCAGGTGTTATGCCAGTTGCTTCAGGTGGTATTCACGTATGGCACATGCCAGCTT
+SRR5874687.1.171.1 length=100
?1BDDD8B:AC@DEA:ACHAAFH?+2?1??FD9CFH9BDGDBBDGECBDFCFHG>F=@GFFHGIGIH@AHEEHF4;@C3.>BA3AAD=5;,:>C@>><CC
@/1
CTTTTACTGAATCCATGGGGTGTTTCTTATTCTTAGCTCAAAGTCTGTACATGTTGTGCACGTGCTGAAACCGCGTGTGCCGGTTGCGCGAGTCCTCTCA
+SRR5874687.1.172.1 length=100
?@@FFADDHFFFFGECGIGI@GHIJ?HHDGH?FH?D@GGHGGGGGIIGCGHDEHIHIICFFHHICGGCDHECBBBCBBDDDDD=B>?B>B5953:>:@>:

head -n 8 SRR5874687.1_pass_2_trim.fastq 
@/2
CACCGAACTGAAGACATGCGTCATCACCGAAGATTTCAACTAAAGCTGGCATGTGCCATACGTGAATACCACCTGAAGCAACTGGCATAACACCTGGCAT
+SRR5874687.1.171.2 length=100
@@@DFFDDHBFHDHGBFG@@C<@F>??CFHIH0??FFIGII<BBC@FCFCHGH.7777=D;AHEFB@?7;;>BEC;@CCCC??ACBCCCCCCC?CC@?CC
@/2
CTGGACAACGCGCCGCAATATTGCAGCTTATTAGTTTGGTGATGAGAGGACTCGCGCAACCGGCACACGCGGTTTCAGCACGTGCACAACATGTACAGAC
+SRR5874687.1.172.2 length=100
?@@FBDDDFHDHHJJJIGHIIJJGGHIGI?FH<DFHJJJCF@GHFHGHIGHHEEEDDDDDDDDDDDDDD@BBBBDDEDDDDDBDDDDDDDDDDDEEEECB

At this point Trinity failed, complaining about an empty kmer25.

Initially I thought the problem was the position of the header (third line instead of first), so I asked here for help. First I moved the headers from the third line to the first using the suggested awk method, and then used bbmap to add the slashes (/1 and /2).
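
For reference, a header move of this kind can be done in one awk pass. The following is a minimal sketch, not the exact command that was suggested to me: it assumes strict four-line FASTQ records whose first line is the bare `@/1` and whose third line is `+<original ID>`. The file names `demo_1.fastq` and `moved_1.fastq` and the shortened reads are placeholders.

```shell
# Tiny two-record sample in the same layout as the trimmed files (placeholder data).
printf '%s\n' \
  '@/1' 'GACC' '+SRR5874687.1.171.1 length=4' '?1BD' \
  '@/1' 'CTTT' '+SRR5874687.1.172.1 length=4' '?@@F' > demo_1.fastq

# Work by position within each 4-line record, not by pattern:
# the ID on line 3 becomes the header and line 3 is reduced to a bare "+".
awk '
  NR % 4 == 1 { next }                      # drop the empty "@/1" header
  NR % 4 == 2 { seq = $0; next }            # remember the sequence
  NR % 4 == 3 { id = substr($0, 2); next }  # ID without the leading "+"
  NR % 4 == 0 { print "@" id; print seq; print "+"; print $0 }
' demo_1.fastq > moved_1.fastq

head -n 4 moved_1.fastq
```

Selecting lines by `NR % 4` instead of by a regular expression avoids touching quality strings that happen to begin with `@` or `+`.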

Now, headers look like this:

head -n 8 slashed_biostar_1.fastq 
@SRR5874687.1.171.1 length=100 /1
GACCGTAGCCGTGGTATTTACTTCACTCAAGACTGGTGTTCAATGCCAGGTGTTATGCCAGTTGCTTCAGGTGGTATTCACGTATGGCACATGCCAGCTT
+
?1BDDD8B:AC@DEA:ACHAAFH?+2?1??FD9CFH9BDGDBBDGECBDFCFHG>F=@GFFHGIGIH@AHEEHF4;@C3.>BA3AAD=5;,:>C@>><CC
@SRR5874687.1.172.1 length=100 /1
CTTTTACTGAATCCATGGGGTGTTTCTTATTCTTAGCTCAAAGTCTGTACATGTTGTGCACGTGCTGAAACCGCGTGTGCCGGTTGCGCGAGTCCTCTCA
+
?@@FFADDHFFFFGECGIGI@GHIJ?HHDGH?FH?D@GGHGGGGGIIGCGHDEHIHIICFFHHICGGCDHECBBBCBBDDDDD=B>?B>B5953:>:@>:

head -n 8 slashed_biostar_2.fastq 
@SRR5874687.1.171.2 length=100 /2
CACCGAACTGAAGACATGCGTCATCACCGAAGATTTCAACTAAAGCTGGCATGTGCCATACGTGAATACCACCTGAAGCAACTGGCATAACACCTGGCAT
+
@@@DFFDDHBFHDHGBFG@@C<@F>??CFHIH0??FFIGII<BBC@FCFCHGH.7777=D;AHEFB@?7;;>BEC;@CCCC??ACBCCCCCCC?CC@?CC
@SRR5874687.1.172.2 length=100 /2
CTGGACAACGCGCCGCAATATTGCAGCTTATTAGTTTGGTGATGAGAGGACTCGCGCAACCGGCACACGCGGTTTCAGCACGTGCACAACATGTACAGAC
+
?@@FBDDDFHDHHJJJIGHIIJJGGHIGI?FH<DFHJJJCF@GHFHGHIGHHEEEDDDDDDDDDDDDDD@BBBBDDEDDDDDBDDDDDDDDDDDEEEECB

But this time Trinity does not recognize the read name formatting: [SRR5874687.1.171.2]

My one guess is that putting the read number after a space at the end might help, like: [SRR5874687.1.171.2 1]. Does anybody know how to do this automatically?

Do you know what messed up the headers? It was not Trimmomatic: after trimming they were exactly the same as straight after running fastq-dump on the raw data (a procedure that has worked perfectly until now).

fastq trinity • 1.4k views
4.6 years ago
agata ▴ 10

In the end, the header format that lets me run Trinity is as follows:

@1/1
GACCGTAGCCGTGGTATTTACTTCACTCAAGACTGGTGTTCAATGCCAGGTGTTATGCCAGTTGCTTCAGGTGGTATTCACGTATGGCACATGCCAGCTT
+SRR5874687.1.171.1 1 length=100
?1BDDD8B:AC@DEA:ACHAAFH?+2?1??FD9CFH9BDGDBBDGECBDFCFHG>F=@GFFHGIGIH@AHEEHF4;@C3.>BA3AAD=5;,:>C@>><CC
@2/1
CTTTTACTGAATCCATGGGGTGTTTCTTATTCTTAGCTCAAAGTCTTACATGTTGTGCACGTGCTGAAACCGCGTGTGCCGGTTGCGCGAGTCCTCTCA
+SRR5874687.1.172.1 2 length=100
?@@FFADDHFFFFGECGIGI@GHIJ?HHDGH?FH?D@GGHGGGGGIIGCGHDEHIHIICFFHHICGGCDHECBBBCBBDDDDD=B>?B>B5953:>:@>:
@3/1
TCACATTATATTTCTGTTTTTGATCAACAATATCGTTTACCACGTAATCGTTATCTAAAACGCAACCCATTAAAAACATTGAAAAAAACGACTTTATTAG
+SRR5874687.1.173.1 3 length=100
B@@FFDFDHHHHHIIJJJJJJJHIDJJJJIHIIHHJIJGG<**??F8?08??/?)8=FH@=@'B/AH=(7?CEEEA9(;;;@CCD?D<<9505?C3>@3:

So the first line must contain, after "@", the number of the read followed by the suffix "/1" or "/2", which marks the first or second read of a pair. The third line should keep the original read ID after "+", followed by a space and the read number repeated. After another space, the read length may be given (optional).
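
The layout described above can be checked mechanically before starting a long Trinity run. This is only a sketch of such a check, assuming four-line records; the file names `check_1.fastq` and `check_report.txt` are placeholders, and the second sample record is deliberately mismatched to show what a failure looks like.

```shell
# Placeholder sample: record 1 follows the layout, record 2 repeats the wrong number.
printf '%s\n' \
  '@1/1' 'GACC' '+SRR5874687.1.171.1 1 length=4' '?1BD' \
  '@2/1' 'CTTT' '+SRR5874687.1.172.1 3 length=4' '?@@F' > check_1.fastq

# Compare the number in "@N/1" (line 1) with the second field of "+ID N ..." (line 3).
awk -v mate=1 '
  NR % 4 == 1 {
    ok = ($0 ~ ("^@[0-9]+/" mate "$"))      # header must be exactly @<number>/<mate>
    n = substr($0, 2); sub(/\/.*/, "", n)   # keep only the number
  }
  NR % 4 == 3 {
    if (!ok || $2 != n) { bad++; print "record " ((NR + 1) / 4) ": " $0 }
  }
  END { printf "%d problem record(s)\n", bad }
' check_1.fastq > check_report.txt

cat check_report.txt
```

For the second file of a pair, the same command would be run with `mate=2`.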

I used awk to edit the headers. Here is the code for the first read of each pair.

Edit the first line:

awk '/^@\/1/{sub(/\/1/,++i"/1")}1' SRR5874687.1_pass_1_trim.fastq > out1.txt

and then the third line:

awk '/^\+SRR/{sub(/ /, " " ++i " ")}1' out1.txt > SRR5874687.1_pass_1_trim_awk.fastq
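
The two passes can probably be folded into one. Below is a minimal sketch, assuming the same input layout as the trimmed files shown in my question (`@/1` or `@/2` on line 1 and `+<ID> length=...` on line 3 of every four-line record). `renumber` is a hypothetical helper name, and the sample file and its shortened reads are placeholders.

```shell
# renumber MATE IN OUT   (hypothetical helper; MATE is 1 or 2)
renumber() {
  awk -v mate="$1" '
    NR % 4 == 1 { n++; print "@" n "/" mate; next }   # header becomes @N/1 or @N/2
    NR % 4 == 3 { sub(/ /, " " n " "); print; next }  # +ID length=.. becomes +ID N length=..
    { print }                                         # sequence and quality unchanged
  ' "$2" > "$3"
}

# Placeholder sample in the layout of the trimmed files.
printf '%s\n' \
  '@/1' 'GACC' '+SRR5874687.1.171.1 length=4' '?1BD' \
  '@/1' 'CTTT' '+SRR5874687.1.172.1 length=4' '?@@F' > pair_1.fastq

renumber 1 pair_1.fastq pair_1_numbered.fastq
cat pair_1_numbered.fastq
```

Because the number is simply the record index, both mates get identical numbers only if the two files still contain the same reads in the same order, so this is worth verifying after trimming.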
