Editing fastq file
2
0
Entering edit mode
4.7 years ago
agata ▴ 10

One of the fastq file that I am analysing has IDs of the reads not in the first line, but in the third, after "+". I would appreciate any tips how to reorganise it.

head -n 3 file.fasta
@/1 
GACCGTAGCCGTGGTATTTACTTCACTCAAGACTGGTGTTCAATGCCAGGTGTTATGCCAGTTGCTTCAGGTGGTATTCACGTATGGCACATGCCAGCTT
+SRR5874687.1.171.1 length=100
?1BDDD8B:AC@DEA:ACHAAFH?+2?1??FD9CFH9BDGDBBDGECBDFCFHG>F=@GFFHGIGIH@AHEEHF4;@C3.>BA3AAD=5;,:>C@>><CC
@/1
CTTTTACTGAATCCATGGGGTGTTTCTTATTCTTAGCTCAAAGTCTGTACATGTTGTGCACGTGCTGAAACCGCGTGTGCCGGTTGCGCGAGTCCTCTCA
+SRR5874687.1.172.1 length=100
?@@FFADDHFFFFGECGIGI@GHIJ?HHDGH?FH?D@GGHGGGGGIIGCGHDEHIHIICFFHHICGGCDHECBBBCBBDDDDD=B>?B>B5953:>:@>:
@/1
TCACATTATATTTCTGTTTTTGATCAACAATATCGTTTACCACGTAATCGTTATCTAAAACGCAACCCATTAAAAACATTGAAAAAAACGACTTTATTAGCTCAA
+SRR5874687.1.173.1 length=100
B@@FFDFDHHHHHIIJJJJJJJHIDJJJJIHIIHHJIJGG<**??F8?08??/?)8=FH@=@'B/AH=(7?CEEEA9(;;;@CCD?D<<9505?C3>@3:@%E
fastq bash • 3.1k views
ADD COMMENT
0
Entering edit mode

head -n 3 file.fasta

Why is this file called .fasta?

ADD REPLY
0
Entering edit mode

then how come -n 3 prints more than three lines?

ADD REPLY
0
Entering edit mode

@agata must have edited the output and made it go into proper lines.

ADD REPLY
0
Entering edit mode

I'm really sorry for confusion... I had to paste it from two terminals... of course it is not correct

ADD REPLY
2
Entering edit mode
4.7 years ago
ATpoint 85k
paste -d "\t" - - - - < file.fastq \
| awk '{print $3"\n"$2"\n""+""\n"$4}' FS="\t" \
| awk '{gsub("^+SRR", "@SRR");print}'

Produces:

@SRR5874687.1.171.1 length=100
GACCGTAGCCGTGGTATTTACTTCACTCAAGACTGGTGTTCAATGCCAGGTGTTATGCCAGTTGCTTCAGGTGGTATTCACGTATGGCACATGCCAGCTT
+
?1BDDD8B:AC@DEA:ACHAAFH?+2?1??FD9CFH9BDGDBBDGECBDFCFHG>F=@GFFHGIGIH@AHEEHF4;@C3.>BA3AAD=5;,:>C@>><CC
@SRR5874687.1.172.1 length=100
CTTTTACTGAATCCATGGGGTGTTTCTTATTCTTAGCTCAAAGTCTGTACATGTTGTGCACGTGCTGAAACCGCGTGTGCCGGTTGCGCGAGTCCTCTCA
+
?@@FFADDHFFFFGECGIGI@GHIJ?HHDGH?FH?D@GGHGGGGGIIGCGHDEHIHIICFFHHICGGCDHECBBBCBBDDDDD=B>?B>B5953:>:@>:
@SRR5874687.1.173.1 length=100
TCACATTATATTTCTGTTTTTGATCAACAATATCGTTTACCACGTAATCGTTATCTAAAACGCAACCCATTAAAAACATTGAAAAAAACGACTTTATTAGCTCAA
+
B@@FFDFDHHHHHIIJJJJJJJHIDJJJJIHIIHHJIJGG<**??F8?08??/?)8=FH@=@'B/AH=(7?CEEEA9(;;;@CCD?D<<9505?C3>@3:@%E
ADD COMMENT
1
Entering edit mode

when you think you saw it all, someone is tasked to analyze a FASTQ file where each sequence is named @/1 and the third line that optionally should contain the name is the one that contains the identifier of the record. oh well. Out of curiosity, how was this file made?

ADD REPLY
0
Entering edit mode

I downloaded data from NCBI It seems that was produced by Illumina HiSeq 2000... So who knows what was done before uploading... ¯_(ツ)_/¯ Could it be fault of Illimina?

ADD REPLY
0
Entering edit mode

I rather think that the headers were changed during upload to NCBI. This is not uncommon. It is also common that line 3 is just a duplicate of line 1 but with a + as starting character. There is an option to use the read names in fastq-dump

=> -F|--origfmt Defline contains only original sequence name

This could help avoiding this in the future.

ADD REPLY
0
Entering edit mode

NCBI will not allow an invalid fastq file, and this file is invalid for several reasons. First, the sequence ids are the same. In addition, the third line (past the + symbol) should either be empty or identical to the first line. But it may not be different from the first line.

More to the point, let's find the exact data the original poster has:

fastq-dump -X 171 -I --split-files SRR5874687

then

cat SRR5874687_1.fastq | tail -4

prints:

@SRR5874687.171.1 length=100
GACCGTAGCCGTGGTATTTACTTCACTCAAGACTGGTGTTCAATGCCAGGTGTTATGCCAGTTGCTTCAGGTGGTATTCACGTATGGCACATGCCAGCTT
+SRR5874687.171.1 length=100
?1BDDD8B:AC@DEA:ACHAAFH?+2?1??FD9CFH9BDGDBBDGECBDFCFHG>F=@GFFHGIGIH@AHEEHF4;@C3.>BA3AAD=5;,:>C@>><CC

amusingly, all along, the simplest solution was to just get the data again. ;-) no need for clever awk solutions

ADD REPLY
0
Entering edit mode

I had to use the fastq-dump line preferred by Trinity as my goal is to do assembly. I used this line:

fastq-dump --outdir fastq --gzip --skip-technical  --readids --read-filter pass --dumpbase --defline-seq '@$sn[_$rn]/$ri' --split-files ./SRR.sra

Unfortunately now, after applying awk Trinity does not recognise reads.

ADD REPLY
0
Entering edit mode

what does your format need to look like? we can give you a simple program that rewrites the sequence as you need. But start with the SRA default output.

The SRA is a failure of a format, hoisted on the scientific world by bureaucrats that have never touched data in their lives - as demonstrated by your problems.

ADD REPLY
0
Entering edit mode

Trinity recommends using this line when data are downloaded as SRA.

fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files file.sra

I am sorry, I couldn't find specific requirements of the name/id of reads and I am not sure how this parameter works

--defline-seq '@$sn[_$rn]/$ri'

Here is an example of reads names that are accepted by Trinity.

head -n 12 SRR1274853_1.fastq    
@1/1
NAAATAAGAGATAAGAAAAAGAAAAAATAAGAAGGGAAAAGGGAAGAATAAAATAAAAAAATAAAAAAAAGAAAAATTAAAAAAGGAAACAAGGAGAAAA
+SRR1274853.1 1 length=100
#1:?B;DDDADDC<E9EFH>@FBFBCEF>DFHGHII??BDB77;;6BGIECC@>@F4@DEB2>>ACCB=B7@8?C??>CCCCCB@08?CAC@?BBBB?CC
@2/1
NAAAATAATATGACTATTTATGCTTATGGAGATGACTTAATTTTTTCGGTGAATCCATTATACATAGACCAATTTAATTCTGTTACTATCTCAGAATTTT
+SRR1274853.2 2 length=100
#1=DDDFFDHGGBGHJJJFJJJHFIHGHJGGGHHIJJJGJJIJJJJGHIFHBEDGIIEIJJIJJGGEHJAHHEEECEHGBFFFFFFDDDEEEDDDDDACD
@3/1
NCCCTGTTTCATTTATTAAAAATAATATGACTATTTATGCTTATGGAGATGACTTAATTTTTTCGGTGAATCCATTATACATAGACCAATTTAATTCTGT
+SRR1274853.3 3 length=100
  1+ADDBDHHHHDHGGIEHGIGGII@FHIIIIIIIAHAEHCHIGGGFHHH>DFGHHECHIIIIIGE5;;@=DACHHECHFEBEFFF>C@CCCCECDDAA#hifgdggbdeead_bbcccccdcdccc

It seems strange, as the IDs again are in the "+" row... I am sorry, I should check it. It seems that the only thing that I needed was to give numbers in the first rows... As they are all the same in my example in original question.

ADD REPLY
0
Entering edit mode

You can use reformat.sh from BBMap suite if Trinity is looking for Illumina style fastq headers.

addslash=f              Append ' /1' and ' /2' to read names, if not already present.  Please include the flag 'int=t' if the reads are interleaved.
addcolon=f              Append ' 1:' and ' 2:' to read names, if not already present.  Please include the flag 'int=t' if the reads are interleaved.
ADD REPLY
0
Entering edit mode

Thank you very much for help!

Is it always the case that first read has 5 lines? As there is additional line with length = 100 after "+"?

ADD REPLY
0
Entering edit mode

No it should not. It should be 4 lines. Sorry, I do not know why this is happening. awk seems to interpret the whitespace between +SRR5874687.1.171.1 and length=100 as a tab. Will check, stay tuned.

ADD REPLY
0
Entering edit mode

Ok, found the error. The solution is to put FS="\t" behind the first awk command and not as before awk 'FS="\t" (...). Sorry for the confusion. Updated my answer.

ADD REPLY
0
Entering edit mode

Thank you once more! It works perfectly now :)

ADD REPLY
1
Entering edit mode
4.7 years ago

Here is another solution

cat data.fq | awk '
    NR%4 == 2 { s = $0 } 
    NR%4 == 3 { a = substr($0, 2, 10000) } 
    NR%4 == 0 { printf ("@%s\n%s\n+\n%s\n", a, s, $0) }'

it relies on collecting lines, but only printing it at the end of the record (write it all as one line not wrapped like above:

@SRR5874687.1.171.1 length=100
GACCGTAGCCGTGGTATTTACTTCACTCAAGACTGGTGTTCAATGCCAGGTGTTATGCCAGTTGCTTCAGGTGGTATTCACGTATGGCACATGCCAGCTT
+
?1BDDD8B:AC@DEA:ACHAAFH?+2?1??FD9CFH9BDGDBBDGECBDFCFHG>F=@GFFHGIGIH@AHEEHF4;@C3.>BA3AAD=5;,:>C@>><CC
@SRR5874687.1.172.1 length=100
CTTTTACTGAATCCATGGGGTGTTTCTTATTCTTAGCTCAAAGTCTGTACATGTTGTGCACGTGCTGAAACCGCGTGTGCCGGTTGCGCGAGTCCTCTCA
+
?@@FFADDHFFFFGECGIGI@GHIJ?HHDGH?FH?D@GGHGGGGGIIGCGHDEHIHIICFFHHICGGCDHECBBBCBBDDDDD=B>?B>B5953:>:@>:
@SRR5874687.1.173.1 length=100
TCACATTATATTTCTGTTTTTGATCAACAATATCGTTTACCACGTAATCGTTATCTAAAACGCAACCCATTAAAAACATTGAAAAAAACGACTTTATTAGCTCAA
+
B@@FFDFDHHHHHIIJJJJJJJHIDJJJJIHIIHHJIJGG<**??F8?08??/?)8=FH@=@'B/AH=(7?CEEEA9(;;;@CCD?D<<9505?C3>@3:@%E
ADD COMMENT

Login before adding your answer.

Traffic: 2536 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6