One of the FASTQ files that I am analysing has the read IDs not in the first line, but in the third, after the "+".
I would appreciate any tips on how to reorganise it.
Just when you think you have seen it all, someone is tasked with analyzing a FASTQ file where every sequence is named @/1 and the third line, which should only optionally contain the name, is the one that holds the record identifier. Oh well. Out of curiosity, how was this file made?
I downloaded data from NCBI
It seems that it was produced by Illumina HiSeq 2000... So who knows what was done before uploading... ¯_(ツ)_/¯
Could it be the fault of Illumina?
I rather think that the headers were changed during upload to NCBI. This is not uncommon. It is also common that line 3 is just a duplicate of line 1 but with a + as the starting character. There is an option in fastq-dump to use the original read names:
=> -F|--origfmt Defline contains only original sequence name
NCBI will not allow an invalid FASTQ file, and this file is invalid for several reasons. First, all the sequence IDs are the same. In addition, the third line (after the + symbol) should either be empty or identical to the first line; it may not differ from the first line.
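To make those two rules concrete, here is a minimal Python sketch that checks them on a list of lines. The function name check_fastq and the sample records are made up for illustration; the rules are only the ones stated above (unique headers, third line either a bare + or a copy of the header), plus a basic length check, so this is not a full FASTQ validator.

```python
from collections import Counter


def check_fastq(lines):
    """Return a list of problems found in 4-line-per-record FASTQ lines.

    `lines` is a list of strings without trailing newlines.
    Hypothetical helper, not part of any existing tool.
    """
    problems = []
    if len(lines) % 4 != 0:
        problems.append("line count is not a multiple of 4")

    headers = []
    complete = len(lines) - len(lines) % 4  # ignore a trailing partial record
    for i in range(0, complete, 4):
        head, seq, plus, qual = lines[i:i + 4]
        record = i // 4 + 1
        if not head.startswith("@"):
            problems.append(f"record {record}: header does not start with '@'")
        if not plus.startswith("+"):
            problems.append(f"record {record}: third line does not start with '+'")
        elif plus != "+" and plus[1:] != head[1:]:
            # Third line must be a bare '+' or repeat the header text
            problems.append(f"record {record}: '+' line differs from header")
        if len(seq) != len(qual):
            problems.append(f"record {record}: sequence and quality lengths differ")
        headers.append(head)

    # Every header should be unique within the file
    for name, count in Counter(headers).items():
        if count > 1:
            problems.append(f"header {name!r} appears {count} times")
    return problems


# Made-up records shaped like the file in the question
example = ["@/1", "ACGT", "+read_1", "IIII",
           "@/1", "TTGA", "+read_2", "IIII"]
for problem in check_fastq(example):
    print(problem)
```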
More to the point, let's find the exact data the original poster has:
What does your format need to look like? We can give you a simple program that rewrites the sequence as you need. But start with the SRA default output.
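As a rough idea of what such a rewriting program could look like, here is a minimal Python sketch that moves the identifier from the + line into the @ header and leaves a bare + behind. It assumes strict 4-line records; the function name promote_plus_ids is made up, and the sample read is shortened for illustration (only the ID is taken from this thread).

```python
def promote_plus_ids(lines):
    """Yield FASTQ lines with the '+'-line ID moved to the '@' header.

    Assumes every record is exactly four lines. Hypothetical helper.
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            head, seq, plus, qual = record
            if not (head.startswith("@") and plus.startswith("+")):
                raise ValueError(f"unexpected record layout near {head!r}")
            # Fall back to the old header text if the '+' line is bare
            name = plus[1:] or head[1:]
            yield "@" + name
            yield seq
            yield "+"
            yield qual
            record = []
    if record:
        raise ValueError("input ended in the middle of a record")


# Shortened, made-up read using an ID of the kind seen in this thread
sample = ["@/1\n", "ACGT\n", "+SRR5874687.1.171.1\n", "IIII\n"]
for out in promote_plus_ids(sample):
    print(out)
```

Because it is a generator, the same function can be fed an open file handle instead of a list, so a large file does not have to be held in memory.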
The SRA is a failure of a format, foisted on the scientific world by bureaucrats that have never touched data in their lives - as demonstrated by your problems.
It seems strange, as the IDs are again in the "+" row... I am sorry, I should check it. It seems that the only thing I needed was to put numbers in the first rows... as they are all the same in the example in my original question.
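If numbering the first rows really is all that is needed, a few lines of Python would do it. This is only a sketch: the function name number_headers and the "@<number>/1" header layout are assumptions for illustration, so adjust the suffix to whatever the downstream tool expects.

```python
def number_headers(lines, suffix="/1"):
    """Replace every header line with '@<record number><suffix>'.

    Assumes strict 4-line records; all other lines are kept unchanged.
    Hypothetical helper.
    """
    out = []
    for i, line in enumerate(lines):
        if i % 4 == 0:
            # First line of each record: write a running record number
            out.append(f"@{i // 4 + 1}{suffix}")
        else:
            out.append(line)
    return out


# Made-up records with identical headers, as in the question
records = ["@/1", "ACGT", "+read_1", "IIII",
           "@/1", "TTGA", "+read_2", "IIII"]
for line in number_headers(records):
    print(line)
```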
You can use reformat.sh from BBMap suite if Trinity is looking for Illumina style fastq headers.
addslash=f Append ' /1' and ' /2' to read names, if not already present. Please include the flag 'int=t' if the reads are interleaved.
addcolon=f Append ' 1:' and ' 2:' to read names, if not already present. Please include the flag 'int=t' if the reads are interleaved.
No, it should not. It should be 4 lines. Sorry, I do not know why this is happening. awk seems to interpret the whitespace between +SRR5874687.1.171.1 and length=100 as a tab. Will check, stay tuned.
OK, found the error. The solution is to put FS="\t" after the first awk command and not, as before, inside it (awk 'FS="\t" (...)). Sorry for the confusion. I have updated my answer.
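For anyone wondering why the separator matters here: awk's default field separator splits on any run of spaces or tabs, so a header containing a space is cut into two fields unless the tab separator is already in effect. The snippet below is only a Python analogy of that splitting behaviour, not the awk command from the answer.

```python
# Header line of the kind discussed above
line = "+SRR5874687.1.171.1 length=100"

# Analogous to awk's default FS: split on any run of whitespace
whitespace_fields = line.split()

# Analogous to FS="\t": split on tab characters only
tab_fields = line.split("\t")

print(len(whitespace_fields), whitespace_fields)
print(len(tab_fields), tab_fields)
```

With whitespace splitting the header becomes two fields; with tab splitting it stays one, which is what keeps each record at four columns.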
Why is this file called .fasta? Then how come -n 3 prints more than three lines?
@agata must have edited the output and made it go into proper lines.
I'm really sorry for the confusion... I had to paste it from two terminals... of course it is not correct.