Is there a way to update the third line in a fastq file (i.e, lines starting with +) programmatically using Biopython? I can parse the file using SeqIO but I don't know which field to update.
Thanks
Is there a way to update the third line in a fastq file (i.e, lines starting with +) programmatically using Biopython? I can parse the file using SeqIO but I don't know which field to update.
Thanks
Here is my python based solution to this, if anyone else is looking to modify the line with "+" sign:
with open(fh, 'w') as unique_file:
for record in SeqIO.parse(readsFile, "fastq"):
rec = record.format("fastq").splitlines()
#compute your stats for this rec
....
#
rec[2] = rec[2] + "Store your stats pertaining to the record"
result = "\n".join(rec) + "\n"
fh.write(result)
Not an elegant solution. But gets the work done in my case.
-Cheers
sorry, fixed the code. it should be record.format, It looks like most of the aligners ignore the third line and I haven't faced any problems with this format in our pipeline. You are right I cannot use SeqIO anymore on this record. But I don't intend to use Biopython after this step is done.
what do you mean by update? i love using bash for editing files
sed 's/^+/+something/g'
//edit the lines start with a '+' into start with '+something'
Do a modulus to get every third line:
$ awk '{if (NR%4==3) { /* do something with $0 */ }}' in.fq > out.fq
Biowhatever libraries are way overkill for basic text munging. And often slow and complicated.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
This sounds like an XY problem. Can you clarify what you are trying to accomplish?
Also, this is a question, not a forum post, so I have modified it.
I need to store information pertaining to each read in the third line. We have other tools in the pipeline that read the third line in each sequence read to compute some stats.
Sounds like you are making a custom, incompatible version of fastq. Is that correct? Also, you have not explained what you want to accomplish.
Not an incompatible version of fastq, I will be adding info next to + sign, so all the other aligner tools will still work as most of the tools ignore info in the third line. We have tools in our own custom pipeline that read info next to the + sign to compute stats.
From Wikipedia:
"Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again."
What does this mean? Basically, line 3 can have only a "+", or a "+" followed by the exact same information as the first line. Is it stupid? Yes, it is. But that's the format. So, you can create whatever kind of custom files you want, but they won't be FASTQ if you add arbitrary information to the 3rd line. It will be difficult to find existing programs that support your proposed custom format.
In this situation, I recommend adding tab-delimited fields to the first line, as that will not violate the FASTQ spec. The FASTQ spec does not, in fact, officially exist, but I'm sure that it still would not like to be violated.
I actually read recently somewhere that the SEQID is supposed to match [A-Za-z0-9_.:-]+ but if any format liked being violated, it would be FASTQ
Still a bit language-specific, though. Regexes are inherently language-specific so I don't consider them to be universally descriptive.
Do you absolutely need to do this in biopython/standard python?
Something like this would work for 'normal' python:
Always more convenient to use an existing parser, although I agree it's pretty easy in this case.
True, unless you're trying to avoid dependencies ;)
OP hasn't made it clear why/if they have to use Biopython however so I imagine they aren't that worried about dependency
Actually, this comment C: Update third line in fastq file using Biopython also applies to your short piece of code... Quality scores can begin with a '+' which would get nasty :p
Yeah I did see that ;) that's a good catch. I assume Biopython's parsers deal with it numerically (the n-th line or something) rather than just looking for string matches...?
I haven't looked into their code, but I assume the parsing is sound. If I'm not mistaken, in general Biopython parsers are a bit slower than your own out-of-the-box solutions, presumably to take everything into account and don't assume anything about the structure of the input file. But checking the biopython code could clear this up...
Yeah the Bio-whatever (biopython, bioperl, etc) fastq parser is like the gold standard of fastq parsers. It will even parse a multiline fastq like
I guess that's why they're pretty slower than what one can do one their own
I don't consider that to be a proper FASTQ file... given that the quality line can basically be anything, unless the file uses 4 lines per record, there's no way parse it with 100% confidence.
As far as I know there is official standard of how a fastq file should look like.
Right, there's no way to parse it properly because "@" is a quality score value, but unfortunately that was the convention when FASTQ was first used. It was FASTA (which requires newlines every 120 char), plus an extra FASTA-like entry for the quality scores. Why good old CSV wasn't used i'll never know. I agree that no one should be making multi-line FASTQs any more, but i'm glad a parser exists on the off-chance you come across one :)
You could maybe parse the file in to acquire the first line it encounters that starts with a +, and then maybe even sanity check it with a regex that looks back a line or two to make sure its only preceded by sequence (i.e contains A|T|G|C|N). Once your parser finds that line, do what you want with it, then ignore anything that comes after.
You can find the SeqIO fastq code here, which is pretty interesting and nicely documented o.O
One of the reasons sequence ID (which used to be repeated on the third line) was removed (leaving only
+
behind) was to save space as files started becoming gigantic with increase in sequence yield. Now you want to start adding something back? /s