Update third line in fastq file using Biopython
4
0
Entering edit mode
7.9 years ago
spaladug ▴ 10

Is there a way to update the third line in a fastq file (i.e, lines starting with +) programmatically using Biopython? I can parse the file using SeqIO but I don't know which field to update.

Thanks

fastq • 4.0k views
ADD COMMENT
1
Entering edit mode

This sounds like an XY problem. Can you clarify what you are trying to accomplish?

Also, this is a question, not a forum post, so I have modified it.

ADD REPLY
0
Entering edit mode

I need to store information pertaining to each read in the third line. We have other tools in the pipeline that read the third line in each sequence read to compute some stats.

ADD REPLY
0
Entering edit mode

Sounds like you are making a custom, incompatible version of fastq. Is that correct? Also, you have not explained what you want to accomplish.

ADD REPLY
0
Entering edit mode

Not an incompatible version of fastq, I will be adding info next to + sign, so all the other aligner tools will still work as most of the tools ignore info in the third line. We have tools in our own custom pipeline that read info next to the + sign to compute stats.

ADD REPLY
1
Entering edit mode

From Wikipedia:

"Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again."

What does this mean? Basically, line 3 can have only a "+", or a "+" followed by the exact same information as the first line. Is it stupid? Yes, it is. But that's the format. So, you can create whatever kind of custom files you want, but they won't be FASTQ if you add arbitrary information to the 3rd line. It will be difficult to find existing programs that support your proposed custom format.

ADD REPLY
3
Entering edit mode

In this situation, I recommend adding tab-delimited fields to the first line, as that will not violate the FASTQ spec. The FASTQ spec does not, in fact, officially exist, but I'm sure that it still would not like to be violated.

ADD REPLY
2
Entering edit mode

I actually read recently somewhere that the SEQID is supposed to match [A-Za-z0-9_.:-]+ but if any format liked being violated, it would be FASTQ

ADD REPLY
1
Entering edit mode

Still a bit language-specific, though. Regexes are inherently language-specific so I don't consider them to be universally descriptive.

ADD REPLY
1
Entering edit mode

Do you absolutely need to do this in biopython/standard python?

Something like this would work for 'normal' python:

with open(fastq, 'r') as fqh:
    for line in fqh:
        if line.startswith('+'):
             # do stuff with line
ADD REPLY
1
Entering edit mode

Always more convenient to use an existing parser, although I agree it's pretty easy in this case.

ADD REPLY
1
Entering edit mode

True, unless you're trying to avoid dependencies ;)

OP hasn't made it clear why/if they have to use Biopython however so I imagine they aren't that worried about dependency

ADD REPLY
0
Entering edit mode

Actually, this comment C: Update third line in fastq file using Biopython also applies to your short piece of code... Quality scores can begin with a '+' which would get nasty :p

ADD REPLY
0
Entering edit mode

Yeah I did see that ;) that's a good catch. I assume Biopython's parsers deal with it numerically (the n-th line or something) rather than just looking for string matches...?

ADD REPLY
1
Entering edit mode

I haven't looked into their code, but I assume the parsing is sound. If I'm not mistaken, in general Biopython parsers are a bit slower than your own out-of-the-box solutions, presumably to take everything into account and don't assume anything about the structure of the input file. But checking the biopython code could clear this up...

ADD REPLY
1
Entering edit mode

Yeah the Bio-whatever (biopython, bioperl, etc) fastq parser is like the gold standard of fastq parsers. It will even parse a multiline fastq like

@myDNA
ACTACGACTACGC
ACTACGACTACGC
+
!BBBCDCCCCCGG
C12IIIIIIIIII

I guess that's why they're pretty slower than what one can do one their own

ADD REPLY
1
Entering edit mode

I don't consider that to be a proper FASTQ file... given that the quality line can basically be anything, unless the file uses 4 lines per record, there's no way parse it with 100% confidence.

ADD REPLY
0
Entering edit mode

As far as I know there is official standard of how a fastq file should look like.

ADD REPLY
0
Entering edit mode

Right, there's no way to parse it properly because "@" is a quality score value, but unfortunately that was the convention when FASTQ was first used. It was FASTA (which requires newlines every 120 char), plus an extra FASTA-like entry for the quality scores. Why good old CSV wasn't used i'll never know. I agree that no one should be making multi-line FASTQs any more, but i'm glad a parser exists on the off-chance you come across one :)

ADD REPLY
0
Entering edit mode

You could maybe parse the file in to acquire the first line it encounters that starts with a +, and then maybe even sanity check it with a regex that looks back a line or two to make sure its only preceded by sequence (i.e contains A|T|G|C|N). Once your parser finds that line, do what you want with it, then ignore anything that comes after.

ADD REPLY
1
Entering edit mode

You can find the SeqIO fastq code here, which is pretty interesting and nicely documented o.O

ADD REPLY
0
Entering edit mode

One of the reasons sequence ID (which used to be repeated on the third line) was removed (leaving only + behind) was to save space as files started becoming gigantic with increase in sequence yield. Now you want to start adding something back? /s

ADD REPLY
1
Entering edit mode
7.9 years ago
spaladug ▴ 10

Here is my python based solution to this, if anyone else is looking to modify the line with "+" sign:

with open(fh, 'w') as unique_file:
     for record in SeqIO.parse(readsFile, "fastq"):
          rec = record.format("fastq").splitlines()
          #compute your stats for this rec
         ....
          #
          rec[2] = rec[2] + "Store your stats pertaining to the record"
          result = "\n".join(rec) + "\n"
          fh.write(result)

Not an elegant solution. But gets the work done in my case.

-Cheers

ADD COMMENT
0
Entering edit mode

Can you check your code again? This doesn't seem right. You never actually use the record variable. Also, note that your file is now no longer a valid fastq (as stated above) and I expect if you try to parse this again with SeqIO you'll see dragons.

ADD REPLY
0
Entering edit mode

sorry, fixed the code. it should be record.format, It looks like most of the aligners ignore the third line and I haven't faced any problems with this format in our pipeline. You are right I cannot use SeqIO anymore on this record. But I don't intend to use Biopython after this step is done.

ADD REPLY
0
Entering edit mode
7.9 years ago
nim.1111.ou ▴ 180

what do you mean by update? i love using bash for editing files

sed 's/^+/+something/g'

//edit the lines start with a '+' into start with '+something'

ADD COMMENT
0
Entering edit mode

Although that is technically correct to update ALL third lines, this is unlikely going to help OP to specifically edit each third line separately and specifically.

ADD REPLY
1
Entering edit mode

I'm not sure about that... quality lines can basically start with anything.

ADD REPLY
0
Entering edit mode

Yes, that would be nasty. Forget I said anything :-D

ADD REPLY
0
Entering edit mode
7.9 years ago

Do a modulus to get every third line:

$ awk '{if (NR%4==3) { /* do something with $0 */ }}' in.fq > out.fq

Biowhatever libraries are way overkill for basic text munging. And often slow and complicated.

ADD COMMENT

Login before adding your answer.

Traffic: 1833 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6