Question

Gc Content In Mate Pairs / Pair-End

0

Entering edit mode

13.4 years ago

Ric ▴ 440

Hello, In BioPython it is possible to calculate the GC content from a sequence in the following way:

>>> from Bio.SeqUtils import GC
>>> GC("ACTGN")
40.0

However, how to do it with mate pair reads? Should, I only consider the A read only and ignore the B read?

Thank you in advance.

biopython paired gc • 4.9k views

ADD COMMENT • link updated 8.4 years ago by Biostar 20 • written 13.4 years ago by Ric ▴ 440

score 1 · Answer 1 · 2012-03-07

You could use the .next() method of python's iterator to go over two sequences at a time with something like this:

from Bio import SeqIO
handle1 = open("example_1.fq", "rU")
handle2 = open("example_2.fq", "rU")
itter1 = SeqIO.parse(handle1, "fastq")
itter2 = SeqIO.parse(handle2, "fastq")
for record1 in itter1 :
    record2 = itter2.next()
    #Calculate GC on record 1 and record 2
handle1.close()
handle2.close()

This will work as long as the reads are sorted in your two fastq files. This is pretty typical for what comes off of the illumina machine. Another option if your fastq file of mates is interleaved, you could do something similar but with a single handle:

from Bio import SeqIO
handle = open("example_interleaved.fq", "rU")
fq_itter = SeqIO.parse(handle, "fastq")
for record1 in fq_itter:
    record2 = fq_itter.next()
    #Calculate GC on record 1 and record 2
handle.close()

Since for bla in bla_itter just calls .next() over and over again on bla_itter you don't get issues with seeing the same record twice, if you call .next inside of the inner loop it moves the outer loop forward an extra position as well. That might not be technically correct, but that is how I have seen python behave.

score 0 · Answer 2 · 2012-03-07

0

Entering edit mode

13.4 years ago

Fidel ★ 2.0k

Depends on what you want. If you want to compute the GC content of each read, then take into account both reads. This is what quality control software like FastQC does.

If you want to compute the GC content of the fragment defined by the paired reads, then use that sequence instead.

ADD COMMENT • link 13.4 years ago by Fidel ★ 2.0k

0

Entering edit mode

Thank you for your response.

READ a: @aaa ACTG + hhsl

READ b: @bbb TGCT + hhsl

Does FastQC concatenate READ a and b together in the following way:

READ a + b: @aaa+bbb ACTGTGCT + hhslhhsl

and it calculate the GC content?

What do you mean to compute the GC content of the fragment defined by the paired reads and for what could I used it?

ADD REPLY • link 13.4 years ago by Ric ▴ 440

0

Entering edit mode

Thank you for your response.

READ a: @aaa ACTG + hhsl

READ b: @bbb TGCT + hhsl

Does FastQC concatenate READ a and b together in the following way:

READ a + b: @aaa+bbb ACTGTGCT + hhslhhsl

and it calculate the GC content?

What do you mean to compute the GC content of the fragment defined by the paired reads and for what could I used it?

ADD REPLY • link 13.4 years ago by Ric ▴ 440

0

Entering edit mode

As far as I know FastQC takes each read independently. I see no reason for concatenating the two reads one after the other. If you have mate pairs, from paired-end sequencing assay then, they represent the two extremes of an original fragment. See http://www.illumina.com/technology/paired_end_sequencing_assay.ilmn

Can you explain why do you want to compute GC content?

ADD REPLY • link 13.4 years ago by Fidel ★ 2.0k