Question

Altering a python script to calculate GC content

0

Entering edit mode

3.7 years ago

dk0319 ▴ 70

#!/home/usr/anaconda3/bin/python3
import gzip
import sys
input_file, sequence_name, output_file = sys.argv[1:]
result = None
with gzip.open(input_file, "rt", encoding="utf-8") as file_in:
      for line_number, raw_line in enumerate(file_in):
            if raw_line.startswith(f">{sequence_name}")
                result = line_number
                break
if result is not None:
      with open(output_file, "a") as file_out:
            print(result, file=file_out)

I am interested in modifying this script so that when I input a sequence name that corresponds to a multisequence fasta file it will return the GC content as a ratio. I can easily do the math portion e.g G=input.count("G"), etc.

Where I am struggling is designating the actual sequence as my input. My fasta sequences are in this format

>PDBU01061878.1 Homo sapiens CAAPA_61878, whole genome shotgun sequence
ATGGAATGGATTCGAATGGAATGGAATAGATTGGAATCAAACAGAGTGGAATGGAATGCAATGAAATGGAATCGAAT

Any help would be appreciated

sequence • 1.1k views

ADD COMMENT • link updated 3.7 years ago by Mensur Dlakic ★ 28k • written 3.7 years ago by dk0319 ▴ 70

score 0 · Answer 1 · 2021-03-21

This script requires BioPython. It will print the name, length and GC content for each sequence. For your example sequence the output is:

PDBU01061878.1 77 0.3636

import os, sys
import pandas as pd
from Bio import SeqIO
from Bio.SeqUtils import GC

if os.access(sys.argv[1], os.R_OK):
    FastaFile = open(sys.argv[1], 'rU')
else:
    print('\n !!! Input file "%s" does not exist in this directory !!!\n' %
          sys.argv[1])
    sys.exit(1)

for rec in SeqIO.parse(FastaFile, 'fasta'):
    name = rec.id
    seq = rec.seq
    seqLen = len(rec)
    seqGC = GC(seq)
    print('%s %d %.4f' % (name, seqLen, seqGC/100.0))

FastaFile.close()