Altering a python script to calculate GC content
0
0
Entering edit mode
3.7 years ago
dk0319 ▴ 70
#!/home/usr/anaconda3/bin/python3
import gzip
import sys
input_file, sequence_name, output_file = sys.argv[1:]
result = None
with gzip.open(input_file, "rt", encoding="utf-8") as file_in:
      for line_number, raw_line in enumerate(file_in):
            if raw_line.startswith(f">{sequence_name}")
                result = line_number
                break
if result is not None:
      with open(output_file, "a") as file_out:
            print(result, file=file_out)

I am interested in modifying this script so that when I input a sequence name that corresponds to a multisequence fasta file it will return the GC content as a ratio. I can easily do the math portion e.g G=input.count("G"), etc.

Where I am struggling is designating the actual sequence as my input. My fasta sequences are in this format

PDBU01061878.1 Homo sapiens CAAPA_61878, whole genome shotgun sequence
ATGGAATGGATTCGAATGGAATGGAATAGATTGGAATCAAACAGAGTGGAATGGAATGCAATGAAATGGAATCGAAT

Any help would be appreciated

sequence • 511 views
ADD COMMENT

Login before adding your answer.

Traffic: 1727 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6