Entering edit mode
3.7 years ago
dk0319
▴
70
#!/home/usr/anaconda3/bin/python3
import gzip
import sys
input_file, sequence_name, output_file = sys.argv[1:]
result = None
with gzip.open(input_file, "rt", encoding="utf-8") as file_in:
for line_number, raw_line in enumerate(file_in):
if raw_line.startswith(f">{sequence_name}")
result = line_number
break
if result is not None:
with open(output_file, "a") as file_out:
print(result, file=file_out)
I am interested in modifying this script so that when I input a sequence name that corresponds to a multisequence fasta file it will return the GC content as a ratio. I can easily do the math portion e.g G=input.count("G"), etc.
Where I am struggling is designating the actual sequence as my input. My fasta sequences are in this format
>PDBU01061878.1 Homo sapiens CAAPA_61878, whole genome shotgun sequence
ATGGAATGGATTCGAATGGAATGGAATAGATTGGAATCAAACAGAGTGGAATGGAATGCAATGAAATGGAATCGAAT
Any help would be appreciated