Hello, I have a FASTA file that has reads like this:
>SRR5655563.745 745 length=126 (type=T,start=2,end=12,length=11,identity=81.8182%)
CttttatttgttGTGTGTGAGGTTTGATTTGATGGGAAAATATCTTGAATCTGCGGCGAGGTTGGAAGAGCTATCGCGGATTGTGTCATCTGCTGCGAAGCCCAATAGGTCAAAGGGAACGCTACC
As you can see, the header gives the start and end of a region of the read that I would like to trim off, keeping the rest of the read. For example:
The part I would like to trim:
> Cttttatttgtt
And after trimming, I would like to keep the rest of the read:
> GTGTGTGAGGTTTGATTTGATGGGAAAATATCTTGAATCTGCGGCGAGGTTGGAAGAGCTATCGCGGATTGTGTCATCTGCTGCGAAGCCCAATAGGTCAAAGGGAACGCTACC
So the new FASTA file will have reads like this:
>SRR5655563.745 745 length=126 (type=T,start=2,end=12,length=11,identity=81.8182%)
GTGTGTGAGGTTTGATTTGATGGGAAAATATCTTGAATCTGCGGCGAGGTTGGAAGAGCTATCGCGGATTGTGTCATCTGCTGCGAAGCCCAATAGGTCAAAGGGAACGCTACC
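In code terms, the trimming I am after for a single record might look like the sketch below. It assumes the header always carries an `end=<number>` field and that everything up to and including that 1-based position should be dropped, which is what the expected output above implies. The helper name `trim_prefix` and the shortened read are made up for illustration.

```python
import re

# Hypothetical helper: the name and the regex are illustrative assumptions.
def trim_prefix(header: str, read: str) -> str:
    """Drop everything up to and including the 1-based `end` position in the header."""
    match = re.search(r"\bend=(\d+)", header)
    if match is None:
        return read  # no annotation found: leave the read untouched
    return read[int(match.group(1)):]

header = ">SRR5655563.745 745 length=126 (type=T,start=2,end=12,length=11,identity=81.8182%)"
read = "CttttatttgttGTGTGTGAGGTTTGATTTG"  # read shortened for the example
print(trim_prefix(header, read))  # prints GTGTGTGAGGTTTGATTTG
```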
I am writing a Python script that should take a FASTA file as input and output a trimmed FASTA file.
So far what I have written is this:
import argparse
import re

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--fasta", type=str, required=True, help="input fasta file")
    parser.add_argument("--output", type=str, required=True, help="output tail.trimmed.fasta file")
    args = parser.parse_args()
    # argparse stores the value of --output as args.output
    out_fasta = open(args.output, 'w')
    fasta = open(args.fasta)
    for line in fasta:
        header = line.rstrip()
        # next() has to be called on the file object, not on the path string
        read = next(fasta).rstrip()
        #Extract header into a list
        header_lst = header.split(" ")
        tmp_start_lst = []
        tmp_end_lst = []
        tmp_lst = []
        for i in header_lst:
            if i.startswith("(") and i.endswith(")"):
                tmp_lst.append(i.split(","))
        #print(tmp_lst)
        # Extract the start and end numbers in a list
        for i in tmp_lst:
            # "field" instead of "str", which would shadow the built-in type
            for field in i:
                if field.startswith("start="):
                    tmp_start_lst.append(field)
                if field.startswith("end="):
                    tmp_end_lst.append(field)
        pattern = re.compile(r'\d+$')
        start_lst = []
        for field in tmp_start_lst:
            match = re.search(pattern, field)
            start_lst.append(int(match.group()))
        end_lst = []
        for field in tmp_end_lst:
            match = re.search(pattern, field)
            end_lst.append(int(match.group()))
        #Extract reads into a list to manipulate i.e. trim
        read_lst = []
        read_lst.append(read)
        new_lst = []
        for seq in read_lst:
            # pair each start with its own end instead of looping over every combination
            for num_start, num_end in zip(start_lst, end_lst):
                # slice off bases 1..end (as in the expected output above);
                # replace() would delete every occurrence of the substring
                s = seq[num_end:]
                new_lst.append(s)
    fasta.close()
    out_fasta.close()

if __name__ == "__main__":
    main()
I do get the trimmed reads in a list format, but what I'm having trouble figuring out is how to write these trimmed reads with their corresponding headers to a new FASTA file.
Also, if there is any way I could improve my code or a better strategy for approaching the problem, I would love to know.
Best wishes!
It's not forbidden to cross-post but it's generally appreciated to at least leave a link to the other community to avoid double-effort for users investing their time: https://bioinformatics.stackexchange.com/questions/20247/how-to-manually-trim-fasta-file-sequences-with-the-information-provided-in-the-h
Noted. I didn't know, and I'll keep that in mind for next time. Thanks.
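On the actual question: one way to keep each trimmed read with its header is to stream the file and write both lines as soon as a record is complete, instead of collecting lists first. A minimal sketch, assuming every record is exactly two lines (header, then sequence) and that trimming means dropping bases 1 through `end`, as in the expected output in the question. `trim_fasta` and `END_RE` are names made up for this example, and the demo records are invented.

```python
import io
import re

END_RE = re.compile(r"\bend=(\d+)")  # assumed header format: "...,end=12,..."

def trim_fasta(in_handle, out_handle):
    """Copy two-line FASTA records, dropping bases 1..end from each read."""
    header = None
    for line in in_handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line
        elif header is not None:
            match = END_RE.search(header)
            end = int(match.group(1)) if match else 0  # no annotation: keep whole read
            out_handle.write(f"{header}\n{line[end:]}\n")
            header = None

# Small in-memory demo; with real files use open(path) and open(path, "w").
src = io.StringIO(
    ">r1 1 length=10 (type=T,start=2,end=4,length=3,identity=100%)\n"
    "CtttGTGAGG\n"
    ">r2 2 length=6\n"
    "ACGTAC\n"
)
dst = io.StringIO()
trim_fasta(src, dst)
print(dst.getvalue(), end="")
```

With real data the two `StringIO` objects would be replaced by `with open(args.fasta) as src, open(args.output, "w") as dst:`. If the sequences can wrap over several lines, a FASTA parser such as Biopython's `SeqIO` is probably a safer choice than the two-line assumption used here.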