How to manually trim FASTA file sequences and store them in a new FASTA file?
23 months ago
pubsurfted ▴ 40

Hello, I have a FASTA file that has reads like this:

>SRR5655563.745 745 length=126 (type=T,start=2,end=12,length=11,identity=81.8182%)
CttttatttgttGTGTGTGAGGTTTGATTTGATGGGAAAATATCTTGAATCTGCGGCGAGGTTGGAAGAGCTATCGCGGATTGTGTCATCTGCTGCGAAGCCCAATAGGTCAAAGGGAACGCTACC

As you can see, the header gives the start and end of a region that I would like to trim off, keeping the rest of the read. For example:

The part I would like to trim:

> Cttttatttgtt

And after trimming, I would like to keep the rest of the read:

> GTGTGTGAGGTTTGATTTGATGGGAAAATATCTTGAATCTGCGGCGAGGTTGGAAGAGCTATCGCGGATTGTGTCATCTGCTGCGAAGCCCAATAGGTCAAAGGGAACGCTACC

So the new FASTA file will have reads like this:

>SRR5655563.745 745 length=126 (type=T,start=2,end=12,length=11,identity=81.8182%)
GTGTGTGAGGTTTGATTTGATGGGAAAATATCTTGAATCTGCGGCGAGGTTGGAAGAGCTATCGCGGATTGTGTCATCTGCTGCGAAGCCCAATAGGTCAAAGGGAACGCTACC

I am writing a Python script that takes a FASTA file as input and should output a trimmed FASTA file.

So far what I have written is this:

import argparse
import re

def main():
    parser = argparse.ArgumentParser()

    parser.add_argument("--fasta", type=str, required=True, help="input fasta file")
    parser.add_argument("--output", type=str, required=True, help="output tail.trimmed.fasta file")

    args = parser.parse_args()

    # argparse stores the --output value as args.output
    out_fasta = open(args.output, 'w')

    # Keep the file handle so next() can fetch the sequence line after each header
    in_fasta = open(args.fasta)
    for line in in_fasta:
        header = line.rstrip()
        read = next(in_fasta).rstrip()

        # Extract header into a list
        header_lst = header.split(" ")

        tmp_start_lst = []
        tmp_end_lst = []
        tmp_lst = []

        for i in header_lst:
            if i.startswith("(") and i.endswith(")"):
                tmp_lst.append(i.split(","))
        #print(tmp_lst)

        # Extract the start and end numbers in a list
        # ("field" instead of "str", which would shadow the built-in type)
        for i in tmp_lst:
            for field in i:
                if field.startswith("start="):
                    tmp_start_lst.append(field)
                if field.startswith("end="):
                    tmp_end_lst.append(field)

        pattern = re.compile(r'\d+$')

        start_lst = []
        for field in tmp_start_lst:
            match = re.search(pattern, field)
            start_lst.append(int(match.group()))

        end_lst = []
        for field in tmp_end_lst:
            match = re.search(pattern, field)
            end_lst.append(int(match.group()))

        # Extract reads into a list to manipulate i.e. trim
        read_lst = []
        read_lst.append(read)

        new_lst = []

        for seq in read_lst:
            for num_start in start_lst:
                for num_end in end_lst:
                    # Slice the span out by position; replace() would also
                    # delete any other copy of the same substring in the read
                    s = seq[:(num_start-1)] + seq[num_end:]
                    new_lst.append(s)

I do get the trimmed reads as a list, but I can't figure out how to write these trimmed reads, together with their corresponding headers, to a new FASTA file.

Also, if there is any way I could improve my code or use a better strategy to approach the problem, I would love to know.

Best wishes!

FASTA • 1.1k views

It's not forbidden to cross-post, but it's generally appreciated if you at least leave a link to the other community, so that users investing their time don't duplicate effort: https://bioinformatics.stackexchange.com/questions/20247/how-to-manually-trim-fasta-file-sequences-with-the-information-provided-in-the-h


Noted. I didn't know, and I'll keep that in mind for next time. Thanks.
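
For the question itself, a minimal sketch of one possible approach is to trim each read as soon as its header has been parsed and write the pair out immediately, so headers and reads never get separated into different lists. The names `COORDS`, `trim_read` and `trim_fasta` are made up for illustration. The sketch assumes every header carries `start=…,end=…` in that order, as in the example read, and that the coordinates are 1-based and inclusive. Under that reading, `start=2,end=12` leaves the leading `C` in place; if the output should instead begin right after `end`, as in the expected read shown in the question, returning `read[end:]` would be the change to make.

```python
import re

# Assumed header layout, taken from the example read:
# ">... (type=T,start=2,end=12,length=11,identity=81.8182%)"
COORDS = re.compile(r"start=(\d+),end=(\d+)")


def trim_read(header, read):
    """Remove the 1-based, inclusive start..end span named in the header."""
    match = COORDS.search(header)
    if match is None:
        return read  # no coordinates found: leave the read untouched
    start, end = int(match.group(1)), int(match.group(2))
    return read[:start - 1] + read[end:]


def write_record(fout, header, chunks):
    """Write one header line followed by its trimmed sequence line."""
    fout.write(header + "\n" + trim_read(header, "".join(chunks)) + "\n")


def trim_fasta(in_path, out_path):
    """Copy in_path to out_path, trimming each read as its header dictates."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        header, chunks = None, []
        for line in fin:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    write_record(fout, header, chunks)  # flush previous record
                header, chunks = line, []
            elif line:
                chunks.append(line)  # also handles sequences wrapped over lines
        if header is not None:
            write_record(fout, header, chunks)  # flush the last record
```

It could be called as `trim_fasta("reads.fasta", "tail.trimmed.fasta")`, where both file names are placeholders. Using `with` closes both files even if a malformed record raises an error partway through.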
