Question

How to manually trim FASTA file sequences and store them in a new FASTA file?

2.1 years ago

pubsurfted ▴ 40

Hello, I have a FASTA file that has reads like this:

>SRR5655563.745 745 length=126 (type=T,start=2,end=12,length=11,identity=81.8182%)
CttttatttgttGTGTGTGAGGTTTGATTTGATGGGAAAATATCTTGAATCTGCGGCGAGGTTGGAAGAGCTATCGCGGATTGTGTCATCTGCTGCGAAGCCCAATAGGTCAAAGGGAACGCTACC

As you can see, the header gives the start and end positions of a stretch of the sequence that I would like to trim off, keeping the rest of the read. For example:

The part I would like to trim:

> Ctttggtttcctttt

And after trimming, I would like to keep the rest of the read:

> GTGTGTGAGGTTTGATTTGATGGGAAAATATCTTGAATCTGCGGCGAGGTTGGAAGAGCTATCGCGGATTGTGTCATCTGCTGCGAAGCCCAATAGGTCAAAGGGAACGCTACC

So the new FASTA file will have reads like this:

>SRR5655563.745 745 length=126 (type=T,start=2,end=12,length=11,identity=81.8182%)
GTGTGTGAGGTTTGATTTGATGGGAAAATATCTTGAATCTGCGGCGAGGTTGGAAGAGCTATCGCGGATTGTGTCATCTGCTGCGAAGCCCAATAGGTCAAAGGGAACGCTACC
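The coordinates can be read straight from the header with a single regular expression. This is a minimal sketch, assuming `start` and `end` are 1-based and inclusive and that only that span is removed (the same span the script below targets). `parse_span` and `trim_span` are hypothetical helper names, not part of any library. Note that the example output above also drops the base in front of `start`; if that is the intent, keep only `seq[end:]`.

```python
import re

# Matches the "start=N,end=M" pair inside the parenthesised header annotation.
SPAN_RE = re.compile(r"start=(\d+),end=(\d+)")

def parse_span(header):
    """Return (start, end) as ints, or None if the header has no span."""
    match = SPAN_RE.search(header)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def trim_span(seq, start, end):
    """Remove positions start..end (assumed 1-based, inclusive) from seq."""
    return seq[:start - 1] + seq[end:]

header = ">SRR5655563.745 745 length=126 (type=T,start=2,end=12,length=11,identity=81.8182%)"
span = parse_span(header)
if span is not None:
    print(trim_span("CttttatttgttGTGTGTGAGG", *span))
```

Slicing by position avoids searching for the substring, so a repeated motif elsewhere in the read is left alone.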

I have written a Python script that takes a FASTA file as input and outputs a trimmed FASTA file.

So far what I have written is this:

import argparse
import re

def main():
    parser = argparse.ArgumentParser()

    parser.add_argument("--fasta", type=str, required=True, help="input fasta file")
    parser.add_argument("--output", type=str, required=True, help="output tail.trimmed.fasta file")

    args = parser.parse_args()

    # The --output argument is stored as args.output
    out_fasta = open(args.output, 'w')

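    # Iterate over one file handle so next() can fetch the sequence line
    # that follows each header (assumes single-line sequences)
    fasta_in = open(args.fasta)
    for line in fasta_in:
        header = line.rstrip()
        read = next(fasta_in).rstrip()

        # Split the header into space-separated fields
        header_lst = header.split(" ")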

        tmp_start_lst = []
        tmp_end_lst = []
        tmp_lst = []

        for i in header_lst:
            if i.startswith("(") and i.endswith(")"):
                tmp_lst.append(i.split(","))
        #print(tmp_lst)

        # Extract the start and end fields, e.g. "start=2" and "end=12"
        for i in tmp_lst:
            for field in i:
                if field.startswith("start="):
                    tmp_start_lst.append(field)
                if field.startswith("end="):
                    tmp_end_lst.append(field)

        # Pull the trailing integer out of each "start=N" / "end=N" field
        pattern = re.compile(r'\d+$')

        start_lst = []
        for field in tmp_start_lst:
            match = re.search(pattern, field)
            start_lst.append(int(match.group()))

        end_lst = []
        for field in tmp_end_lst:
            match = re.search(pattern, field)
            end_lst.append(int(match.group()))

        # Put the read in a list so it can be trimmed
        read_lst = []
        read_lst.append(read)

        new_lst = []

        for seq in read_lst:
            for num_start in start_lst:
                for num_end in end_lst:
                    # Slice by position: str.replace() would also delete any other
                    # copy of the same substring elsewhere in the read
                    s = seq[:num_start-1] + seq[num_end:]
                    new_lst.append(s)

I do get the trimmed reads in a list format, but what I'm having trouble figuring out is how to write these trimmed reads with their corresponding headers to a new FASTA file.
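One way to keep each header paired with its read is to write both lines as soon as the read has been trimmed, instead of collecting the reads in a list first. This is a minimal sketch, assuming two-line records (one header line, one sequence line); `write_trimmed` and its `trim_read` callback are hypothetical names, and the lambda in the demo is only a placeholder for the real trimming logic.

```python
import io

def write_trimmed(in_handle, out_handle, trim_read):
    """Copy two-line FASTA records, replacing each read with trim_read(header, read)."""
    for line in in_handle:
        header = line.rstrip()
        if not header.startswith(">"):
            continue  # skip blank or unexpected lines
        # The sequence is assumed to sit on the single line after the header.
        read = next(in_handle, "").rstrip()
        out_handle.write(header + "\n")
        out_handle.write(trim_read(header, read) + "\n")

# In-memory demo; with real files, pass handles from open(path) and open(path, "w").
demo_in = io.StringIO(">r1 (start=1,end=3)\nACGTAC\n>r2 (start=1,end=3)\nTTGGCC\n")
demo_out = io.StringIO()
write_trimmed(demo_in, demo_out, lambda header, read: read[3:])  # placeholder trim
print(demo_out.getvalue(), end="")
```

Opening the real files in a `with` block closes them automatically, which also ensures the output is flushed to disk.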

Also, if there is any way I could improve my code or use a better strategy to approach the problem, I would love to know.
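One possible improvement is to stop relying on every sequence fitting on a single line, since FASTA files often wrap sequences across several lines. This is a minimal sketch of a reader that joins wrapped lines; `read_fasta` is a hypothetical helper name, and it assumes headers start with `>` and that blank lines can be ignored.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs, joining sequences wrapped over several lines."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    # Emit the final record, which has no following header to trigger it.
    if header is not None:
        yield header, "".join(chunks)

for header, seq in read_fasta(io.StringIO(">a\nAC\nGT\n>b\nTT\n")):
    print(header, seq)
```

A dedicated FASTA parser from an established library would serve the same purpose; the generator above only uses the standard library.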

Best wishes!

FASTA • 1.2k views

It's not forbidden to cross-post but it's generally appreciated to at least leave a link to the other community to avoid double-effort for users investing their time: https://bioinformatics.stackexchange.com/questions/20247/how-to-manually-trim-fasta-file-sequences-with-the-information-provided-in-the-h