Replace a FASTA sequence according to its id with gaps
1
0
Entering edit mode
4.7 years ago
yoshikal • 0

I have 2000 fasta sequence in a file, like this:

 >T1        
AQSFDRATVFEARRAGYQRESRVALGKSTGVLEWHVFHVWAPRETTILVETLSQIECSGKGIADRRQENPLTI------ATAI--TSQLLELVDIVIMDFKAITQFFL     
>T484      
AQSFDRATVFEARRAGYQREARVALGKSTGKLEWQVFHVWAPRETTILVETLSQLECSGKGIADRRQENPLKI------ATAI--TSQLLELVDIVIMDFKAITQFFL     
>T582      
AQSFDRATVFEKRRAGYQREARVALGKSTGKLEWQVFHVWAPRETTILVETLSQLECSGKGIADRRQENPLKI------ATAI--TSQLLELVDIVIMDFKAITQFFL     
>T1424     
AQSFDRATVFEKRRAGYQLEARVALGKSTGKLEWQVFHVWAPRETTILVETLSQLECSGKGIANRRQENPLKI------ATAI--TSQLLELVDILIMDFKAITQFFL     
>T1552     
AQSFDRATIFEKRRAGYQIEARVALGKSTGKLEWQVYHAWAPRETTILVETLSQLENAGKGVANRRHENPLKI------ATAI--TSQLLELVDILMMDFKAITQFFL     
.
.
.

I randomly draw f (f = 2) sequence from N (N = 2000).

For example if I have drawn 120 and 1500, I get the Fasta sequence:

> T1424
AQSFDRATVFEKRRAGYQLEARVALGKSTGKLEWQVFHVWAPRETTILVETLSQLECSGKGIANRRQENPLKI ------ ATAI - TSQLLELVDILIMDFKAITQFFL
> T1552
AQSFDRATIFEKRRAGYQIEARVALGKSTGKLEWQVYHAWAPRETTILVETLSQLENAGKGVANRRHENPLKI ------ ATAI - TSQLLELVDILMMDFKAITQFFL

What I want to do is replace these sequences with gaps of length 55:

> T1424
-----------------------------------------------------------------------------------------------------------------------------
> T1552
-----------------------------------------------------------------------------------------------------------------------------

I tried with this code:

from Bio import SeqIO 
from random import *

records = list(SeqIO.parse("fichier1.txt", "fasta")) 
#print(records[0].id)  # first record
#print(records[0].seq)

N=len(records) #2000 

f=2 

l=[] 

for i in range(f):
    x=randint(1, N)
    l.append(x)

d={} 
for i2 in l:
    print(records[i2].id,records[i2].seq)
    d[str(records[i2].id)]=str(records[i2].seq)
    with open("fichier1.txt") as fichier, open("newfile.txt", "w") as newfile:
        texte = fichier.read()

        new_text += texte.replace(str(records[i2].seq), "---------------------------------------")
        nouveaufichier.write(new_text)

print(d)

What does not work because sometimes there can be the same sequence in the file but with a different identifier.

From the ID and the corresponding sequence, I would like to change the sequence to introduce gaps.

Python Fasta • 684 views
ADD COMMENT
0
Entering edit mode

Switch to BioPython and replace the sequence based on the identifier. This seems to be a very specific "solve my code problem" question. What is the purpose of this task?

EDIT: Just saw that you're already using BioPython.

ADD REPLY
1
Entering edit mode
4.7 years ago
Joe 22k

As easy as:

Code

from Bio import SeqIO

seqs = list(SeqIO.parse("seqs.fa", "fasta"))

for rec in seqs:
    if rec.id == "chimpanze":
        rec.seq = "-"*len(rec.seq)
    print(">{}\n{}".formatrec.id,rec.seq))

Input data

>mutant
GTTGGGAGGCTATGTGTTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAGAGAATC
>gorrila
GTTGGGAGGCTATGTGTGACTGGAAGGACATCCTGTCGGGTGGCGAGAAGCAGAGAATC
>chimpanze
GTTGGGAGGCTGTGTGTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAGAGAATC
>human
GTTGGGAGGCTATGTGTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAGAGAATC
>olive
GTTGGGAGGCTATGTGTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAAAGAATC

Run code

>mutant
GTTGGGAGGCTATGTGTTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAGAGAATC
>gorrila
GTTGGGAGGCTATGTGTGACTGGAAGGACATCCTGTCGGGTGGCGAGAAGCAGAGAATC
>chimpanze
-----------------------------------------------------------
>human
GTTGGGAGGCTATGTGTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAGAGAATC
>olive
GTTGGGAGGCTATGTGTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAAAGAATC

You can probably add the logic for your random n choose k bit.

ADD COMMENT

Login before adding your answer.

Traffic: 3040 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6