Entering edit mode
4.1 years ago
lorenzinip
•
0
Hi I have a large FASTA file which looks like this
>EMBOSS_001
GTCATCACAGTTTTCCCCGCCCTGTATATGGCTAATAGGCCCTCGCAATCTCCGATAAAT
>EMBOSS_002
CTGATGCTAGTCCCGTGTCCCAAACACTTCCGCAGAAGATCGCCCCGGGGGGCGTGTACC
>EMBOSS_003
CGCGCATGGACTCCATCCGTGATCTTTTGAGGCCATGAGTCCAAGTTTACCTCGGATATA
>EMBOSS_004
CGACCCGCCATTCTCCATCGTAACTTAGTCACGACGACAGTCAGCTTGTTCGTTCGTTAT
I would like to find all the sequences which have a specific motif and eliminate them. For example if the motifs is TTTCCC, the expected output should be:
>EMBOSS_002 CTGATGCTAGTCCCGTGTCCCAAACACTTCCGCAGAAGATCGCCCCGGGGGGCGTGTACC
>EMBOSS_003 CGCGCATGGACTCCATCCGTGATCTTTTGAGGCCATGAGTCCAAGTTTACCTCGGATATA
>EMBOSS_004 CGACCCGCCATTCTCCATCGTAACTTAGTCACGACGACAGTCAGCTTGTTCGTTCGTTAT
I wrote a code with Biopython:
from Bio.Seq import Seq
import Bio.motifs as motifs
from Bio import SeqIO
instances = [Seq("TTTCCC")]
m = motifs.create(instances)
reads = list(SeqIO.parse("/Users/EMBOSS-6.6.0/emboss/genome.fa", "fasta"))
for i in range(len(reads)):
for pos, seq in m.instances.search(reads[i].seq):
print("%i %s" % (pos, seq))
However it returns me only the info of the position of the start of the motifs, 11 TTTCCC
I would like to return the info of also the sequence where it was found:
EMBOSS_001 11 TTTCCC
In addition, I would like the code to eliminate that sequence where the motif was found.
Unless you're doing something more complicated you haven't told us, you don't need to use
Bio.motifs
.You can just string search:
You can change
rec
to aprint
or whatever you need to do with that specific record; calling therec
directly should print the summary. It's not clear to me if you need the indexes for the match since you've said you just intend to throw the record away, but if you do you can simply useindex
.(Code untested).