what is the best way to cut a lot of embl file at given position in a way to obtain a smaller embl file ?
3
0
Entering edit mode
10.1 years ago
CrLs ▴ 10

Hello,

I have a lot of Embl files and i would like to get some sequence from it (for example 'from position : 19944 to 27500 in genome DC000456)). I know I can do it from NCBI but it will be file by file. There is any python/perl script who can do that automatically or some other site ?

Thanks in advance.

sequence • 3.6k views
ADD COMMENT
0
Entering edit mode

Is DC000456 a real EMBL accession ID for a genome? I don't find it using a web search.

ADD REPLY
1
Entering edit mode
10.1 years ago
Neilfws 49k

There are libraries for both Python and Perl which include a subsequence method, so you can use them to write your own script.

For BioPython - Bio.SeqIO and Bio.SeqRecord; for BioPerl - search the beginners HOWTO for subseq.

ADD COMMENT
0
Entering edit mode

Hey, thank you for you answer.

I know i can write it myself. I just wanted to know if there was a know tool already out. I found the .extract() tool on biopython. I'll try it this way.

Charles

ADD REPLY
0
Entering edit mode

Generally if you search the web for solutions to these common sequence processing tasks, you'll find libraries for writing your own code rather than ready-made solutions. The closest thing to an existing tool might be biopieces, which I have not used (due to the many dependencies for installation), but which might go something iike this:

read_embl -i myfile.embl | extract_seq -b 19944 -e 27500
ADD REPLY
0
Entering edit mode

Yeah, I'm aware of that. I'll write it my self. But I guess, those kind of script have been made a lot of time by scientist. It's just a waste of time overall.

Anyway, thank you for your answer (and time)!

ADD REPLY
1
Entering edit mode
10.1 years ago
bioslayer ▴ 50

The sfetch utility from biosquid can allow you to do that and specify the format you want to extract your sequences in, simply,

sfetch -d file.embl -f 19944 -t 27500 -F fasta .

Note the "." at the end.

You can as well rename the seq you've extracted

sfetch -d CP001581.1.embl -f 3 -t 20 -F fasta -r "extracted_19944_27500" .
ADD COMMENT
0
Entering edit mode

Hello,

Thank you for answering. I just post a python script I did, I wasn't aware of biosquid. I'll think about it next time.

C.

ADD REPLY
0
Entering edit mode
10.1 years ago
CrLs ▴ 10

Hey hello,

Thx for you answer.

I wrote this litlle python script to do the job. Forgot to post it.

Fell free to use.

from Bio import Entrez, SeqIO
Entrez.email = "Youremail@email.com"
for i, j, start, end in [('AE009948','Nameoftheregion1', 80000, 90000),
                      ('AE009948','Nameoftheregion2', 100000, 110000)]:
    handle = Entrez.efetch(db="nucleotide", id=i, seq_start=start,
            seq_stop=end, rettype="gb")
    Fasta = open(j + '.fasta', 'a')
    for seq_record in SeqIO.parse(handle, "gb"):
        for feature in seq_record.features:
            if feature.type == "CDS":
                if 'translation' in feature.qualifiers:
                    CDS_seq = feature.qualifiers['translation'][0]
                    protid = feature.qualifiers['protein_id'][0]
                    Fasta.write(">" + protid + "_" + j + "\n" + str(CDS_seq) + "\n")
    Fasta.close()

This one, get a multifasta prot. not a fasta from the region.

ADD COMMENT

Login before adding your answer.

Traffic: 1844 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6