Question

what is the best way to cut a lot of embl file at given position in a way to obtain a smaller embl file ?

0

Entering edit mode

10.5 years ago

CrLs ▴ 10

Hello,

I have a lot of Embl files and i would like to get some sequence from it (for example 'from position : 19944 to 27500 in genome DC000456)). I know I can do it from NCBI but it will be file by file. There is any python/perl script who can do that automatically or some other site ?

Thanks in advance.

sequence • 3.8k views

ADD COMMENT • link updated 4.1 years ago by Ram 45k • written 10.5 years ago by CrLs ▴ 10

0

Entering edit mode

Is DC000456 a real EMBL accession ID for a genome? I don't find it using a web search.

ADD REPLY • link updated 4.1 years ago by Ram 45k • written 10.5 years ago by Neilfws 49k

Ram · Answer 1 · 2014-10-07

1

Entering edit mode

10.5 years ago

Neilfws 49k

There are libraries for both Python and Perl which include a subsequence method, so you can use them to write your own script.

For BioPython - Bio.SeqIO and Bio.SeqRecord; for BioPerl - search the beginners HOWTO for subseq.

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 10.5 years ago by Neilfws 49k

0

Entering edit mode

Hey, thank you for you answer.

I know i can write it myself. I just wanted to know if there was a know tool already out. I found the .extract() tool on biopython. I'll try it this way.

Charles

ADD REPLY • link updated 4.1 years ago by Ram 45k • written 10.5 years ago by CrLs ▴ 10

0

Entering edit mode

Generally if you search the web for solutions to these common sequence processing tasks, you'll find libraries for writing your own code rather than ready-made solutions. The closest thing to an existing tool might be biopieces, which I have not used (due to the many dependencies for installation), but which might go something iike this:

read_embl -i myfile.embl | extract_seq -b 19944 -e 27500

ADD REPLY • link updated 4.1 years ago by Ram 45k • written 10.5 years ago by Neilfws 49k

0

Entering edit mode

Yeah, I'm aware of that. I'll write it my self. But I guess, those kind of script have been made a lot of time by scientist. It's just a waste of time overall.

Anyway, thank you for your answer (and time)!

ADD REPLY • link updated 4.1 years ago by Ram 45k • written 10.5 years ago by CrLs ▴ 10

Ram · Answer 2 · 2014-10-28

1

Entering edit mode

10.5 years ago

bioslayer ▴ 50

The sfetch utility from biosquid can allow you to do that and specify the format you want to extract your sequences in, simply,

sfetch -d file.embl -f 19944 -t 27500 -F fasta .

Note the "." at the end.

You can as well rename the seq you've extracted

sfetch -d CP001581.1.embl -f 3 -t 20 -F fasta -r "extracted_19944_27500" .

ADD COMMENT • link updated 4.1 years ago by Ram 45k • written 10.5 years ago by bioslayer ▴ 50

0

Entering edit mode

Hello,

Thank you for answering. I just post a python script I did, I wasn't aware of biosquid. I'll think about it next time.

C.

ADD REPLY • link updated 4.1 years ago by Ram 45k • written 10.5 years ago by CrLs ▴ 10

Ram · Answer 3 · 2014-10-28

Hey hello,

Thx for you answer.

I wrote this litlle python script to do the job. Forgot to post it.

Fell free to use.

from Bio import Entrez, SeqIO
Entrez.email = "Youremail@email.com"
for i, j, start, end in [('AE009948','Nameoftheregion1', 80000, 90000),
                      ('AE009948','Nameoftheregion2', 100000, 110000)]:
    handle = Entrez.efetch(db="nucleotide", id=i, seq_start=start,
            seq_stop=end, rettype="gb")
    Fasta = open(j + '.fasta', 'a')
    for seq_record in SeqIO.parse(handle, "gb"):
        for feature in seq_record.features:
            if feature.type == "CDS":
                if 'translation' in feature.qualifiers:
                    CDS_seq = feature.qualifiers['translation'][0]
                    protid = feature.qualifiers['protein_id'][0]
                    Fasta.write(">" + protid + "_" + j + "\n" + str(CDS_seq) + "\n")
    Fasta.close()

This one, get a multifasta prot. not a fasta from the region.