Question

Glimmer 3 multi-extract input problem

10.4 years ago

thepunkoklos • 0

I have a multi-FASTA file and want to extract the gene-coding regions with Glimmer's multi-extract. I have already run the glimmer3 script, which gave me two files: a .predict and a .detail. But when I try to use multi-extract, it just gives me an error. Its usage message asks for this:

USAGE:  multi-extract [options] <sequence-file> <coords>

Read multi-fasta-format <sequence-file> and extract from it the
subsequences specified by <coords>. By default, <coords>
is the name of a file containing lines of the form
  <id>  <tag>  <start>  <stop>  [<frame>] ...
<id> is the identifier for the subsequence
<tag> is the tag of the sequence in <sequence-file> from which
to extract the entry

The glimmer3 package itself doesn't say where the <coords> file is supposed to come from. I assume it is the .predict file, though a BioLinux website suggested that the long-orfs output would do. (In any case, long-orfs doesn't seem to work with multi-FASTA input: it only extracts the ORFs from the first contig in my file.) But the .predict file doesn't have the right structure. For a start, it doesn't have both an <id> and a <tag> column on each line. It looks something like this:

>contig-7
orf00002     1741      461 
orf00003     3381     1747 
>Wcontig-7000023
>Wcontig-11112
orf00001      426     2648 
orf00002     2710     4581 
orf00003     4569     5480 
orf00004     6990     6133 
orf00006     9180     7108 
orf00007    10201     9209 
orf00008    11663    10203 
orf00009    12489    11680 
orf00010    13153    12473 
orf00011    14382    13225 
orf00013    14715    15968 
orf00014    19868    16410 
>Wcontig-1674000002
orf00001     2995      637 
orf00002     2497     1166 
orf00003     2984     2529

Does anybody know if I'm doing something terribly wrong or do I have to apply some commands to the file in order for it to meet multi-extract rules?
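One possible way to bridge the two formats is sketched below. It assumes that multi-extract accepts whitespace-separated `<id> <tag> <start> <stop>` lines, as the usage text above suggests, and that each `>` line in the .predict file names the contig for the ORF lines that follow it. The function name `predict_to_coords` is hypothetical, and because the ORF ids restart on every contig, the sketch prefixes each id with the contig tag to keep them unique.

```python
def predict_to_coords(predict_lines):
    """Turn Glimmer3 .predict lines into '<id>  <tag>  <start>  <stop>' lines."""
    coords = []
    tag = None
    for line in predict_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: take the first token after '>' as the contig tag.
            header = line[1:].split()
            tag = header[0] if header else ""
            continue
        if tag is None:
            raise ValueError("ORF line found before any '>' header line")
        fields = line.split()
        orf, start, stop = fields[0], fields[1], fields[2]
        # ORF ids repeat across contigs, so make them unique with the tag.
        coords.append("  ".join([f"{tag}_{orf}", tag, start, stop]))
    return coords


# A few lines taken from the .predict excerpt above.
example = """>contig-7
orf00002     1741      461
orf00003     3381     1747
>Wcontig-7000023
>Wcontig-11112
orf00001      426     2648
""".splitlines()

for row in predict_to_coords(example):
    print(row)
```

Writing the printed rows to a file would give something to try as the `<coords>` argument. Whether multi-extract then accepts it has not been verified here.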

Hi, have you solved this problem? I ran into the same issue. What did you do next?

Reply written 9.7 years ago by SilentGene ▴ 120

score 0 · Answer 1 · 2018-11-19

Sorry for bumping an old thread, but I found it while struggling with a similar problem to the original poster's, and I found something of a solution that I thought was worth sharing. Note that this answer doesn't address the specific problem of getting Glimmer to do this job, but it does address the underlying task of extracting regions by coordinates - the output should be the same. Apologies if I should have posted this as a comment instead of an answer - feel free to tell me off and / or move this.

So my situation was that I wanted to extract a set of regions, using coordinates, from a multi-fasta file.

I extracted the sequences which contain the desired regions first (using a custom script, many are available), and wrote them to a new fasta file.

I originally tried to use Glimmer to extract the specific regions but hit the same problem as the original poster - the documentation isn't as clear as might be desired. I ended up abandoning Glimmer completely and instead used getfasta from BedTools. This is a help page on getfasta from BedTools' support website: https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html

There are several examples there showing the input file structures and the getfasta commands for a variety of tasks, starting with the default behaviour and working up to the other options, but it's all very simple. All you need is a fasta file and a bed file. For a few sequences you could literally write the bed file manually (if you felt so inclined) - just make sure to use tab separators instead of space separators.
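For more than a handful of regions, the bed file can be generated rather than typed. The following is a minimal sketch that builds tab-separated BED lines from Glimmer3 .predict lines like those in the question. It rests on several assumptions: Glimmer coordinates are 1-based and inclusive while BED is 0-based and half-open; a fourth .predict column, if present, is the reading frame with a leading + or -; otherwise a start greater than the stop is taken to mean the reverse strand. Genes that wrap around the origin of a circular sequence are not handled. The function name `predict_to_bed` is hypothetical.

```python
def predict_to_bed(predict_lines):
    """Convert Glimmer3 .predict lines into tab-separated BED6 lines."""
    rows = []
    tag = None
    for line in predict_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: take the first token after '>' as the contig name.
            header = line[1:].split()
            tag = header[0] if header else ""
            continue
        if tag is None:
            raise ValueError("ORF line found before any '>' header line")
        fields = line.split()
        orf, start, stop = fields[0], int(fields[1]), int(fields[2])
        # Assumption: an optional 4th column is the frame, e.g. +1 or -2.
        if len(fields) > 3 and fields[3][0] in "+-":
            strand = fields[3][0]
        else:
            strand = "-" if start > stop else "+"
        low, high = min(start, stop), max(start, stop)
        # 1-based inclusive -> 0-based half-open: subtract 1 from the low end only.
        name = f"{tag}_{orf}"
        rows.append("\t".join([tag, str(low - 1), str(high), name, "0", strand]))
    return rows


example = """>contig-7
orf00002     1741      461
>Wcontig-11112
orf00001      426     2648
""".splitlines()

for row in predict_to_bed(example):
    print(row)
```

With a name in column 4 and a strand in column 6, getfasta's -name and -s options should be able to label the output records and reverse-complement the minus-strand genes, though that combination has not been tested here.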

The simplest way of doing this is:

$ cat test.fa
>chr1
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG

$ cat test.bed
chr1	5	10

$ bedtools getfasta -fi test.fa -bed test.bed -fo test.fa.out

$ cat test.fa.out
>chr1:5-10
AAACC

The documentation in the link suggests that using the -fo argument to specify an output file is optional, but I found it necessary to specify this in order to get any output (without it I just got a message showing how to use getfasta). Of note also is the -name option, which will carry across a name specified in the 4th column of the .bed file (which is absent in the above example).
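To see why the interval `5 10` yields AAACC in the example above: BED coordinates are 0-based and half-open, which is the same convention as a Python slice. The sketch below reproduces the default getfasta behaviour for that one case in plain Python. It is only an illustration of the coordinate convention, not a replacement for bedtools, and the helper names `read_fasta` and `extract` are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        else:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def extract(records, bed_line):
    """Return (header, subsequence) for one tab-separated BED line."""
    chrom, start, end = bed_line.rstrip("\n").split("\t")[:3]
    # BED start is 0-based and end is exclusive, exactly like seq[start:end].
    return f">{chrom}:{start}-{end}", records[chrom][int(start):int(end)]


fasta = [">chr1", "AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG"]
records = read_fasta(fasta)
header, subseq = extract(records, "chr1\t5\t10")
print(header)
print(subseq)
```

The slice covers positions 5 to 9 counting from zero (6 to 10 counting from one), which is worth keeping in mind when converting 1-based coordinates from other tools.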