Is there any way to use a script to feed EMBOSS with FASTA sequences?
4.9 years ago
schlogl ▴ 160

Hi there
I have a compressed FASTA file (really big) and I was wondering if it is possible to use SimpleFastaParser (Biopython) together with gzip.open to feed the sequences into EMBOSS compseq through cml?
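
The reading side would be something like this (just a sketch; the file name is a placeholder):

    import gzip
    from Bio.SeqIO.FastaIO import SimpleFastaParser

    # Stream records straight out of the gzipped FASTA without decompressing it to disk.
    with gzip.open("proteins.fasta.gz", "rt") as handle:  # "rt" gives a text-mode handle
        for title, sequence in SimpleFastaParser(handle):
            # each (title, sequence) pair would then be handed to compseq somehow
            print(title, len(sequence))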

Thank you


If you are thinking about feeding as in streaming the data in, then most likely not.


You can probably use something like seqtk subseq in.fq name.lst > out.fq, but pipe the results into the other program rather than writing them to a file.


I will check whether seqtk works with compressed files. Thank you.


I don't see why it would not. Please check (https://github.com/lh3/seqtk) for more details.


You can use other options like Jellyfish or kmercountexact.sh from BBMap if you are only looking to get unique k-mer counts from your file. These programs will work with compressed files directly (the ones from BBMap will, at least).


Hi genomax

Does Jellyfish work with proteins too? I don't think so. Sorry, I forgot to say that the file is a protein FASTA (fasta.aa)...

Thanks for your time.


You could try this program. It was important to mention that you have protein sequences.


I don't see any reason that your described approach wouldn't work. Have you already tried it?


You can try GNU parallel. It can send in chunks of lines or bytes of the input file, but I am not sure about binary files.


What is cml? Have you tried simple shell scripting - why do you need to use biopython?


To read the compressed file with gzip and then pass/stream each sequence to EMBOSS one at a time.

And because I don't know much about bash scripting.


You can do everything you need to do inside (bio)python:

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc54

You haven't said exactly how large your file is, and python is not famed for its speed, so even with the best code in the world I still can't guarantee that this will be fast enough for you.

It seems EMBOSS will not accept piped input, however, so without much hacking, genomax's advice of finding a tool that natively supports compressed amino acid data would be preferable. Through Python you could still make many iterative subprocess calls once Biopython has parsed each entry out for you (not the same as truly piping the data, though).


How big is really big?


Hi Joe, it is the same file we already discussed in another question... 67 GB! I am counting 5-mers using Python, but I would like to use another tool as confirmation.

I have some code to read the file as a generator, but I am still reading the EMBOSS manual and the compseq documentation.
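
For what it's worth, my counting code is roughly along these lines (a simplified sketch, not the exact code; the path is a placeholder):

    import gzip
    from collections import Counter
    from Bio.SeqIO.FastaIO import SimpleFastaParser

    def read_fasta_gz(path):
        # Generator yielding (title, sequence) tuples from a gzipped FASTA file.
        with gzip.open(path, "rt") as handle:
            yield from SimpleFastaParser(handle)

    def count_kmers(path, k=5):
        counts = Counter()
        for _title, seq in read_fasta_gz(path):
            counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
        return counts

    counts = count_kmers("proteins.fasta.aa.gz", k=5)
    print(counts.most_common(10))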

I tried to pass the sequences like this:

Calculate the composition of unique words in sequences
Input sequence(s): FASTA::hiv
Error: Failed to open filename 'hiv'
Error: Unable to read sequence 'FASTA::hiv'
Input sequence(s): fasta::hiv.txt
Error: Failed to open filename 'hiv.txt'
Error: Unable to read sequence 'fasta::hiv.txt'
Died: compseq terminated: Bad value for '-sequence' and no more retries

As specified here: http://www.csd.hku.hk/bruhk/emboss.html

The test file in this case has only one sequence, but I always get this error.

What am I doing wrong?

PS: You're right, Joe, it seems EMBOSS is not a good tool in my case.
I will try genomax's suggestion.

Thanks anyway.


I'm not sure I understand what you're showing me (what is :: supposed to be?)

In order to pass data to compseq you'll need to reconstitute a file to give to -sequence. If you still want to do this with Python, take a look at the tempfile module.

The steps at a high level would be:

  1. Read compressed file
  2. Use biopython to get 1 record at a time
  3. Create a temporary file as a fasta or similar via biopython+tempfile
  4. Use tempfile with subprocess to dispatch compseq jobs to the shell
  5. Optionally purge the tempfile once the output file is generated.
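
Roughly, something like this (paths are placeholders, and do check the compseq options against the EMBOSS docs for your install):

    import gzip
    import os
    import subprocess
    import tempfile
    from Bio.SeqIO.FastaIO import SimpleFastaParser

    with gzip.open("proteins.fasta.gz", "rt") as handle:                # 1. read compressed file
        for i, (title, seq) in enumerate(SimpleFastaParser(handle)):    # 2. one record at a time
            # 3. write the record out as a temporary FASTA file
            with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
                tmp.write(f">{title}\n{seq}\n")
                tmp_path = tmp.name
            # 4. dispatch a compseq job on the temporary file (compseq must be on PATH)
            subprocess.run(
                ["compseq", "-sequence", tmp_path, "-word", "5",
                 "-outfile", f"record_{i}.comp"],
                check=True,
            )
            # 5. purge the temporary file once the output exists
            os.remove(tmp_path)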

This should be reasonably memory efficient, I think (though I'm not intimately familiar with how Python's gzip module works), but it will be slow (if you have any high-speed storage, such as whatever /tmp/ is usually mounted on, I'd strongly advise using that).

I used a similar approach here if you want some reference material: https://github.com/jrjhealey/Oread/blob/master/oread/__main__.py#L208-L246


That's what I was thinking of doing, something similar to that...

I have never worked with the subprocess and tempfile modules, but I will check them out.

Thanks Joe.


That's some scary code 8)

I don't know if I am skilled enough to write something like that, but I will try.

Paulo


You don't need everything that's in the code I linked. If you just look at the documentation/examples for each of the modules, they'll show you the minimum you need.

It will be tricky, as that's a lot of steps to glue together, but this is the sort of thing Python is actually particularly good at, so it will be good practice.


Thanks Joe.

I appreciate your effort to help.

Thank you very much.
