Entering edit mode
4.8 years ago
schlogl
▴
160
Hi there
I have a compressed fasta file (really big) and I was thinking if it is possible to use the SimpleFastaParser(biopython), gzip.open to feed the sequeces into emboss compseq through cml?
Thank you
If you are thinking about
feeding
as instreaming in
then most likely not.You probably can use something like this
seqtk subseq in.fq name.lst > out.fq
, but pipe the results into the other program rather than to a file.I will checkit out if seqtk works with compressed files. Thank you
I don't see why it would not. Please check (https://github.com/lh3/seqtk) for more details.
You can use other options like
jellyfish
,kmercountexact.sh
from BBMap if you are only looking get unique k-mer counts from your file. These programs will work with compressed files directly (ones from BBMap will)..Hi genomax
jeelyfish works with proteins too? I don't think so, sorry I forgot to say that the file is fasta.aa...
Thanks for your time.
You could try this program. It was important to note that you have protein sequences.
I don't see any reason that your described approach wouldn't work. Have you already tried it?
You can try parallel. It can send in chunks of lines or bytes, of input file. But binary file I am not sure.
What is
cml
? Have you tried simple shell scripting - why do you need to use biopython?to read the compressed file together with gzip and then pass/stream to emboss each sequence at time.
And because a don't know much about bash scripting.
You can do everything you need to do inside (bio)python:
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc54
You haven't said exactly how large your file is, and python is not famed for its speed, so even with the best code in the world I still can't guarantee that this will be fast enough for you.
It seems EMBOSS will not accept piped output however, so without much hacking, genomax's advice of finding a tool that natively supports compressed amino acid data would be preferable. Through python you could make many iterative subprocess calls though once biopython has parsed the entry out for you (not the same as truly piping the data though).
How big is
really big
?Hi Joe it is the same file we already discussed in another question... 67GB! I am counting 5mers using python, but I would like to use another tool as a confirmation.
I have some code to read the file as a generator, but I reading the emboss manual and about seqcomp.
I tried to pass the sequences like this:
As specified in here: http://www.csd.hku.hk/bruhk/emboss.html
The test file in this case has only one sequence, but i always got this error.
What I am doing wrong?
PS- You right Joe it seems EMBOSS it is not a good tool in my case.
I will try genomax suggestion.
Thanks anyway.
I'm not sure I understand what you're showing me (what is
::
supposed to be?)In order to pass to compseq you'll need to reconstitute a file to pass to
-sequence
. If you really want to do this with python still, take a look at thetempfile
module.The steps at a high level would be:
subprocess
to dispatch compseq jobs to the shellThis will be reasonably memory efficient I should think (though I'm not familiar with how python's gzip module works), but will be slowwww (if you have any high-speed storage, such as that which
/tmp/
is usually mounted to, I'd strongly advise using that )I used a similar approach here if you want some reference material: https://github.com/jrjhealey/Oread/blob/master/oread/__main__.py#L208-L246
Thats what I was thinking to do some similar like that...
I never worked with subprocess and tempfile modules, but I will check it out.
Thanks Joe.
That's a scary code 8)
I don't know if I am skilled enough to write some like that, but I will try.
Paulo
You don't need everything thats in the code I linked. If you just look at the documentation/examples for each of the modules, they'll show you what you need to as a minimum.
It will be tricky, as thats a lot of steps to glue together, but this is the sort of thing that python is actually particularly good at so this will be good practice.
Thanks Joe.
I appreciate you effort to help.
Thank you very much.