Hello,
I am using GNU parallel to speed up my BLAST jobs. I have seen the example outlined in the following post (Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them) and used the command:
cat 1gb.fasta | parallel --block 100k --recstart '>' --pipe blastp -evalue 0.01 -db db.fa -query - > results
In the BLAST output, some sequences are missing (about 30 out of 5000). When I run parallel and just examine the blocks it generates, it appears to lose a number of FASTA records each time it creates a new block, as if the block were not being split at the correct place. Does anyone have any clue as to why this is happening? Any help is appreciated.
Thank you.
Yes, when I do that, sequences are missing. If I run wc on the original input and on the parallel output, the line count goes from 10000 to 9588. In the parallel output, the first line starts as:
whereas in the original file, it started as:
etc.
I think there are other instances of missing sequences (at each point where a new block is made), but they are hard to find without going through 10000 lines manually. Do you have any ideas how I could troubleshoot this? Parallel is extremely useful to me, but with this issue I cannot use it.
edit: I have found another instance where there is a skip in the read numbers and where the header is altered.
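One way to avoid checking 10000 lines by hand is to compare the header lines of the two files directly. The following is only a minimal sketch, not something taken from the thread: the function names (count_records, fasta_headers, missing_headers) and the file names are hypothetical, and it assumes every record header starts with '>' at the beginning of a line.

```shell
#!/bin/sh
# Sketch: find FASTA records that are present in the original file but
# absent from the output of a pass-through run. All names are illustrative.

# Print the number of records (lines beginning with '>') in a FASTA file.
count_records() {
    grep -c '^>' "$1"
}

# Print the header lines of a FASTA file, sorted in the C locale so that
# the ordering matches what comm expects.
fasta_headers() {
    grep '^>' "$1" | LC_ALL=C sort
}

# Print the headers that occur in file $1 but not in file $2.
missing_headers() {
    orig_tmp=$(mktemp) || return 1
    out_tmp=$(mktemp) || { rm -f "$orig_tmp"; return 1; }
    fasta_headers "$1" > "$orig_tmp"
    fasta_headers "$2" > "$out_tmp"
    # comm -23 keeps only the lines unique to the first sorted input.
    LC_ALL=C comm -23 "$orig_tmp" "$out_tmp"
    rm -f "$orig_tmp" "$out_tmp"
}
```

With the original input in original.fasta and the output of the pass-through run saved as passthrough.fasta (both placeholder names), calling missing_headers original.fasta passthrough.fasta would list the headers that were lost or altered, and count_records on each file gives a quick record count to compare.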
If that is true, you have found a bug. Can you make an example available for download? Quoting it here is unfortunately not enough as \n may be quoted wrongly.
Sure. I hope this is an appropriate download: http://ge.tt/6d3f9g62/v/0?c?c
I also tried parallel with piping to cat on a different fasta file with simpler headers (to see whether the headers of the original file were the problem), but it still did the same thing.
Also, the exact command I use is:
It gives exactly the same on 3 of my systems:
So what is hitting you is something on your local system. This changes the bug from a simple fix to harder debugging, and that should not be done on Biostars.org. Post to bug-parallel@gnu.org and follow "REPORTING BUGS" in 'man parallel'.
Okay, thank you very much.
Did you ever find a resolution to this issue? I have experienced the same issue with GNU parallel 20160422.