Hello, biostars!
I want to get familiar with the MetaBAT binning pipeline (https://bitbucket.org/berkeleylab/metabat/wiki/Example_Large_Data). I downloaded the dataset and tried to follow the instructions, which are:
- Run Ray with kmer=31 and bloom filter defaults
- run BBmap to Ray scaffolds >= 2500 bp, idfilter=90
- run megahit on unmapped reads and mates
- concatenate Ray >=2500 and megahit assemblies
My dataset is under the folder ~/data/, and I ran the command:

mpiexec -n 1 Ray -k 31 -detect-sequence-files ~/data/ -o RayOutput

In the folder "RayOutput", I found a file named "Scaffolds.fasta". I do not know whether this step is correct.
Then I ran the command:

bbmap.sh in=Scaffolds.fasta out=trial.sam idfilter=0.9 minlen=2500

but got a message like this:
Exception in thread "main" java.lang.RuntimeException: Can't find file ref/genome/1/summary.txt
at fileIO.ReadWrite.getRawInputStream(ReadWrite.java:815)
at fileIO.ReadWrite.getInputStream(ReadWrite.java:780)
at fileIO.TextFile.open(TextFile.java:241)
at fileIO.TextFile.<init>(TextFile.java:92)
at dna.Data.setGenome2(Data.java:823)
at dna.Data.setGenome(Data.java:769)
at align2.BBMap.loadIndex(BBMap.java:306)
at align2.BBMap.main(BBMap.java:31)
My operating system is Ubuntu 14.04 LTS, and I believe I have installed Ray and BBMap correctly. But I can't get past the second step of the instructions, and I don't know where it goes wrong.
Could you give me any suggestions? Thanks.
Is the Scaffolds.fasta file empty or does it actually have something in it?
No, there are lots of sequences in Scaffolds.fasta. However, there is no folder called "ref" inside "RayOutput". Maybe the lack of "ref" leads to this problem. Do you know how to generate it?
Although I downloaded only part of the dataset (about 2.3 GB), it should be enough to run these steps. I just want to get familiar with the procedure before processing my own data.
If anyone could provide a script that runs the 4 steps correctly, it would be very helpful.
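A rough, untested sketch of the four steps as one script may help. All file names, the paired-read layout (reads_1.fq/reads_2.fq), and the megahit output file final.contigs.fa are assumptions or tool defaults, not something confirmed in this thread:

```shell
# Rough sketch of the four MetaBAT-example steps. Read-file names and
# the paired layout are assumptions; adjust to your actual data.

# 1) Ray assembly with k=31
mpiexec -n 1 Ray -k 31 -detect-sequence-files ~/data/ -o RayOutput

# 2) Map the reads back to Ray scaffolds (>= 90% identity),
#    keeping unmapped reads for the next step
bbmap.sh ref=RayOutput/Scaffolds.fasta
bbmap.sh in=reads_1.fq in2=reads_2.fq out=mapped.sam \
    outu=unmapped_1.fq outu2=unmapped_2.fq idfilter=0.9 minlen=2500

# 3) Assemble the unmapped reads and their mates with megahit
megahit -1 unmapped_1.fq -2 unmapped_2.fq -o mega_unmap

# 4) Concatenate the Ray and megahit assemblies
cat RayOutput/Scaffolds.fasta mega_unmap/final.contigs.fa > combined.fasta
```

This only sketches the control flow; the wiki's ">= 2500 bp" scaffold filter would still need a length-filtering step of your choice before step 4.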
bbmap.sh ref=/path_to/scaffolds.fasta
will generate a BBMap index for that reference. It will be placed in a directory called "ref", with subdirectories and files underneath. You can move that folder into the right spot (RayOutput?) when done.
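As an aside, BBMap can also index and map in a single command: passing ref= together with in= builds the index on the fly, and the nodisk flag keeps it in memory so no ref/ folder is written at all. The read-file names below are placeholders:

```shell
# Index the reference and map in one step; 'nodisk' keeps the index
# in memory instead of writing a ref/ directory to the working folder.
bbmap.sh ref=Scaffolds.fasta in=reads_1.fq in2=reads_2.fq \
    out=trial.sam idfilter=0.9 nodisk
```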
And you will need to amend the bbmap command accordingly.
Thanks for your advice. I ran "bbmap.sh ref=scaffolds.fasta" and it generated a folder called "ref".
Then I ran the command "bbmap.sh in=Scaffolds.fasta out=trial.sam idfilter=0.9 minlen=10" and the result was:
It seems to work. By the way, how to interpret the result? Is there any instruction on some website?
What's more, I am confused by step 3, "run megahit on unmapped reads and mates". I read the instructions at https://github.com/voutcn/megahit, but they did not mention anything about unmapped reads. What is an unmapped read? After running bbmap, I just got one file, "trial.sam". Could you please explain? Thanks.
In this small example all reads appear to map to scaffolds.fasta. Stats from BBMap are self-explanatory. Do you have a question about something specific?
If you want to collect unmapped reads, then the bbmap command would need to be modified. By default, out= contains both mapped and unmapped reads. Explicitly adding outm= catches only mapped reads, and outu= will contain unmapped reads. See more info about BBMap here.

Thank you so much!
I ran the command
bbmap.sh in=Scaffolds.fasta out=mapped.sam outu=unmapped.bam idfilter=0.9 minlen=500
and got the unmapped reads in "unmapped.bam". Then I ran the command:

megahit -r unmapped.bam -o mega_unmap

but it failed. The error in mega_unmap/log is:

MEGAHIT v1.0.4-beta-4-g58b0995
--- [Mon May 2 23:07:58 2016] Start assembly. Number of CPU threads 4 ---
--- [Mon May 2 23:07:58 2016] Available memory: 8083988480, used: 7275589632
--- [Mon May 2 23:07:58 2016] k list: 21,41,61,81,99 ---
--- [Mon May 2 23:07:58 2016] Converting reads to binaries ---
/home/jin/biotools/megahit/megahit_asm_core buildlib mega_unmap/tmp/reads.lib mega_unmap/tmp/reads.lib
[read_lib_functions-inl.h : 209] Lib 0 (unmapped.bam): se, 11257 reads, 19 max length
[utils.h : 124] Real: 0.0037 user: 0.0039 sys: 0.0000 maxrss: 7368
--- [Mon May 2 23:07:58 2016] Extracting solid (k+1)-mers for k = 21 ---
cmd: /home/biotools/megahit/megahit_sdbg_build count -k 21 -m 2 --host_mem 7275589632 --mem_flag 1 --gpu_mem 0 --output_prefix mega_unmap/tmp/k21/21 --num_cpu_threads 4 --num_output_threads 1 --read_lib_file mega_unmap/tmp/reads.lib
[sdbg_builder.cpp : 116] Host memory to be used: 7275589632
[sdbg_builder.cpp : 117] Number CPU threads: 4
[cx1.h : 450] Preparing data...
[read_lib_functions-inl.h : 256] Before reading, sizeof seq_package: 90072
[read_lib_functions-inl.h : 260] After reading, sizeof seq_package: 90072
[cx1_kmer_count.cpp : 136] 11257 reads, 19 max read length
[cx1.h : 457] Preparing data... Done. Time elapsed: 0.0010
[cx1.h : 464] Preparing partitions and initialing global data...
[cx1_kmer_count.cpp : 227] 2 words per substring, 2 words per edge
[cx1_kmer_count.cpp : 364] Memory for reads: 90104
[cx1_kmer_count.cpp : 365] max # lv.1 items = 0
[cx1.h : 480] Preparing partitions and initialing global data... Done. Time elapsed: 0.0120
[cx1.h : 486] Start main loop...
Bucket 65536 too large for lv1: 0 > 0
Error occurs when running "sdbg_builder count/read2sdbg", please refer to mega_unmap/log for detail
[Exit code 1]
I have no idea how to deal with this error. Do you have any advice?
Look in the mega_unmap/log file for details, as the error message suggests. You may be running megahit wrong. See this link, which has an error similar to yours.
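One thing to check: megahit only accepts fasta/fastq input. If your unmapped reads came out as SAM/BAM, convert them first, e.g. with BBTools' reformat.sh (reformat.sh in=unmapped.sam out=unmapped.fq). Even a plain awk one-liner shows the idea; the toy record below is made up:

```shell
# One made-up unmapped SAM record (FLAG 4), just to show the conversion.
printf 'r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTACGT\t*\n' > unmapped_toy.sam

# SAM column 1 is the read name and column 10 the sequence; printing
# ">name" then the sequence yields fasta that megahit can read.
awk '!/^@/ {print ">" $1 "\n" $10}' unmapped_toy.sam > unmapped_toy.fasta
```

For real data you would use reformat.sh rather than awk, since it handles quality strings, pairing, and reverse-complemented records properly.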
Thanks for your advice.
I realize I made a silly mistake: the input to megahit should be fasta or fastq, but I used a sam file as the input, which led to that error.
When I ran the command:

bbmap.sh in=Scaffolds.fasta out=trial.sam outu=unmapped.sam idfilter=0.9 minlen=500

I got a file named "unmapped.sam". But this is the content of "unmapped.sam":

I did not see any sequences. Is that sam file a correct output of bbmap?
As we discussed above, in this toy example all sequences in your input appear to map to scaffolds.fasta, so there are no sequences left unmapped. That is likely why your unmapped.sam file contains just headers and no sequences. You may want to start with a bigger dataset as input to get unmapped reads.

Thanks for your patience. I will have a try. You really helped a lot :)
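For anyone landing on this thread later, "unmapped" is encoded in the SAM FLAG field: column 2 is a bitmask, and bit 0x4 set means the read did not align to the reference. A tiny self-contained sketch (the read names and sequences are made up):

```shell
# Build a tiny made-up SAM file: r1 is mapped (FLAG 0),
# r2 is unmapped (FLAG 4).
printf '@HD\tVN:1.0\n' > toy.sam
printf 'r1\t0\tscaffold_1\t100\t60\t19M\t*\t0\t0\tACGTACGTACGTACGTACG\t*\n' >> toy.sam
printf 'r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTACGTACGTACGTACG\t*\n' >> toy.sam

# Keep only unmapped records: skip @ header lines and test bit 0x4
# of the FLAG (column 2) without relying on non-POSIX awk bit ops.
awk '!/^@/ && int($2/4) % 2 == 1' toy.sam > unmapped_only.sam

cat unmapped_only.sam   # only r2 remains
```

This is what outu= is doing under the hood: BBMap writes out exactly the records with that flag bit set.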