Hello, biostars!
I want to get familiar with the MetaBAT binning pipeline (https://bitbucket.org/berkeleylab/metabat/wiki/Example_Large_Data). I downloaded the dataset and tried to follow the instructions, which are:
- Run Ray with kmer=31 and bloom filter defaults
- run BBmap to Ray scaffolds >= 2500 bp, idfilter=90
- run megahit on unmapped reads and mates
- concatenate Ray >=2500 and megahit assemblies
My dataset is under the folder ~/data/, and I ran the command:

mpiexec -n 1 Ray -k 31 -detect-sequence-files ~/data/ -o RayOutput

In the folder "RayOutput", I found a file named "Scaffolds.fasta". I do not know whether this step is correct.
Then I ran the command:

bbmap.sh in=Scaffolds.fasta out=trial.sam idfilter=0.9 minlen=2500

but got a message like this:
Exception in thread "main" java.lang.RuntimeException: Can't find file ref/genome/1/summary.txt
at fileIO.ReadWrite.getRawInputStream(ReadWrite.java:815)
at fileIO.ReadWrite.getInputStream(ReadWrite.java:780)
at fileIO.TextFile.open(TextFile.java:241)
at fileIO.TextFile.<init>(TextFile.java:92)
at dna.Data.setGenome2(Data.java:823)
at dna.Data.setGenome(Data.java:769)
at align2.BBMap.loadIndex(BBMap.java:306)
at align2.BBMap.main(BBMap.java:31)
My operating system is Ubuntu 14.04 LTS, and I believe I have installed Ray and BBMap correctly. But I can't get past the second step of the instructions, and I don't know where it goes wrong.
Could you give me any suggestions? Thanks.
Is the Scaffolds.fasta file empty or does it actually have something in it?
No, there are lots of sequences in Scaffolds.fasta. However, there is no folder called "ref" inside "RayOutput". Maybe the lack of "ref" leads to this problem. Do you know how to generate it?
Although I downloaded only part of the dataset (about 2.3 GB), it should be enough to run these steps. I just want to get familiar with the procedure before processing my own data.
If anyone could provide a script that runs the 4 steps correctly, it would be very helpful.
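A rough, untested sketch of the four steps as one script may help. All file names, the paired-read layout (reads_1.fq/reads_2.fq), and the megahit output file final.contigs.fa are assumptions or tool defaults, not something confirmed in this thread:

```shell
# Rough sketch of the four MetaBAT-example steps. Read-file names and
# the paired layout are assumptions; adjust to your actual data.

# 1) Ray assembly with k=31
mpiexec -n 1 Ray -k 31 -detect-sequence-files ~/data/ -o RayOutput

# 2) Map the reads back to Ray scaffolds (>= 90% identity),
#    keeping unmapped reads for the next step
bbmap.sh ref=RayOutput/Scaffolds.fasta
bbmap.sh in=reads_1.fq in2=reads_2.fq out=mapped.sam \
    outu=unmapped_1.fq outu2=unmapped_2.fq idfilter=0.9 minlen=2500

# 3) Assemble the unmapped reads and their mates with megahit
megahit -1 unmapped_1.fq -2 unmapped_2.fq -o mega_unmap

# 4) Concatenate the Ray and megahit assemblies
cat RayOutput/Scaffolds.fasta mega_unmap/final.contigs.fa > combined.fasta
```

This only sketches the control flow; the wiki's ">= 2500 bp" scaffold filter would still need a length-filtering step of your choice before step 4.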
bbmap.sh ref=/path_to/scaffolds.fasta
will generate a BBMap index for that reference. It will be placed in a directory called "ref", with subdirectories and files underneath. You can move that folder into the right spot (RayOutput?) when done.
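As an aside, BBMap can also index and map in a single command: passing ref= together with in= builds the index on the fly, and the nodisk flag keeps it in memory so no ref/ folder is written at all. The read-file names below are placeholders:

```shell
# Index the reference and map in one step; 'nodisk' keeps the index
# in memory instead of writing a ref/ directory to the working folder.
bbmap.sh ref=Scaffolds.fasta in=reads_1.fq in2=reads_2.fq \
    out=trial.sam idfilter=0.9 nodisk
```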
And you will need to amend the bbmap command accordingly.
Thanks for your advice. I ran "bbmap.sh ref=scaffolds.fasta" and it generated a folder called "ref".
Then I ran the command "bbmap.sh in=Scaffolds.fasta out=trial.sam idfilter=0.9 minlen=10" and the result was:
It seems to work. By the way, how to interpret the result? Is there any instruction on some website?
What's more, I am confused by step 3, "run megahit on unmapped reads and mates". I read the instructions at https://github.com/voutcn/megahit, but they did not mention anything about unmapped reads. What is an unmapped read? After running bbmap, I just got one file, "trial.sam". Could you please explain? Thanks.
In this small example all reads appear to map to scaffolds.fasta. Stats from BBMap are self-explanatory. Do you have a question about something specific?
If you want to collect unmapped reads, then the bbmap command would need to be modified. By default, out= contains both mapped and unmapped reads. Explicitly adding outm= catches only mapped reads, and outu= will contain unmapped reads. See more info about BBMap here.

Thank you so much!
I ran the command
bbmap.sh in=Scaffolds.fasta out=mapped.sam outu=unmapped.bam idfilter=0.9 minlen=500
and got the unmapped reads in "unmapped.bam". Then I ran the command:

megahit -r unmapped.bam -o mega_unmap

but it failed. The error in mega_unmap/log is:

MEGAHIT v1.0.4-beta-4-g58b0995
--- [Mon May 2 23:07:58 2016] Start assembly. Number of CPU threads 4 ---
--- [Mon May 2 23:07:58 2016] Available memory: 8083988480, used: 7275589632
--- [Mon May 2 23:07:58 2016] k list: 21,41,61,81,99 ---
--- [Mon May 2 23:07:58 2016] Converting reads to binaries ---
/home/jin/biotools/megahit/megahit_asm_core buildlib mega_unmap/tmp/reads.lib mega_unmap/tmp/reads.lib
[read_lib_functions-inl.h : 209] Lib 0 (unmapped.bam): se, 11257 reads, 19 max length
[utils.h : 124] Real: 0.0037 user: 0.0039 sys: 0.0000 maxrss: 7368
--- [Mon May 2 23:07:58 2016] Extracting solid (k+1)-mers for k = 21 ---
cmd: /home/biotools/megahit/megahit_sdbg_build count -k 21 -m 2 --host_mem 7275589632 --mem_flag 1 --gpu_mem 0 --output_prefix mega_unmap/tmp/k21/21 --num_cpu_threads 4 --num_output_threads 1 --read_lib_file mega_unmap/tmp/reads.lib
[sdbg_builder.cpp : 116] Host memory to be used: 7275589632
[sdbg_builder.cpp : 117] Number CPU threads: 4
[cx1.h : 450] Preparing data...
[read_lib_functions-inl.h : 256] Before reading, sizeof seq_package: 90072
[read_lib_functions-inl.h : 260] After reading, sizeof seq_package: 90072
[cx1_kmer_count.cpp : 136] 11257 reads, 19 max read length
[cx1.h : 457] Preparing data... Done. Time elapsed: 0.0010
[cx1.h : 464] Preparing partitions and initialing global data...
[cx1_kmer_count.cpp : 227] 2 words per substring, 2 words per edge
[cx1_kmer_count.cpp : 364] Memory for reads: 90104
[cx1_kmer_count.cpp : 365] max # lv.1 items = 0
[cx1.h : 480] Preparing partitions and initialing global data... Done. Time elapsed: 0.0120
[cx1.h : 486] Start main loop...
Bucket 65536 too large for lv1: 0 > 0
Error occurs when running "sdbg_builder count/read2sdbg", please refer to mega_unmap/log for detail
[Exit code 1]
I have no idea how to deal with this error. Do you have any advice?
Look in the mega_unmap/log file for details, as the error message suggests. You may be running megahit wrong. See this link, which has an error similar to yours.
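One thing to check: megahit only accepts fasta/fastq input. If your unmapped reads came out as SAM/BAM, convert them first, e.g. with BBTools' reformat.sh (reformat.sh in=unmapped.sam out=unmapped.fq). Even a plain awk one-liner shows the idea; the toy record below is made up:

```shell
# One made-up unmapped SAM record (FLAG 4), just to show the conversion.
printf 'r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTACGT\t*\n' > unmapped_toy.sam

# SAM column 1 is the read name and column 10 the sequence; printing
# ">name" then the sequence yields fasta that megahit can read.
awk '!/^@/ {print ">" $1 "\n" $10}' unmapped_toy.sam > unmapped_toy.fasta
```

For real data you would use reformat.sh rather than awk, since it handles quality strings, pairing, and reverse-complemented records properly.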
Thanks for your advice.
I realize I made a silly mistake: the input to megahit should be fasta or fastq, but I used a sam file as the input, which led to that error.
When I ran the command:

bbmap.sh in=Scaffolds.fasta out=trial.sam outu=unmapped.sam idfilter=0.9 minlen=500

I got a file named "unmapped.sam". But this is the content of "unmapped.sam":

I did not see any sequences. Is that sam file a correct output of bbmap?
As we discussed above, in this toy example all sequences in your input appear to map to scaffolds.fasta, so there are no sequences left unmapped. That is likely why your unmapped.sam file contains just headers and no sequences. You may want to start with a bigger dataset as input to get unmapped reads.

Thanks for your patience. I will have a try. You really helped a lot :)
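For anyone landing on this thread later, "unmapped" is encoded in the SAM FLAG field: column 2 is a bitmask, and bit 0x4 set means the read did not align to the reference. A tiny self-contained sketch (the read names and sequences are made up):

```shell
# Build a tiny made-up SAM file: r1 is mapped (FLAG 0),
# r2 is unmapped (FLAG 4).
printf '@HD\tVN:1.0\n' > toy.sam
printf 'r1\t0\tscaffold_1\t100\t60\t19M\t*\t0\t0\tACGTACGTACGTACGTACG\t*\n' >> toy.sam
printf 'r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTACGTACGTACGTACG\t*\n' >> toy.sam

# Keep only unmapped records: skip @ header lines and test bit 0x4
# of the FLAG (column 2) without relying on non-POSIX awk bit ops.
awk '!/^@/ && int($2/4) % 2 == 1' toy.sam > unmapped_only.sam

cat unmapped_only.sam   # only r2 remains
```

This is what outu= is doing under the hood: BBMap writes out exactly the records with that flag bit set.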