Question

Problem with "split_and_run_sparc.sh" from DBG2OLC pipeline

0

8.2 years ago

Josué Barrera ▴ 10

Hi everybody!

I'm having a problem in the consensus stage of the DBG2OLC pipeline. I'm using the script "split_and_run_sparc.sh" to obtain the "final_assembly.fasta" file from my backbone file (backbone_raw.fasta) and my reads (ctg_pb.fasta). I ran the script using the following command:

sh ./split_and_run_sparc.sh backbone_raw.fasta DBG2OLC_Consensus_info.txt ctg_pb.fasta /tmp/consensus_dir 2 >cns_log.txt

While the script was running, an error message appeared:

Traceback (most recent call last):
  File "./split_reads_by_backbone.py", line 131, in <module>
  File "./split_reads_by_backbone.py", line 122, in main
IOError: [Errno 24] Too many open files: '/tmp/consensus_dir/backbone-1627.reads.fasta'
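Errno 24 means the process reached its per-process limit on open file descriptors. One plausible cause (an assumption, since the script's source is not shown in this thread) is that the splitter keeps one output handle open per backbone. Below is a minimal sketch of a splitting pattern that never holds more than one output file open at a time. The function `write_reads_by_group`, its arguments, and the record layout are hypothetical, and this is not the DBG2OLC implementation:

```python
import os
from collections import defaultdict


def write_reads_by_group(records, out_dir, flush_every=10000):
    """Write FASTA records into one file per group without keeping files open.

    records: iterable of (group_name, header, sequence) tuples (hypothetical layout).
    Files are opened in append mode, so out_dir should start out empty.
    """
    os.makedirs(out_dir, exist_ok=True)
    buffers = defaultdict(list)  # group name -> FASTA text waiting to be written

    def flush():
        # Open, append to, and close each file in turn: one handle at a time.
        for group, chunks in buffers.items():
            path = os.path.join(out_dir, f"{group}.reads.fasta")
            with open(path, "a") as handle:
                handle.writelines(chunks)
        buffers.clear()

    pending = 0
    for group, header, sequence in records:
        buffers[group].append(f">{header}\n{sequence}\n")
        pending += 1
        if pending >= flush_every:
            flush()
            pending = 0
    flush()  # write whatever is still buffered
```

The trade-off is memory for file handles: `flush_every` caps how many records are buffered before they are written out.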

After the analysis, I observed some inconsistencies between the "backbone_raw.fasta" file and the "final_assembly.fasta" file:

---------------- Information for assembly 'backbone_raw.fasta' ----------------

                                       Number of contigs       1906
                          Number of contigs in scaffolds          0
                      Number of contigs not in scaffolds       1906
                                   Total size of contigs  252974640
                                          Longest contig    2502428
                                         Shortest contig       4957
                               Number of contigs > 1K nt       1906 100.0%
                              Number of contigs > 10K nt       1872  98.2%
                             Number of contigs > 100K nt        512  26.9%
                               Number of contigs > 1M nt         31   1.6%
                              Number of contigs > 10M nt          0   0.0%
                                        Mean contig size     132725
                                      Median contig size      35400
                                       N50 contig length     449759
                                        L50 contig count        147

---------------- Information for assembly 'final_assembly.fasta' ----------------

                                       Number of contigs       1020
                          Number of contigs in scaffolds          0
                      Number of contigs not in scaffolds       1020
                                   Total size of contigs  223116219
                                          Longest contig    2502428
                                         Shortest contig         83
                               Number of contigs > 1K nt       1018  99.8%
                              Number of contigs > 10K nt       1009  98.9%
                             Number of contigs > 100K nt        470  46.1%
                               Number of contigs > 1M nt         31   3.0%
                              Number of contigs > 10M nt          0   0.0%
                                        Mean contig size     218741
                                      Median contig size      82745
                                       N50 contig length     548456
                                        L50 contig count        117

The main inconsistencies between the two files are:

The number of contigs almost halved (1906 to 1020)
The total size of the assembled genome is reduced (since I have 886 fewer contigs)
Some contigs became smaller (the shortest contig went from 4957 nt to 83 nt)
N50, mean and median contig sizes are inflated (as a by-product of losing contigs)
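To illustrate the last point, here is a small sketch of how N50 responds when short contigs disappear from an assembly. The contig lengths are made-up toy values, not taken from either assembly above, and `n50` is a helper written for this example:

```python
def n50(lengths):
    """Return the length L such that contigs of length >= L hold at least half the bases."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")


# Toy assembly: four longer contigs plus six short ones.
toy_lengths = [50, 40, 30, 20] + [10] * 6
full_n50 = n50(toy_lengths)          # all ten contigs
reduced_n50 = n50(toy_lengths[:4])   # the six short contigs are missing
print(full_n50, reduced_n50)
```

Dropping the short contigs shrinks the total size, so fewer of the long contigs are needed to reach half of it and N50 goes up even though no contig got any longer.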

Does anyone know whether the inconsistencies between the two files are caused by the error message that appeared while the script was running? Or is this the normal output one should expect from the consensus stage of the pipeline?

P.S.: I could not run the command "ulimit -n unlimited" before running the script, since I don't have root privileges on the cluster I'm working on. I'm not sure whether this explains the inconsistencies or the error message.
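On the privileges question: an unprivileged user can normally raise the soft open-files limit up to the hard limit; only raising the hard limit itself needs root. Here is a minimal sketch using Python's standard `resource` module (Unix only). Whether the hard limit on a given cluster is high enough for ~1900 backbones is an assumption that would need checking, and `raise_soft_nofile_limit` is a helper name made up for this example:

```python
import resource


def raise_soft_nofile_limit():
    """Raise the soft open-files limit to the hard limit, if the platform allows it.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        try:
            # No root needed: the soft limit may go up as far as the hard limit.
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            # Some platforms refuse this request; keep the existing limits.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


limits_before = resource.getrlimit(resource.RLIMIT_NOFILE)
limits_after = raise_soft_nofile_limit()
print("before:", limits_before, "after:", limits_after)
```

The new limit applies only to this process and the children it starts, so the pipeline would have to be launched from the same process (or the equivalent `ulimit -n` change made in the shell that starts it).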

genome Assembly correction hybrid • 2.8k views

updated 8.2 years ago by colindaven 7.0k • written 8.2 years ago by Josué Barrera ▴ 10

score 1 · Answer 1 · 2016-09-21

1

8.2 years ago

colindaven 7.0k

I had a problem with this stage too. I never got a final assembly out but was stuck at the "backbone_raw.fa" stage.

I did have root access and tried repeatedly to set the ulimit, but it didn't work well, and there are only so many times you can restart servers in a cluster without starting to annoy people.

I got a reasonable final assembly out using Racon https://github.com/isovic/racon in the end.

0

I'll try it out.

Thank you very much!

ADD REPLY • link 8.2 years ago by Josué Barrera ▴ 10

0

I am also having issues with the consensus stage of DBG2OLC, but in my case the generated "final_assembly.fasta" is empty, even though there is no error message.

So I would like to try your suggestion and run Racon on the "backbone_raw.fasta" assembly from DBG2OLC. However, I don't know which file to use as the overlap/alignment input that Racon requires ("Racon takes as input only three files: contigs in FASTA/FASTQ format, reads in FASTA/FASTQ format and overlaps/alignments between the reads and the contigs in MHAP/PAF/SAM format"). The DBG2OLC manual is not very clear, and I'm not sure whether such a file is actually generated during the assembly. Do you remember which file you used in your case, or whether you had to generate an overlap/alignment file with different software?

ADD REPLY • link 5.8 years ago by mths_b ▴ 40
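One common way to get the overlap file Racon asks for is to map the long reads back to the backbone with a separate aligner such as minimap2, which writes PAF by default. The thread does not say which aligner colindaven used, so the sketch below is an assumption. The file names `reads.fasta`, `overlaps.paf`, and `polished.fasta` are hypothetical, and the `map-pb` preset assumes PacBio reads:

```python
import subprocess


def build_polish_commands(contigs, reads, paf="overlaps.paf"):
    """Return the minimap2 and racon argument lists for one polishing round."""
    # minimap2 <target> <query>: map the reads to the contigs, PAF on stdout.
    map_cmd = ["minimap2", "-x", "map-pb", contigs, reads]
    # racon <sequences> <overlaps> <target sequences>: polished FASTA on stdout.
    racon_cmd = ["racon", reads, paf, contigs]
    return map_cmd, racon_cmd


def run_polish(contigs, reads, paf="overlaps.paf", polished="polished.fasta"):
    """Run both steps; requires minimap2 and racon to be on PATH."""
    map_cmd, racon_cmd = build_polish_commands(contigs, reads, paf)
    with open(paf, "w") as paf_out:
        subprocess.run(map_cmd, stdout=paf_out, check=True)
    with open(polished, "w") as fasta_out:
        subprocess.run(racon_cmd, stdout=fasta_out, check=True)


# Show the commands without running anything.
for cmd in build_polish_commands("backbone_raw.fasta", "reads.fasta"):
    print(" ".join(cmd))
```

Calling `run_polish("backbone_raw.fasta", "reads.fasta")` would execute the two steps. Further rounds can be run by mapping the reads to the polished output and calling Racon again.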