De novo assembly program
6.5 years ago
Chvatil ▴ 130

Hi everyone, I was wondering what kind of tool you use to assemble your genome. I already tried IDBA-UD, but to make it work I had to reduce the number of reads.

Then I tried ALLPATHS-LG, but it is not suitable for my data: I only have paired-end reads (R1 and R2) in FASTQ format.

I also tried MaSuRCA, but many people hit a problem partway through the pipeline, and the developers do not seem to answer questions about this issue.

So, do you know a good program that would work with my data?

The two FASTQ files come from an Illumina HiSeq 3000 (150 bp reads), and the genome size of my species is around 1.5 Gb. Any clue would be appreciated.

Have a nice day!

assembly • 3.3k views

I already tried IDBA-UD, but to make it work I had to reduce the number of reads.

How did you reduce the number of reads? Did you just down-sample, or did you perform digital normalization? And did you get an assembly or not? I will be honest with you: with a 1.5 Gb genome and only paired-end Illumina sequencing, you won't get a good assembly no matter what you do.

You have been having problems with your assembly for some time, so let's take a step back and check for some potential problems:

1) How is the quality of your sequencing?

2) Are you trimming adapters?

3) What is the expected coverage?

4) Did you check for contaminants (bacterial / human / other species) in your reads?
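On point 3, expected coverage is just total sequenced bases divided by genome size. A quick shell sketch (the read-pair count here is a made-up example; substitute your own, e.g. from `seqkit stats`):

```shell
# Expected coverage = (read pairs * 2 * read length) / genome size.
# Hypothetical numbers: 250 M pairs of 2x150 bp reads, 1.5 Gb genome.
READ_PAIRS=250000000
READ_LEN=150
GENOME_SIZE=1500000000
COVERAGE=$(( READ_PAIRS * 2 * READ_LEN / GENOME_SIZE ))
echo "Expected coverage: ${COVERAGE}x"   # Expected coverage: 50x
```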


I would add to this: have you checked the overall k-mer abundance distribution, to see whether the sample is heterozygous, whether coverage is as expected, etc.? Jellyfish + GenomeScope are useful here. Similarly, preQC worked well for us in identifying problematic assemblies; it can do a bit of the above, and it compares your results to other known assemblies of varying complexity.
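For anyone landing here, a minimal Jellyfish run looks roughly like this (file names, thread count, and hash size are placeholders; the resulting histogram is what GenomeScope takes as input):

```shell
# Count canonical 21-mers (-C) from both read files; -s is an initial
# hash-size estimate and -t the number of threads.
jellyfish count -C -m 21 -s 1G -t 8 -o reads.jf R1.fastq R2.fastq
# Export the k-mer abundance histogram for GenomeScope.
jellyfish histo -t 8 reads.jf > reads.histo
```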

But I completely agree with h.mon: a 1.5 Gb genome requires much more than paired-end data alone. You need contextual information to resolve repetitive or otherwise problematic regions (high-quality mate pairs, long reads, 10x, Hi-C, etc.). Every large-scale assembly (and the strategy used) is different, but your original paired-end data can at least give you some hints about how complex the genome may be and which options would best improve the assembly.


In case you would appreciate a more pragmatic outlook on the expected results, compare these genome assemblies of the Chinese Hamster Ovary (CHO) cell lines (~2.5 Gbp genome):

  • CHO K1 CriGri_1.0, with >100k contigs, based on "sequencing libraries with insert sizes of 200 bp, 350 bp, 500 bp, 800 bp, 2 kb, 5 kb, 10 kb and 20 kb"
  • CHO K1 CHOK1S_HZDv1, with fewer than 10k contigs, which "used a hybrid approach with Illumina short range, Illumina long range with Dovetail Chicago library and Bionanogenomics optical mapping to construct the genome"

The >100k contig assembly used many different library sizes and was nevertheless difficult to work with.

6.5 years ago

As the commenters above mention, a large genome is not going to assemble well with just paired-end short-read data. You'll need a good set of mate-pair (a.k.a. long jumping distance) reads to go with it if you stick with Illumina.

However, I would really recommend adding a long-read library and going for a hybrid assembly. Nanopore will give you very long reads; why not ask for pricing from a GridION or PromethION service provider? Try to get the longest reads you can, you won't regret it.

Finally, for short-read assemblies I recommend SOAPdenovo (installable via Bioconda) for fast and decent assemblies. You'll probably need a lot of RAM (256-512 GB?).
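A minimal sketch of how a SOAPdenovo run is set up, assuming a single paired-end library (the insert size, k-mer, and file names are placeholders you would replace):

```shell
# Write the library config; avg_ins=350 is an assumed insert size.
cat > pe.config <<'EOF'
max_rd_len=150
[LIB]
avg_ins=350
reverse_seq=0
asm_flags=3
q1=R1.fastq
q2=R2.fastq
EOF
# Run all stages (pregraph, contig, map, scaff) with k=55 and 8 threads.
SOAPdenovo-127mer all -s pe.config -K 55 -p 8 -o asm 1>asm.log 2>asm.err
```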

Make sure you follow h.mon's advice above too.

Best of luck.


Thank you all for your help. I will try SOAPdenovo after running a k-mer estimation; I'll let you know how it goes. I'm on a lab server, so I think the RAM should be OK.

@h.mon I have already done an assembly for another sample of the same species. I got a coverage of about 10x, and I checked the quality with BUSCO (highly conserved genes among Arthropoda): I found around 80% of the genes, but only 40% were complete. This genome is full of repeat elements.
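(For reference, a BUSCO check of that kind looks roughly like this; the lineage dataset and file names are placeholders for whatever BUSCO version you have installed:)

```shell
# Assess assembly completeness against conserved arthropod genes.
busco -i assembly.fasta -l arthropoda_odb10 -m genome -c 8 -o busco_out
```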


If you have more data for that same species (sample?), you'd be better off combining all of it; it will not pay off to assemble each sample separately.

6.5 years ago
Joe 21k

I would also suggest SPAdes. It's pretty fast, even on large genomes, and is among the best in terms of contig stats, etc.

It'll trial several different k-mer sizes for you automatically.


Thank you very much! It's running; I used:

#PBS -S /bin/bash
#PBS -l nodes=1:ppn=8:bigmem,mem=100gb
#PBS -e /pandata/LEPIWASP/ACG-0006_0027/0006-spades.error
#PBS -o /pandata/LEPIWASP/ACG-0006_0027/0006-spades.out
#PBS -N ACG-0006_spades
#PBS -q q1hour


illumina_R1=/pandata/LEPIWASP/ACG-0006_0027/ACG-006_TGACCA_L003_R1.fastq
illumina_R2=/pandata/LEPIWASP/ACG-0006_0027/ACG-006_TGACCA_L003_R2.fastq

python /panhome/TOOLS/SPAdes-3.12.0/spades.py -1 $illumina_R1 -2 $illumina_R2 --careful --cov-cutoff auto -o /pandata/LEPIWASP/ACG-0006_0027/

How long do you think it will take? Should I put it in the one-week queue? And do you think 8 cores (ppn) and 100 GB of RAM are enough?

By the way, I ran KmerGenie and the best k-mer is 55; can SPAdes use this information?

The two FASTQ files come from an Illumina HiSeq 3000 (150 bp reads), and the genome size of my species is around 1.5 Gb.


An optimal k-mer of 55 with 150 bp input reads seems kinda low to me... is it a (highly) heterozygous genome?


I'm not aware of that, but I do know it is a butterfly genome full of repeat elements.


I don't think you need to bother with any optimisation for starters. SPAdes does several rounds with progressively larger k-mers and outputs the contigs from all of them, so you can assess for yourself if you wish.

I honestly can't tell you how long it'll take, as I've only ever assembled microbes with it (usually finishes in about 20 minutes, but that depends on the number of reads, etc.). It does support restarting from checkpoints, I believe, so even if your run doesn't complete in the wall time you have, it may not matter: you'll be able to resume with minimal computation lost.
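On the KmerGenie question above: SPAdes does accept an explicit list of k-mer sizes via -k (odd values in ascending order), so you could include 55 in the list. The exact list here is just an example:

```shell
# Explicit ascending odd k-mer list including the KmerGenie estimate.
spades.py -1 R1.fastq -2 R2.fastq -k 21,33,55 --careful -o spades_out/
```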


OK, thank you. When I put it on a one-day queue I get a broken pipe in the .error file; do you know how I can continue from the checkpoint when I rerun it in the one-day queue, instead of restarting from zero?


Not off the top of my head; I would suggest digging into their documentation if you intend to use more of the advanced features.
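For what it's worth, SPAdes can resume an interrupted run with --continue, pointed at the same output directory; all other options are taken from the original run:

```shell
# Resume from the last checkpoint recorded in the output directory.
spades.py --continue -o /pandata/LEPIWASP/ACG-0006_0027/
```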


The broken-pipe error is likely because the job gets killed on the cluster. Depending on how exactly that is done, it will throw this error as the job tries to write to a stream that is no longer open (due to the job kill). It has nothing to do with what you're executing in your job; it's a cluster-management thing.

I observe that too sometimes when jobs get killed for exceeding their walltime.


Yes, so if the job is killed I guess I cannot continue from the checkpoint. I put it on a one-week queue; that's OK.


Well, it should be able to continue from the checkpoint, regardless of whether the job was killed. Of course, it needs to have passed at least one checkpoint before it can continue; otherwise there is no checkpoint available to restart from.

I just wanted to point out that the error message you described is not related to the software you are running (SPAdes in this case) but is a notice from the cluster-management software.


Note that SPAdes is primarily developed for small genomes; it tends to use a lot of memory, and I have found it impossible to use for many assemblies above 200-300 Mb.

Have you tried Platanus? I believe it has worked for past butterfly genomes and can handle highly heterozygous genomes. You can at least build contigs and scaffold with the paired-end data.

Simply put, though, you will still get a fragmented assembly; paired-end data alone only gets you so far. You will still need mate pairs or other data for a better assembly. These days I would also recommend something like 10x, Hi-C, or long reads to supplement this (Platanus can also be used with mate pairs).
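A sketch of the three-stage Platanus pipeline with paired-end reads only (file names and thread/memory settings are placeholders; check the manual for the exact output file names of your version):

```shell
# 1) Build contigs (-m is the memory limit in GB).
platanus assemble -o out -f R1.fastq R2.fastq -t 8 -m 100 2> assemble.log
# 2) Scaffold with the same pairs (-IP1 = inward-facing paired-end library 1).
platanus scaffold -o out -c out_contig.fa -b out_contigBubble.fa \
    -IP1 R1.fastq R2.fastq -t 8 2> scaffold.log
# 3) Close gaps in the scaffolds.
platanus gap_close -o out -c out_scaffold.fa -IP1 R1.fastq R2.fastq \
    -t 8 2> gap_close.log
```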


Hi Chris, SPAdes is actually running, so maybe it will work? If SPAdes does not work I'll try Platanus as you advise, but if SPAdes is running, does that mean it is OK for my data? Note that I do not really need a perfectly assembled genome. For example, I used IDBA-UD for the genome of another butterfly of the same species; the assembly was not perfect, but I found at least 80% of the BUSCO genes, with 40% fragmented, which is fine for my analysis. By the way, with IDBA-UD I got an average coverage of 30x with only paired-end reads.


Hey, if it works well enough for your needs, then I think that's fine. We've had SPAdes killed for using too much memory on a number of genome assemblies in the 200-800 Mb range (we have a 1 TB assembly node and access to a 3 TB one if needed), and the assemblies that do finish take forever and are less ideal than those from other tools like MEGAHIT, which run far faster.
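If you want to try it, a basic MEGAHIT paired-end run is a one-liner (the output directory name is arbitrary; the contigs end up in final.contigs.fa inside it):

```shell
# Paired-end assembly with 8 threads; MEGAHIT produces contigs only,
# not scaffolds.
megahit -1 R1.fastq -2 R2.fastq -t 8 -o megahit_out
```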


OK, so should I use MEGAHIT instead of SPAdes for my data? It can also take only paired-end reads, right? In fact, I'm looking for an assembler that is not complicated to use and accepts only paired-end data for a large genome. Is Platanus the easiest to handle?


Isn't MEGAHIT focused on metagenomic analyses? (Which of course doesn't mean it can't work on 'normal' assemblies.)


MEGAHIT is running, but does it give me a scaffolds.fa output? I can only see a contigs output file.
