Question

Best Way to Assemble DNA Sequencing Data

0

4 months ago

echolley ▴ 20

Hi there,

I have Illumina sequencing data for a phage that was isolated through gel extraction. Prior to assembly, I trimmed the reads using trim_galore with default settings. I have tried assembling with both SPAdes and MEGAHIT, but I am getting very messy assemblies. According to the gel extraction, I should be getting a single contig of about 50,000 bp, but instead I'm getting thousands of contigs. Some of these are far larger than expected (700,000 bp). I have tried merging the reads prior to assembly on the advice of my PI, but this seems to make things even worse, producing much smaller contigs.

The FastQC reports are good; the only red flag is sequence overrepresentation (to be expected with DNA sequencing data).

I'm new at this, and would appreciate any advice or explanation as to why my analysis is going so haywire.

DNA-sequencing assembly • 664 views

ADD COMMENT • link written 4 months ago by echolley ▴ 20

0

I wouldn't make the mistake of thinking that what you "see in the biology" on a gel necessarily translates to the assembly, particularly if you're using a de Bruijn graph assembler.

Phages are notorious for having lots of repeats and 'junk' that can confuse assemblers. Generally, when the assembler finds a repeat that it cannot resolve, it will just break the sequence into two contigs at that point.
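To make the repeat problem concrete, here is a minimal toy sketch of my own (not how SPAdes or MEGAHIT are actually implemented). It builds a de Bruijn graph from a made-up sequence and reports the nodes where the path forks. The function name `branching_nodes` and the toy genome are assumptions for illustration; real assemblers work from reads, handle reverse complements and sequencing errors, and use coverage to resolve some forks.

```python
from collections import defaultdict


def branching_nodes(sequence, k):
    """Return the (k-1)-mers that have more than one distinct successor.

    Each k-mer in the sequence is an edge from its (k-1)-base prefix to its
    (k-1)-base suffix. A node with several outgoing edges is a point where a
    simple assembler cannot tell which path to follow.
    """
    successors = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        successors[kmer[:-1]].add(kmer[1:])
    return {node for node, nexts in successors.items() if len(nexts) > 1}


# Made-up toy genome: the 7 bp repeat GATTACA occurs twice with different flanks.
toy_genome = "CCGT" + "GATTACA" + "TGGC" + "GATTACA" + "AACT"

print(branching_nodes(toy_genome, 5))  # k-mers shorter than the repeat
print(branching_nodes(toy_genome, 9))  # k-mers longer than the repeat
```

With k=5 the graph forks at the end of the repeat (node TACA), which is where a contig would be cut. With k=9 the k-mers span the whole repeat and the fork disappears. Short reads cap how large k can usefully be, which is why repeats longer than the reads tend to fragment the assembly.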

I would suggest viewing the assembly graphs with Bandage to get a feel for the assembly.

Trimming is typically not a big deal: the sequencer can be configured to remove adapter sequences etc. in the first place, and many of the more sophisticated assemblers also have methods to ignore or mitigate their impact, but it often doesn't hurt.

It is also worth looking at your predicted coverage if you have a reference genome, as very high coverage depths can also 'choke' de Bruijn graph (DBG) assemblers, and this is pretty likely with smaller genomes.
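As a rough back-of-the-envelope check, expected coverage is just total sequenced bases divided by genome length. The sketch below uses hypothetical read counts (the thread does not say how many reads there are); only the ~50,000 bp genome size comes from the question.

```python
def expected_coverage(n_reads, read_length, genome_size):
    """Average depth = total sequenced bases / genome length."""
    return n_reads * read_length / genome_size


# Hypothetical numbers for illustration only: 2 million read pairs
# (4 million reads) of 150 bp against the ~50,000 bp expected from the gel.
depth = expected_coverage(n_reads=4_000_000, read_length=150, genome_size=50_000)
print(f"Expected coverage: {depth:,.0f}x")
```

With those assumed numbers the depth comes out at 12,000x, far more than a short-read assembler needs for a genome this small.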

The best thing you could do is obtain some long-read data for the same samples and perform a hybrid assembly, but a good assembly isn't impossible to achieve with short reads alone.

Do you know how big you expect the genome to be?

ADD REPLY • link 4 months ago by Joe 22k

0

I'm not sure how it performs for phages, but you might also consider trying the shovill pipeline, which can intelligently tweak some assembly parameters for smaller genomes (microbes).

https://github.com/tseemann/shovill

ADD REPLY • link 4 months ago by Joe 22k

0

Besides the reasons mentioned, you may have an overabundance of data going into the assembly, which may be confusing the assemblers. You may need to normalize the data (guide for bbnorm.sh from BBTools: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbnorm-guide/) and retry the assembly.
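If you just want to see whether less data helps before setting up BBTools, a cruder alternative is plain random subsampling of read pairs. To be clear, this is not what bbnorm.sh does (it normalizes by k-mer depth); it is only a minimal sketch, and the function names `read_fastq` and `subsample_pairs` are my own. It assumes uncompressed, 4-line-per-record FASTQ with R1 and R2 in the same order.

```python
import random


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from FASTQ lines.

    Assumes the simple 4-lines-per-record layout; a trailing partial record
    is ignored.
    """
    lines = iter(handle)
    while True:
        record = []
        for _ in range(4):
            line = next(lines, None)
            if line is None:
                return
            record.append(line.rstrip("\n"))
        yield tuple(record)


def subsample_pairs(r1_records, r2_records, fraction, seed=0):
    """Keep each read pair with probability `fraction`, keeping mates together."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    for rec1, rec2 in zip(r1_records, r2_records):
        if rng.random() < fraction:
            yield rec1, rec2
```

The fraction to keep would be roughly the target depth divided by the current depth. `read_fastq` accepts any iterable of lines, so an open file handle works as well as a list.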

ADD REPLY • link 4 months ago by GenoMax 148k

score 1 · Answer 1 · 2024-08-27

1

4 months ago

LChart 4.7k

One thing to do is to check that the sequences you have correspond to the library you sent, as errors in sample handling or demultiplexing do happen. If you run Kraken, BURST, or another k-mer classification program, do you get a majority of hits to non-phage genomes?
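For a quick look at the classification result, something like the following could summarize the top-level breakdown. This is a sketch that assumes a Kraken 2 style tab-separated report (percentage, clade read count, direct read count, rank code, taxid, indented name); other classifiers and versions may lay the file out differently. The function name `top_taxa` and the sample lines are invented for illustration, not real results.

```python
def top_taxa(report_lines, rank_code="D", n=5):
    """Return up to n (percent, name) tuples at the given rank, highest first.

    Assumes six tab-separated columns per line, with the rank code in the
    fourth column and the (space-indented) taxon name in the sixth.
    """
    rows = []
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        percent, rank, name = fields[0], fields[3], fields[5]
        if rank == rank_code:
            rows.append((float(percent), name.strip()))
    return sorted(rows, reverse=True)[:n]


# Invented example report, for illustration only.
example_report = [
    " 12.50\t1250\t1250\tU\t0\tunclassified",
    " 87.50\t8750\t10\tR\t1\troot",
    " 80.00\t8000\t5\tD\t2\t    Bacteria",
    "  7.00\t700\t3\tD\t10239\t    Viruses",
]

for percent, name in top_taxa(example_report):
    print(f"{percent:5.1f}%  {name}")
```

If a phage prep came back looking like this made-up example, mostly bacterial with few viral hits, host contamination or a sample mix-up would be the first thing to rule out. Keep in mind that novel phages are often poorly represented in reference databases, so a large unclassified fraction is not alarming on its own.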

ADD COMMENT • link 4 months ago by LChart 4.7k