Question

How To Deal With Un-Used Reads After De Novo Assembly?

9

13.1 years ago

Lhl ▴ 760

Hi All,

I have been trying to combine all genomic resources produced by different sequencing platforms in our lab and assemble them into contigs.

Since we do not have a reference genome for our species, we did a de novo assembly.

After finishing the assembly, we still have a lot of unused reads (~30 GB).

It doesn't seem reasonable to simply discard them, but I do not know what I should do to take advantage of them.

I am wondering whether any of you have had a similar experience and know how to deal with it.

Thanks in advance for your valuable suggestions and discussions!

denovo assembly read • 5.9k views

ADD COMMENT • link updated 12.7 years ago by Michael 55k • written 13.1 years ago by Lhl ▴ 760

score 5 · Answer 1 · 2011-11-05

In addition to contamination I would consider these possibilities:

Reads from highly repetitive regions. These can cause problems with the assembly, and the software might have removed them in preprocessing. Repetitive sequences can be checked with a low-complexity filter or a repeat finder (e.g. DUST, RepeatMasker).
Low-quality reads. Check the base qualities and get some statistics about the reads, e.g. with the FastQC tool (maybe start with a subset).
Singleton reads, e.g. from low-coverage regions. If a read does not overlap with anything else, it cannot be assembled.
Reads contaminated with vector sequences. Check with BLAST against a vector database.
Contamination of the sample with a microorganism (as ALchEmiXt suggested). However, if the contaminant genome is small compared to the target, the contaminant reads might assemble better than the target genome. A BLAST search against NT might indeed help.
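As a rough first pass over the first two points, something like the sketch below could sort a subset of the unused reads into low-quality, low-complexity and "other". It is only an illustration: it assumes Phred+33 FASTQ with four lines per record, it uses k-mer entropy as a simple stand-in for a real low-complexity filter such as DUST, and the function names and thresholds (`min_quality`, `min_entropy`) are made up for this example and would need tuning.

```python
import math
from collections import Counter


def parse_fastq(lines):
    """Yield (name, sequence, quality) from FASTQ lines, assuming 4 lines per record."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip stray blank lines between records
        seq = next(it, "").strip()
        next(it, "")  # '+' separator line
        qual = next(it, "").strip()
        yield header.strip()[1:], seq, qual


def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string, assuming Phred+33 encoding."""
    return sum(ord(c) - offset for c in qual) / len(qual) if qual else 0.0


def kmer_entropy(seq, k=3):
    """Shannon entropy (bits) of the k-mer composition; low values hint at low complexity."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    return -sum((n / total) * math.log2(n / total) for n in Counter(kmers).values())


def classify_reads(lines, min_quality=20.0, min_entropy=3.0):
    """Count reads per category; both thresholds are illustrative assumptions."""
    counts = Counter()
    for _, seq, qual in parse_fastq(lines):
        if mean_phred(qual) < min_quality:
            counts["low_quality"] += 1
        elif kmer_entropy(seq) < min_entropy:
            counts["low_complexity"] += 1
        else:
            counts["other"] += 1
    return counts
```

Calling `classify_reads(open("unused_subset.fastq"))` (the file name is hypothetical) gives a quick idea of which category dominates before running the heavier tools.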

It is a bit hard to speculate further without knowing more details.

Edit: Another idea would be to try a 'meta-genomics' approach on the leftover reads if you suspect contamination, e.g. BLAST them against NT or NR and use MEGAN to classify the reads.
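Before going all the way to MEGAN, a quick tally of the best hit per read can already show whether one contaminant dominates. The sketch below is not what MEGAN does (MEGAN assigns reads to taxa; this only counts subject IDs). It assumes the default 12-column tabular BLAST+ output (`-outfmt 6`), where column 1 is the query, column 2 the subject and column 12 the bit score; the function names are hypothetical.

```python
from collections import Counter


def best_hits(blast_lines):
    """Map each query ID to its (subject ID, bit score) with the highest bit score."""
    best = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def tally_best_hits(blast_lines):
    """Count how many reads have each subject as their best hit."""
    return Counter(subject for subject, _ in best_hits(blast_lines).values())
```

`tally_best_hits(open("unused_vs_nt.tsv")).most_common(20)` (file name hypothetical) then lists the subjects that attract the most reads.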

Edit 2: I have been looking a bit more into the validation of assemblies and came across the AMOS genome assembly validation tool. Their website has a very relevant quote that supports the contamination hypothesis but also raises an additional interesting point:

Unused read information - Not all reads provided as input to an assembler are used in the final assembly. The unused reads, also called singletons, are often contaminants or insufficiently trimmed reads from the genome. Mis-assemblies, however, also lead to the presence of unused reads, as they are inconsistent with the chosen reconstruction of the genome. As an example, the reads spanning the join point of two copies of a tandem repeat are listed as singletons when the assembler incorrectly collapses this repeat. By aligning the singletons to the contigs produced by the assembler we can identify such misassemblies.
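The idea in that quote could be approximated without AMOS, although this is only a sketch under several assumptions and not how AMOS itself works: assume the unused reads were aligned to the contigs with a local aligner that soft-clips and writes SAM, then count where the soft clips start. Positions where many unused reads are clipped at the same contig coordinate are candidates for a collapsed repeat or a bad join. The function name and the `min_clip` threshold are hypothetical.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the contig


def clip_breakpoints(sam_lines, min_clip=10):
    """Count soft-clip breakpoints per (contig, 1-based position) from SAM lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        flag, contig, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
        if flag & 4 or cigar == "*":
            continue  # unmapped read
        # Drop hard clips so that soft clips are the outermost operations.
        ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
        if not ops:
            continue
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        if ops[0][1] == "S" and ops[0][0] >= min_clip:
            counts[(contig, pos)] += 1  # clip starts left of the first aligned base
        if ops[-1][1] == "S" and ops[-1][0] >= min_clip:
            counts[(contig, pos + ref_len - 1)] += 1  # clip starts after the last aligned base
    return counts
```

Sorting the result with `.most_common()` puts the most suspicious contig positions first; those are the places worth inspecting by eye.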

I haven't tried AMOS yet but possibly will do so soon.

Ram · Answer 2 · 2011-11-06

2

13.1 years ago

Jeremy Leipzig 22k

I have found that crucial reads can be held hostage in spurious contigs that go nowhere when you use an insufficient coverage cutoff. Raising the cutoff can break up these bad contigs, freeing their reads to join good contigs, which in turn recruit more reads.

[Figure not available]

By the way, reads from repetitive regions such as retrotransposons generally do make it into assemblies (into small contigs of incredibly high depth), but they destroy any possibility of determining gene order. That is why people (especially on the plant side) still sometimes rely on Sanger sequencing to "sequence through repeats", i.e. to connect the regions flanking repeats.

ADD COMMENT • link 13.1 years ago by Jeremy Leipzig 22k

0

This is informative. Many Thanks, Jeremy.

ADD REPLY • link 13.1 years ago by Lhl ▴ 760

0

How did you get the different bubbles in ggplot? Do you mind sharing the command? Thanks!

ADD REPLY • link 13.1 years ago by Curiosity ▴ 130

0

http://code.google.com/p/standardized-velvet-assembly-report/source/browse/trunk/refReport.Rnw

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 13.1 years ago by Jeremy Leipzig 22k

score 1 · Answer 3 · 2011-11-05

1

13.1 years ago

Herefordguy ▴ 10

In addition to the other valuable comments, have you evaluated a different assembler? Some assemblers (ALLPATHS-LG, MSR-CA, etc.) do a better job of error correction than others, and thus use more of the data.

ADD COMMENT • link 13.1 years ago by Herefordguy ▴ 10

0

To date, I have only tried Ray and Velvet. Both yield lots of unused reads. I will be happy to try other assemblers and see if they work out better. Thanks for your suggestion!

ADD REPLY • link 13.1 years ago by Lhl ▴ 760

score 1 · Answer 4 · 2011-11-07

1

13.1 years ago

ALchEmiXt ★ 1.9k

In addition to Michael's answer, you might want to check for badly clipped sequences, for instance adapters at the wrong locations. Illumina mate-pair PE libraries are famous for those artefacts.

We routinely check against a set of bowtie genome indices, including vector (as suggested) but also a DB with commonly used adapters. You'll be surprised... in a bad way.

Have a look at, for instance, the fastq_screen tool.
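To get a feeling for where adapters sit in the reads before setting up a full screen, a naive exact-match scan like the one below may be enough. This is not how fastq_screen works (it aligns reads with bowtie); it is just a sketch. The single adapter listed is the commonly used Illumina adapter prefix and serves only as an example, so swap in the adapters of your own library prep. It only finds exact forward-strand matches, and the names `ADAPTERS`, `adapter_hits` and `offset_histogram` are made up here.

```python
from collections import Counter

# Example entry only; replace with the adapters used for your own libraries.
ADAPTERS = {"illumina_universal": "AGATCGGAAGAGC"}


def adapter_hits(reads, adapters=ADAPTERS):
    """Yield (read_index, adapter_name, offset) for the first exact match of each adapter."""
    for i, seq in enumerate(reads):
        seq = seq.upper()
        for name, adapter in adapters.items():
            offset = seq.find(adapter)
            if offset != -1:
                yield i, name, offset


def offset_histogram(reads, adapters=ADAPTERS):
    """Count adapter start positions; offsets near 0 or mid-read point to clipping problems."""
    return Counter((name, offset) for _, name, offset in adapter_hits(reads, adapters))
```

Adapter hits piling up at offset 0 or in the middle of the reads, rather than only towards the 3' end, would support the badly-clipped explanation.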

ADD COMMENT • link 13.1 years ago by ALchEmiXt ★ 1.9k

1

@lhl: Have a look at the link for fastq_screen; it details how to handle it. We have bowtie indices for many genomes on the server anyway and allow the user to select the databases to screen against in their particular case. Separately, we have UniVec, PhiX (including an extra region spanning the origin) and an adapter DB. Those adapter sequences can be retrieved from that link as well (have a look in the readme). Otherwise PM me and I can send it. It's quite small.

ADD REPLY • link 13.1 years ago by ALchEmiXt ★ 1.9k

0

Hi ALchEmiXt,

This is helpful.

A quick question about the sequence databases against which I will search my raw reads: should it be a database containing all bacterial nucleotide sequences plus adaptor sequences? And how do I get all the adaptor sequences?

Thanks a lot!

ADD REPLY • link 13.1 years ago by Lhl ▴ 760

0

Thanks very much, ALchEmiXt. Cheers -- lhl

ADD REPLY • link 13.1 years ago by Lhl ▴ 760

score 0 · Answer 5 · 2011-11-05

0

13.1 years ago

ALchEmiXt ★ 1.9k

Did you check a subset of the non-assembled reads to see what they are (e.g. by BLAST)? Even though no reference is available, you might not be the first to be surprised by a contaminating yeast or bacterium. If that is the case, the answer is simple: trash it.
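With ~30 GB of leftover reads, drawing the subset is the first hurdle. One possible way, sketched below, is reservoir sampling: it makes a single pass over the FASTQ and keeps a uniform random sample of `n` reads in memory, written out as FASTA for BLAST. It assumes four lines per FASTQ record; the function name and the `seed` default are choices made for this example.

```python
import random


def sample_fastq_to_fasta(lines, n, seed=0):
    """Return FASTA text for a uniform random sample of n reads (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []  # holds at most n (name, sequence) pairs
    record = []
    seen = 0
    for line in lines:
        if not line.strip() and not record:
            continue  # ignore blank lines between records
        record.append(line.rstrip("\n"))
        if len(record) < 4:
            continue
        name, seq = record[0][1:], record[1]
        record = []
        seen += 1
        if len(reservoir) < n:
            reservoir.append((name, seq))
        else:
            # Replace a random slot with probability n / seen.
            j = rng.randrange(seen)
            if j < n:
                reservoir[j] = (name, seq)
    return "".join(f">{name}\n{seq}\n" for name, seq in reservoir)
```

Writing the result of `sample_fastq_to_fasta(open("unused.fastq"), 10000)` to a file (file name hypothetical) gives a query set small enough to BLAST in reasonable time.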

ADD COMMENT • link 13.1 years ago by ALchEmiXt ★ 1.9k

0

Hi ALchEmiXt,
Yes, I already ran BLAST to remove potential contamination! But thanks for your response! Cheers

ADD REPLY • link 13.1 years ago by Lhl ▴ 760

0

If it is contamination, the reads should assemble into contigs anyway. Am I wrong?

ADD REPLY • link 13.1 years ago by Frédéric Bigey ▴ 310

0

If contaminants introduce ambiguity into the graph they can disturb an assembly.

ADD REPLY • link 13.1 years ago by Jeremy Leipzig 22k

0

I did an assembly without trimming the contamination, and I did get some bacterial contigs. In fact, they are quite long!

ADD REPLY • link 13.1 years ago by Lhl ▴ 760