Identifying my WGS isolates
8.2 years ago

Hi

Assembly rookie here. I have been BLASTing some of my contigs from a de novo assembly I did, and the results are coming back with very poor query coverage. Since this is an environmental isolate, I guess it's possible that there aren't many whole genome records. I tested contigs of varying sizes (largest 338 075 nt with total read count of 35 131 and coverage of 26; down to smaller contigs in the 1500 nt range with total read count of 1647). What I want to know is what is the chance that my trouble with identification is due to a) contamination in the DNA; or b) that my assembly is so shitty BLAST can't make a match?

Thanks in advance

next-gen blast Assembly • 1.5k views

Have you taken a few of the original reads and blasted them to make sure they show reasonable/expected hits?
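One way to pull "a few of the original reads" for that check is to subsample the FASTQ and write the picks out as FASTA, which can then be pasted into BLAST. This is only a minimal sketch: it assumes plain 4-line FASTQ records (no wrapped sequences), and the function names and the `reads.fastq` file name mentioned below are made up for illustration.

```python
import random
from itertools import islice


def sample_fastq_records(lines, k, seed=0):
    """Reservoir-sample up to k (header, sequence) pairs from FASTQ lines.

    Assumes each record is exactly 4 lines: header, sequence, '+', qualities.
    """
    rng = random.Random(seed)
    reservoir = []
    it = iter(lines)
    n_seen = 0
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            break  # end of input (or a truncated final record)
        header, seq = record[0].strip(), record[1].strip()
        n_seen += 1
        if len(reservoir) < k:
            reservoir.append((header, seq))
        else:
            # Keep each new record with probability k / n_seen.
            j = rng.randrange(n_seen)
            if j < k:
                reservoir[j] = (header, seq)
    return reservoir


def to_fasta(records):
    """Format (header, sequence) pairs as FASTA text."""
    out = []
    for header, seq in records:
        name = header[1:] if header.startswith("@") else header
        out.append(f">{name}\n{seq}\n")
    return "".join(out)
```

For example, `to_fasta(sample_fastq_records(open("reads.fastq"), 20))` would give 20 randomly chosen reads in FASTA format. Reservoir sampling is used so the whole file never has to sit in memory.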


When you have nice, long contigs that don't BLAST to anything, it's likely that it's a novel organism (at least, distant compared to anything in your BLAST database). Metagenomes usually have low coverage, and assemblers don't tend to output random junk as long contigs. Contamination is more likely to give you good BLAST hits than the metagenomic target, since the same contamination tends to be seen in lots of labs. Thus, contamination is not a very likely explanation for this outcome.

Oops, I misread and thought this was a metagenome rather than an isolate. If the species you expect is in your BLAST database, this is not the species you expect.


It depends on how you are defining contamination. If you think you are working with species A, but your assembly gives you species B, that could be considered contamination even if it's a great assembly. It's not clear if that's the case here.

8.2 years ago
igor 13k

You can try running all your raw reads through a metagenomic classifier. There are a few of those out there.

These will quickly classify a lot of reads and show you if you have contamination. Ideally, most of them will be from a similar source.
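To see whether "most of them" really come from a similar source, the per-read assignments can be tallied. The sketch below assumes a hypothetical tab-separated format of `read_id<TAB>taxon`, one read per line; real classifiers each have their own output layout, so the column index would need adjusting. The function name and the `min_fraction` cutoff are also just illustrative.

```python
from collections import Counter


def summarize_taxa(lines, min_fraction=0.01):
    """Count reads per taxon and return (taxon, count, fraction) tuples.

    Assumes each line is 'read_id<TAB>taxon' (a hypothetical format).
    Taxa below min_fraction of all classified reads are dropped.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        counts[fields[1]] += 1
    total = sum(counts.values())
    return [
        (taxon, n, n / total)
        for taxon, n in counts.most_common()
        if n / total >= min_fraction
    ]
```

For an isolate, one taxon (or a few closely related ones) should account for the bulk of the reads; a second, unrelated taxon with a sizeable fraction would point towards contamination.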

8.2 years ago
Tonor ▴ 480

I would also try DIAMOND: https://github.com/bbuchfink/diamond

It runs a blastx-style search: it translates your DNA contigs/reads and searches them against a protein database rather than doing a nucleotide blastn. As Brian said, when you have large, good-quality contigs that don't hit anything, it tends to imply a novel organism. Protein space is better to use in this case, as protein sequences are more conserved between related species than nucleotide sequences, so even though a blastn may give you no significant hits, blastx could well give you some hits to related organisms.
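To make the "translates your DNA" part concrete, here is a rough sketch of a six-frame translation, which is conceptually what a blastx-style search does to each query before comparing it to proteins. It is only an illustration and not how DIAMOND is implemented; it assumes the standard genetic code, and the function names are made up.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate DNA codon by codon; unknown codons (e.g. with N) become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Translate a sequence in all three forward and three reverse frames."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

Because several codons encode the same amino acid, two related genes can differ at many nucleotide positions and still give similar translations, which is why the protein-level search can find hits that blastn misses.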
