Hello Biostar,
I have a question that I have been unable to answer myself for several weeks now.
I would like to add a new QC analysis to my pipeline (ONT data) that would be able to both:
- detect whether our sequences contain any contaminants (bacteria? virus? human DNA?)
- detect whether the sequences belong to the species we expected to sequence (for example, if the sequenced DNA is from a European perch, the expected result would be that my sequences map to a fish reference genome)
For that, the approach I can think of is to compare my reads to a set of reference genomes. My idea was to pick, for each category, a genome that is "common" enough (I know that doesn't really mean anything) that my reads map well to it. For example, I would like to use one virus genome to detect any kind of viral contamination, and likewise one human genome, one bacterial genome, one fish genome, etc. I don't want to use a gigantic database with a bit of everything, because I don't want something too heavy to align against and because I don't need that precision.
I know this is not the best approach at all, because there is no such thing as a reference genome for all viruses or for all bacteria. But I thought that for my simple purpose (I am not trying to identify the contaminant exactly, and I don't want to retrieve the contaminated reads either) it could be good enough. But I struggle to know which reference genomes I should use for optimal results. Escherichia coli for bacteria? Drosophila melanogaster for insects? The other option I was thinking of is to create a hybrid FASTA for each category I would like to detect, each containing 4 or 5 different species for that category (5 insects for insects, 5 bacteria for bacteria...).
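A minimal sketch of the hybrid FASTA idea described above, assuming plain (uncompressed) FASTA inputs. The function name `build_hybrid_fasta`, the category labels, and the `category|` header prefix are hypothetical choices for illustration; the prefix only serves to make it easy to tally mapped reads per category afterwards.

```python
from typing import Dict, List


def build_hybrid_fasta(genomes_by_category: Dict[str, List[str]], out_path: str) -> int:
    """Concatenate FASTA files into one reference, tagging each header
    with its category (e.g. '>bacteria|contig_1 ...').

    Returns the number of sequences written.
    """
    n_written = 0
    with open(out_path, "w") as out:
        for category, fasta_paths in genomes_by_category.items():
            for fasta_path in fasta_paths:
                with open(fasta_path) as handle:
                    for line in handle:
                        if line.startswith(">"):
                            # Prefix the original header with the category label.
                            out.write(f">{category}|{line[1:].rstrip()}\n")
                            n_written += 1
                        elif line.strip():
                            # Copy sequence lines, skipping blank lines.
                            out.write(line.rstrip() + "\n")
    return n_written
```

After mapping against such a file, the category can be recovered from each reference name by splitting on the first `|`.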
What do you think about my ideas? Do you see any major drawbacks that would make such an approach unsuitable for the analysis I'm trying to do? Do you have any other suggestions I couldn't think of? Thanks a lot for your advice!
Cheers,
Roxane
This tool is awesome... I'm running a lot of tests with it, and it's a fast and smart approach... Thanks a lot for recommending it to me!
It's IMO by far the most overlooked bioinfo tool of the last few years.
I can understand why! I still need to run a few more tests to make it fit my purpose, though. For now, a mash dist between the whole of RefSeq and a FASTQ of fish sequences gives mammalian genomes as the best matches, ahead of fish genomes... If you have any experience with the parameters I should use to build the sketch or to measure the distance, your advice would be welcome!
Are you sure your FASTQ includes only fish DNA? Run mash dist on individual reads and maybe you'll see that some of them are fish and others something else. I've built my own RefSeq bacterial genomes DB with -k 21 and -s 5000.
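As a rough illustration of the workflow above, the following Python sketch builds a `mash sketch` command line with the `-k 21 -s 5000` settings mentioned, and ranks the lines of `mash dist` output by distance. The helper names `sketch_command` and `best_hits` are hypothetical, and the parser assumes the usual tab-separated output (reference ID, query ID, distance, p-value, shared hashes); nothing is executed here.

```python
import shlex
from typing import List, Tuple


def sketch_command(fasta_paths: List[str], out_prefix: str, k: int = 21, s: int = 5000) -> str:
    """Return a `mash sketch` command line as a string (not executed here)."""
    args = ["mash", "sketch", "-k", str(k), "-s", str(s), "-o", out_prefix, *fasta_paths]
    return " ".join(shlex.quote(a) for a in args)


def best_hits(mash_dist_output: str, top_n: int = 5) -> List[Tuple[str, str, float]]:
    """Parse tab-separated `mash dist` output and return the closest references.

    Assumed columns: reference ID, query ID, distance, p-value, shared hashes.
    """
    hits = []
    for line in mash_dist_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        ref, query, dist = line.split("\t")[:3]
        hits.append((ref, query, float(dist)))
    # Smallest distance first.
    return sorted(hits, key=lambda h: h[2])[:top_n]
```

To look at reads one by one, Mash's `-i` option sketches each sequence in a file separately rather than the file as a whole, so every read gets its own row in the distance table.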
I therefore have another question: don't you think that this tool, which kinda compresses a sequence down to its most representative k-mers, would be strongly affected by the sequencing technology used? Both for the raw reads and for the resulting assemblies. Raw PacBio/ONT reads are longer but more error-prone; raw Illumina reads are short but high quality. Are there any studies comparing sequences from the same species generated with different sequencing technologies?
I'm kinda afraid of the impact of the high error rate of my ONT reads.
What you are doing is only a qualitative analysis, correct? This isn't in the category of being 100% sure about what all is in there. For the former purpose this should be adequate. Exclude singleton k-mers and it should be fine.
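To show why dropping singleton k-mers helps with noisy reads, here is a toy bottom-s MinHash sketch in Python. It is a simplified illustration, not Mash's implementation: it uses a BLAKE2 hash from the standard library and does not canonicalise k-mers against their reverse complement. The names `sketch`, `mash_distance`, and `min_copies` are hypothetical. The distance conversion is the Mash formula D = -ln(2j / (1 + j)) / k, where j is the estimated Jaccard index.

```python
import hashlib
import math
from collections import Counter
from typing import Iterable, Set


def kmers(seq: str, k: int) -> Iterable[str]:
    """Yield every overlapping k-mer of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def sketch(seqs: Iterable[str], k: int = 21, s: int = 1000, min_copies: int = 1) -> Set[int]:
    """Bottom-s MinHash sketch of the k-mers seen at least `min_copies` times.

    min_copies=2 excludes singleton k-mers, which in raw reads are
    mostly the product of sequencing errors.
    """
    counts = Counter(km for seq in seqs for km in kmers(seq.upper(), k))
    hashes = sorted(
        int.from_bytes(hashlib.blake2b(km.encode(), digest_size=8).digest(), "big")
        for km, c in counts.items() if c >= min_copies
    )
    return set(hashes[:s])


def mash_distance(a: Set[int], b: Set[int], k: int, s: int) -> float:
    """Estimate Jaccard from the s smallest hashes of the union, then convert."""
    union_bottom = sorted(a | b)[:s]
    shared = sum(1 for h in union_bottom if h in a and h in b)
    j = shared / len(union_bottom) if union_bottom else 0.0
    if j == 0.0:
        return 1.0
    return -math.log(2 * j / (1 + j)) / k
```

With enough coverage, true k-mers are seen several times while most error k-mers are seen once, so a sketch built with `min_copies=2` stays closer to the sketch of the underlying genome.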
Thanks, I'll have a look at it, but as I said to genomax, I'm not really asking for advice on a tool, but rather on how I should build my own small reference database... ;(
I don't think there is a good/correct answer for that. No matter what you select you will likely miss some other thing. It will depend on what you are comfortable with.
Does the idea of constructing a hybrid FASTA file containing several genomes for a given target I want to identify seem shocking to you? I realise it may be the main point of my post in the end.
No. But you are again likely to miss many things by cherry picking.
Instead of doing alignments you could use sketch/hash as referred to by @5heikki with RefSeq (that should be comprehensive). BBMap has tools to do that kind of search as well. More than that, these Mash databases can be really small, e.g. all RefSeq in less than 100 MB.