Hello Biostars!
I am currently looking for a tool similar to fastq screen: https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/fastq_screen_documentation.html, which is able to roughly characterize genome composition (Did we sequenced the right species? Do we have contamination in our sequences?) with nice graph directly from a subset sample of fastq reads (like showing the amount of hits for several species and such).
I will probably try fastq screen, but it is specified that the tool may be more suitable for short reads technologies as it use short read aligner such as bowtie2. I thought maybe such a tool exist for longer and more erroneous reads. Or maybe a combination of a suited aligner (like ngmlr for long reads and higher error rate awareness) and then an other tool could do it?
Does anyone have any ideas or suggestions? I'll keep you updated on my own findings!
Cheers,
Roxane
Take a look at https://www.biorxiv.org/content/early/2018/07/20/372474
Interesting tool. So as I understand, it's possible to specify a database of our choice right ?
What I meant was : is it possible to use as compared to genome others stuff than bacteria ? I would like to be able to detect like is it's plant DNA, bacteria, mamals, fish... Something way more general. The output of metamaps show very high precision and recall but on like genus and family of bacteria. And I wonder if the tools works outside the context of metagenomic.
I should think it's just a case of building a representative dataset for what you're interested in as it is with my suggestion of Kraken. Since its a new tool there probably aren't many benchmarks datasets outside of the authors lab. Adam Phillipy and crew are nice though, so I'd just mail them and explain what you want to do!
Roxane Boyer : I would suggest using DIAMOND against
nr
, if you have enough compute resources available. Your long reads are not going to be more than a million so it should be a workable option.That is a nice suggestion. Is it well adapted to long ONT reads tho ? On their page they specify that it's faster than blast in the case of short Illumina reads, but must be different for ONT reads.
I was able to search
nr
using DIAMOND with ~1070 fastq sequences contained in a ONT data file. Reads ranged from 375 bp to 126000 bp. I left all settings at default. It appears that DIAMOND reports a max of 25 hits per sequence. It took about ~6 h. DIAMOND can make a SAM file.I am going to try with a set of ONT reads I have. Will let you know.
not sure if it would serve your request but you might have a look at NanoPlot ?
From what I understand from NanoPlot, it doesn't seems it will help me to answer to the "What did I have sequenced" question. I was more looking for a tool or a method that can determinate what organisme the reads originate from, which will both check for contamination and genome characterization (like a QC check). But thanks for the suggestion I'll keep that tool in mind !
NanoPlot does not exactly do what OP is asking for :)
From the NanoPlot author :o)
I can imagine (using minimap2) it would be fairly easy to write a fastq-screen-for-long-reads. But then again, I should be writing my thesis instead.
I may give a try to that task myself Here is a "good luck" upvote for your thesis :)
Thanks!
I would approach it in Python, roughly similar to what I did for NanoLyse (which removes lambda reads from a fastq file): https://github.com/wdecoster/nanolyse/blob/master/nanolyse/NanoLyse.py#L101
I would use the python API for minimap2, mappy, and just check "does an alignment exist for this read on that genome" and keep count of that.
Was thinking with python as well, as the main goal is to integrate this analysis into an existing python workflow that seemed to be the easiest solution indeed. Even better if there is an API. Thanks for the reference of NanoLyse, I'll have a look :)
Let me know if you get stuck - I need some coding during my writing to remain sane.
Who doesn't ;) I'll keep you updated when I'll try something then !
Did you guys manage to write a fastq-screen-for-long-reads based on minimap2? I am considering the same idea and wanted to make sure I was not reinventing the wheel.
Hi Benoit ! Nope, in the end I had some others projects I had to finish before this one, so I did not had any time to think about this... So you won't be reinventing the wheel ; ) I'll be very interested to test it if you are going to work on that ! Cheers, Roxane
Hi Roxane, I've modified FastQ Screen so it now includes minimap2 as one of the alignment options, and can therefore process long-read data. I've submitted the code changes to the team maintaining FastQ Screen and I am hoping they are going to look into releasing it as the next version of FastQ Screen.
my bad indeed :/ , understood OP's question wrongly (completely wrong even)
Do you expect specific contaminants or are you just trying to do a survey of what is there? You could always use
minimap
with the expected genome and figure out what remain unaligned.I'm not looking for a particular contaminant, indeed it's more like a survey of my reads content. Minimap and miniasm seems indeed interesting thanks !