Looking for a tool like fastq screen but for ONT data
6.4 years ago
Rox ★ 1.4k

Hello Biostars!

I am currently looking for a tool similar to fastq screen: https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/fastq_screen_documentation.html, which can roughly characterize genome composition (Did we sequence the right species? Do we have contamination in our sequences?) with nice graphs produced directly from a subsample of fastq reads (for example, showing the number of hits for several species).

I will probably try fastq screen, but its documentation specifies that the tool may be more suitable for short-read technologies, as it uses short-read aligners such as bowtie2. I thought maybe such a tool exists for longer and more error-prone reads. Or maybe a combination of a suitable aligner (like ngmlr, which is designed for long reads with higher error rates) and then another tool could do it?

Does anyone have any ideas or suggestions? I'll keep you updated on my own findings!

Cheers,
Roxane

gridion nanopore ont minion • 4.5k views

Interesting tool. So as I understand it, it's possible to specify a database of our choice, right?

What I meant was: is it possible to compare against genomes other than bacteria? I would like to be able to detect whether it's plant, bacterial, mammalian or fish DNA... something much more general. The MetaMaps output shows very high precision and recall, but at the genus and family level for bacteria. And I wonder whether the tool works outside the context of metagenomics.

I should think it's just a case of building a representative dataset for what you're interested in, as it is with my suggestion of Kraken. Since it's a new tool, there probably aren't many benchmark datasets outside of the authors' lab. Adam Phillippy and crew are nice though, so I'd just mail them and explain what you want to do!

Roxane Boyer: I would suggest using DIAMOND against nr, if you have enough compute resources available. You are unlikely to have more than a million long reads, so it should be a workable option.

That is a nice suggestion. Is it well adapted to long ONT reads, though? On their page they specify that it's faster than BLAST for short Illumina reads, but that may be different for ONT reads.

I was able to search nr using DIAMOND with ~1070 fastq sequences contained in an ONT data file. Reads ranged from 375 bp to 126,000 bp. I left all settings at default. It appears that DIAMOND reports a maximum of 25 hits per sequence. It took ~6 h. DIAMOND can also produce a SAM file.

I am going to try with a set of ONT reads I have. Will let you know.

Not sure if it would serve your purpose, but you might have a look at NanoPlot?

From what I understand of NanoPlot, it doesn't seem that it will help me answer the "What did I sequence?" question. I was looking more for a tool or method that can determine which organism the reads originate from, which would both check for contamination and characterize the genome (like a QC check). But thanks for the suggestion, I'll keep that tool in mind!

NanoPlot does not exactly do what OP is asking for :)

From the NanoPlot author :o)

I can imagine (using minimap2) it would be fairly easy to write a fastq-screen-for-long-reads. But then again, I should be writing my thesis instead.

I may give that task a try myself. Here is a "good luck" upvote for your thesis :)

Thanks!

I would approach it in Python, roughly similar to what I did for NanoLyse (which removes lambda reads from a fastq file): https://github.com/wdecoster/nanolyse/blob/master/nanolyse/NanoLyse.py#L101

I would use mappy, the Python API for minimap2, and just check "does an alignment exist for this read on that genome" and keep count of that.
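As a rough illustration of that idea, here is a minimal sketch, not an existing tool. The function names `load_aligners` and `screen_reads`, the `"no_hit"` bucket and the reference names are all hypothetical; only `mappy.Aligner`, its `.map()` method and `mappy.fastx_read` come from mappy itself. The counting logic accepts any object with a `.map(sequence)` method, so it can be tried without mappy installed.

```python
def load_aligners(references, preset="map-ont"):
    """Build one minimap2 index per reference FASTA.

    references: dict of label -> path to a FASTA file (hypothetical paths).
    """
    import mappy as mp  # imported lazily so screen_reads() works without mappy

    aligners = {}
    for name, fasta in references.items():
        aligner = mp.Aligner(fasta, preset=preset)
        if not aligner:
            raise RuntimeError(f"could not load or build an index for {fasta}")
        aligners[name] = aligner
    return aligners


def screen_reads(reads, aligners):
    """Count, per reference, how many reads have at least one alignment.

    reads: iterable of (read_name, sequence) pairs.
    aligners: dict of label -> object whose .map(sequence) yields hits.
    Returns (total_reads, counts); a read hitting several references is
    counted once for each of them, reads hitting none go to "no_hit".
    """
    counts = {name: 0 for name in aligners}
    counts["no_hit"] = 0
    total = 0
    for _read_name, seq in reads:
        total += 1
        matched = False
        for ref_name, aligner in aligners.items():
            # any() stops at the first hit: we only ask "does an alignment exist?"
            if any(True for _hit in aligner.map(seq)):
                counts[ref_name] += 1
                matched = True
        if not matched:
            counts["no_hit"] += 1
    return total, counts


# Possible usage (file names are placeholders):
#   import mappy as mp
#   aligners = load_aligners({"human": "human.fa", "ecoli": "ecoli.fa"})
#   reads = ((name, seq) for name, seq, _qual in mp.fastx_read("reads.fastq"))
#   total, counts = screen_reads(reads, aligners)
```

Dividing each count by the total gives the kind of per-genome percentages that fastq screen plots. Subsampling the reads first would keep the run time down.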

I was thinking of Python as well: since the main goal is to integrate this analysis into an existing Python workflow, that seemed to be the easiest solution. Even better if there is an API. Thanks for the reference to NanoLyse, I'll have a look :)

Let me know if you get stuck - I need some coding during my writing to remain sane.

Who doesn't ;) I'll keep you updated when I try something, then!

Did you guys manage to write a fastq-screen-for-long-reads based on minimap2? I am considering the same idea and wanted to make sure I was not reinventing the wheel.

Hi Benoit! Nope, in the end I had some other projects to finish before this one, so I did not have any time to think about this... So you won't be reinventing the wheel ;) I'll be very interested to test it if you are going to work on that! Cheers, Roxane

Hi Roxane, I've modified FastQ Screen so it now includes minimap2 as one of the alignment options, and can therefore process long-read data. I've submitted the code changes to the team maintaining FastQ Screen and I am hoping they are going to look into releasing it as the next version of FastQ Screen.

My bad indeed :/ I misunderstood OP's question (completely, even).

Do you expect specific contaminants, or are you just trying to do a survey of what is there? You could always use minimap with the expected genome and figure out what remains unaligned.
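One way to find the leftover reads, sketched here under some assumptions: the reads were aligned to the expected genome with minimap2 and the output saved in PAF format, where column 1 is the read name and column 6 the target name. The function names are hypothetical, and records whose target is `*` are treated as unmapped in case the PAF was written with unmapped reads included.

```python
def aligned_read_names(paf_lines):
    """Return the set of read names that have at least one PAF alignment."""
    names = set()
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or truncated lines
        if fields[5] == "*":
            continue  # target "*" marks a read reported as unmapped
        names.add(fields[0])  # PAF column 1 is the query (read) name
    return names


def unaligned_reads(all_read_names, paf_lines):
    """Return the read names, in input order, that never aligned."""
    aligned = aligned_read_names(paf_lines)
    return [name for name in all_read_names if name not in aligned]
```

The reads returned by `unaligned_reads` are the ones worth inspecting further, for example by searching them against a broader database.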

I'm not looking for a particular contaminant; indeed, it's more like a survey of the content of my reads. Minimap and miniasm do seem interesting, thanks!

6.4 years ago
Joe 21k

I'd run the reads and/or assembled contigs through Bracken/Kraken.

It's designed for metagenomic studies, but it will tell you the distributions of your reads amongst different taxa. If you've got 2 different genomes in there (one as a contaminant for example) it should stick out like a sore thumb.

Some info in this link for instance to get started:

https://www.microbe.net/2017/04/27/why-use-bracken-instead-of-kraken/
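To see which taxa "stick out", a small sketch like the one below could summarize a Kraken-style report. It assumes the standard six-column, tab-separated report layout (percentage, reads in clade, reads assigned directly, rank code, NCBI taxid, indented name); the function name `top_taxa` and its thresholds are hypothetical.

```python
def top_taxa(report_lines, rank="S", min_percent=1.0):
    """List taxa of one rank holding at least min_percent of the reads.

    Returns (name, taxid, clade_reads, percent) tuples, highest percent first.
    """
    rows = []
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # not a standard six-column report line
        percent, clade_reads, _direct, rank_code, taxid, name = fields[:6]
        if rank_code != rank:
            continue
        if float(percent) >= min_percent:
            rows.append((name.strip(), int(taxid), int(clade_reads), float(percent)))
    return sorted(rows, key=lambda row: row[3], reverse=True)
```

With `rank="S"` this lists species; a contaminant genome would show up as a second entry with a non-trivial percentage.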

Do you know if these can use long reads?

I’m not 100% sure, but they can bin contigs, so it should work or at least be kinda coercible.

That was a nice suggestion, but I feel like it won't be suited to my purpose, as it's a metagenomics-oriented tool that will only work for bacterial DNA, right?

I assume you can replace the target database with your sequences of interest.

Ah yes, I believe it will only work for bacteria 'out of the box'. No organism was specified so I thought it was worth suggesting ;)

You may be able to modify/expand on the approach however.

Yeah, I did not specify, and as I said, it was a good suggestion anyway because I did not know the tool :) I'm surveying all the tools to think about a nice approach, and I'll keep that one in mind for sure. The thing is that I don't really know yet what my sequences of interest are... I need to narrow down my question!

Update:

I spoke to a few people including one of the authors who had this to say:

We use Kraken to filter contaminants in a lot of our projects as well. It's actually the program I used for filtering contaminant sequences out of the eukaryotic draft genomes in my latest paper. Other people in the lab use it in assembly projects to filter potential contaminating bacterial sequences. We also use it to remove any non-informative vector or human sequences from our samples when working on diagnoses.

With regard to building a database:

Yes. It would be easiest to just build a database of everything you want to exclude from your sample and then take all unclassified reads to the next step, but you can also just use any database and exclude sequences that are classified as particular taxa, but that would require writing another script to parse those out.

The first option can be achieved by kraken/kraken2 itself using the --unclassified-out flag, I think.
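For the second option, the extra script could look roughly like the sketch below. It assumes Kraken's standard per-read output (tab-separated: `C`/`U` status, read ID, taxid, length, LCA mapping) produced without name substitution in the taxid column. The function name `reads_to_keep` is hypothetical, and the match is on exact taxids only: reads assigned to a descendant of an excluded taxon are not caught unless that descendant's taxid is listed too.

```python
def reads_to_keep(kraken_lines, excluded_taxids):
    """Return IDs of reads not classified as one of the excluded taxids.

    Unclassified reads ("U") are always kept.
    """
    excluded = {str(taxid) for taxid in excluded_taxids}
    keep = []
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status, read_id, taxid = fields[0], fields[1], fields[2]
        if status == "C" and taxid in excluded:
            continue  # classified as a taxon we want to drop
        keep.append(read_id)
    return keep
```

The resulting list of IDs can then be used to subset the original fastq file with whatever tool the workflow already uses.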
