Question

Genome sequenced [%] over time graph (Nanopore)

0

Entering edit mode

20 months ago

j.zeidler • 0

Hi there,

I have sequencing data from a Nanopore run which I used to create a de novo assembly using Flye. Now I want to plot the genome coverage (in %) over the run duration (hh:mm:ss). Can anyone suggest a tool for creating such a graph?

Thanks!

flye Nanopore graph genome assembly • 1.4k views

ADD COMMENT • link updated 20 months ago by WouterDeCoster 47k • written 20 months ago by j.zeidler • 0

0

Entering edit mode

I am not sure how you would do that unless you are simply going to plot the total amount of cumulative data produced after each hour and calculate fold coverage.

Take a look at PycoQC (LINK) which has a graph that shows amount of sequence produced over the length of the run.

ADD REPLY • link 20 months ago by GenoMax 147k

0

Entering edit mode

the genome coverage (in %)

With this, you mean how many bases in your assembly are covered by at least 1 read? i.e. if the first 20kb read aligns with your 2Mb assembly you are at 1%? Do you care about read clipping?

This is a bit complicated. Lots easier would be to get a plot of gigabases sequenced vs time (cumulative yield plot).

ADD REPLY • link 20 months ago by WouterDeCoster 47k

0

Entering edit mode

That´s exactly what I´m thinking of. Since the clipped regions are excluded from the assembly I think they also should be excluded from the genome coverage graph.

I know plotting the cumulative yield should be much easier (btw: is there a tool that outputs this kind of data as a table?), but e.g. for data from a metagenomic sequencing run, it would be interesting to illustrate the coverage of each genome over time.

ADD REPLY • link 20 months ago by j.zeidler • 0

3

Entering edit mode

Can you write some python to get this done? I could be able to help, but don't have too much free time.

Basically, I would parse and load both your fastq and bam files in a pandas dataframe and join them on the read id. This would give you, for every read ID, its sequencing start_time and its alignment start and stop. I would initialize a list or numpy array with zeroes, for the entire length of your chromosome. After sorting your dataframe on start_time you can iterate over the reads, and for every read that is aligned you change the corresponding positions in your coverage array to ones. After every read you sum the array and divide that sum by your contig length to get the coverage at that timepoint.

Hope that helps.

ADD REPLY • link 20 months ago by WouterDeCoster 47k

0

Entering edit mode

The code in this repository might be a good starting point: https://github.com/wdecoster/nanoretrotect

Basically, you need to combine mapping information with the time stamps you have in either the fastq or summary file(s).

ADD REPLY • link 20 months ago by WouterDeCoster 47k

0

Entering edit mode

Thanks a lot for your help! I think this is a good starting point. I will try to implement this in Python and keep you posted.