Entering edit mode
20 months ago
j.zeidler
•
0
Hi there,
I have sequencing data from a Nanopore run which I used to create a de novo assembly using Flye. Now I want to plot the genome coverage (in %) over the run duration (hh:mm:ss). Can anyone suggest a tool for creating such a graph?
Thanks!
I am not sure how you would do that unless you are simply going to plot the total amount of cumulative data produced after each hour and calculate fold coverage.
Take a look at
PycoQC
(LINK) which has a graph that shows amount of sequence produced over the length of the run.With this, you mean how many bases in your assembly are covered by at least 1 read? i.e. if the first 20kb read aligns with your 2Mb assembly you are at 1%? Do you care about read clipping?
This is a bit complicated. Lots easier would be to get a plot of gigabases sequenced vs time (cumulative yield plot).
That´s exactly what I´m thinking of. Since the clipped regions are excluded from the assembly I think they also should be excluded from the genome coverage graph.
I know plotting the cumulative yield should be much easier (btw: is there a tool that outputs this kind of data as a table?), but e.g. for data from a metagenomic sequencing run, it would be interesting to illustrate the coverage of each genome over time.
Can you write some python to get this done? I could be able to help, but don't have too much free time.
Basically, I would parse and load both your fastq and bam files in a pandas dataframe and join them on the read id. This would give you, for every read ID, its sequencing start_time and its alignment start and stop. I would initialize a list or numpy array with zeroes, for the entire length of your chromosome. After sorting your dataframe on start_time you can iterate over the reads, and for every read that is aligned you change the corresponding positions in your coverage array to ones. After every read you sum the array and divide that sum by your contig length to get the coverage at that timepoint.
Hope that helps.
The code in this repository might be a good starting point: https://github.com/wdecoster/nanoretrotect
Basically, you need to combine mapping information with the time stamps you have in either the fastq or summary file(s).
Thanks a lot for your help! I think this is a good starting point. I will try to implement this in Python and keep you posted.
I have next week off, so don't despair if you don't hear from me for a while :)