Nanopore data and downstream analysis
6.7 years ago
MAPK ★ 2.1k

I have some microbiome data sequenced with a Nanopore MinION. For each run, I have pass, fail and skip directories. Within the pass directory, I also have subdirectories numbered 0 to 10. Would someone please explain to me the difference between the pass, fail and skip data, and which data I should be analyzing? I would also like to understand what the numbered subdirectories within the pass directory mean. Thank you for your help.

6.7 years ago

You did not explain how you basecalled the data, but I assume you used live basecalling in MinKNOW. I use albacore, and I would recommend you do the same for future runs [Update July 2019: don't use albacore anymore, use guppy!]. Live basecalling occasionally creates problems, and running albacore later on a server or cluster is usually beneficial.

The categories you asked about:

  • pass: reads have an average quality score > Q7
  • fail: reads have an average quality score < Q7
  • skip: I think these reads were not basecalled due to time constraints; you can still basecall them using albacore

Since I haven't used MinKNOW for basecalling, I'm not entirely sure about the skip part.

Whether you want to combine the pass and fail reads is up to you and depends on your application. I tend to keep them both.

The subdirectories are created per 4000 reads (I believe that's the default) to avoid directories containing far too many files (fast5 format).


Thank you so much for your answer. Yes, the basecalling was done using the live basecalling method. Now I have some more questions:

  1. Would you suggest concatenating all the fasta reads extracted from the fast5 files in all subdirectories?
  2. Do I need to separate 1D and 2D reads, and which type of reads should I be using?
  3. What downstream analyses can I perform, and which tools (besides poretools) can I use for these reads?
  4. Under what circumstances should I use fasta reads extracted from fast5 files vs. the fastq files from the run itself?

Thank you again for your help.

  1. You should be able to get a fastq file, and yes, you can concatenate these. You could also choose to concatenate them per directory and keep multiple fastq files for parallel processing (depending on your needs).
  2. I have no idea which type of sequencing you have performed, but 2D has been deprecated for quite a while now. So is this old data?
  3. I don't know what your biological question is. I've written NanoPack, a set of scripts for visualizing and processing long-read sequencing data, which might be useful for you (feedback welcome).
  4. I don't see a reason to use the fasta reads.
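
For point 1, a minimal sketch of concatenating the per-subdirectory fastq files into one file. The function name `concatenate_fastq` and the `*.fastq` file pattern are assumptions for illustration; adjust the pattern to whatever your basecaller actually wrote.

```python
from pathlib import Path


def concatenate_fastq(pass_dir, output_path):
    """Concatenate every *.fastq file found under pass_dir into output_path.

    Returns the number of input files that were merged.
    """
    pass_dir = Path(pass_dir)
    output_path = Path(output_path)

    # Collect inputs first, skipping the output file in case it lives in pass_dir.
    fastq_files = sorted(
        p for p in pass_dir.rglob("*.fastq")
        if p.resolve() != output_path.resolve()
    )

    with open(output_path, "wb") as out:
        for fastq in fastq_files:
            data = fastq.read_bytes()
            out.write(data)
            # Guard against a file without a trailing newline, which would
            # otherwise glue two records together.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")

    return len(fastq_files)
```

Calling it once per subdirectory instead of on the whole pass directory gives the "multiple fastq files for parallel processing" variant.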

OK, thank you. No, this is new data. For each microbiome sample I also have two directories, each containing multiple fast5 files and one fastq file. Should I use the fastq file generated by live basecalling, or should I convert all the fast5 files to fastq?


I expect the fastq to contain all reads from that folder. You can easily count the files to verify that, although a fast5 file may occasionally fail basecalling and not yield a read.
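
A rough way to do that count, as a sketch. It assumes standard four-line fastq records and single-read fast5 files (one read per file); the helper names are made up for this example.

```python
from pathlib import Path


def count_fastq_reads(fastq_path):
    """Count reads in a fastq file, assuming four lines per record."""
    with open(fastq_path) as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def count_fast5_files(directory):
    """Count *.fast5 files under directory, including subdirectories."""
    return sum(1 for _ in Path(directory).rglob("*.fast5"))
```

If the two numbers differ only slightly, the missing reads are most likely fast5 files that failed basecalling.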


This isn't accurate. Having an average Phred score of >7 is a necessary but not sufficient condition for a read to go into the "pass" category. For 1D-squared runs, the read also needs to cover both strands; if only a single strand goes through the pore, the read will still be basecalled and may get a quality score above 7, but even then it will go into the fail category.

For 1D runs, I can tell that there's some other necessary condition for a read to "pass", because I see reads in the "fail" category with Phred scores above 7 in my data. But I haven't figured out what that condition is, yet.


Ah yes, 1D^2 might be different, but I don't see that being used a lot. Thanks for the heads up!

For the normal 1D reads, are those reads above Q7 according to your own calculation of average quality, or did you use the score from the sequencing_summary.txt?


Using my calculation - which was flawed, as was pointed out at https://bioinformatics.stackexchange.com/q/8735/3144 by... oh, by you. :)
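
For anyone landing here later: the thread itself doesn't spell out the flaw, but a common pitfall with this kind of calculation (an assumption on my part as far as this case goes) is taking the arithmetic mean of the Phred scores directly. Phred scores are logarithmic, so the usual approach is to convert each score to an error probability, average those, and convert back. A minimal sketch, assuming Phred+33 encoded quality strings; both function names are invented for this example.

```python
import math


def mean_quality(quality_string, offset=33):
    """Average read quality computed via per-base error probabilities."""
    scores = [ord(char) - offset for char in quality_string]
    if not scores:
        raise ValueError("empty quality string")
    # Phred Q corresponds to an error probability of 10^(-Q/10).
    mean_error = sum(10 ** (-q / 10) for q in scores) / len(scores)
    return -10 * math.log10(mean_error)


def naive_mean_quality(quality_string, offset=33):
    """Arithmetic mean of the Phred scores, shown only for comparison."""
    scores = [ord(char) - offset for char in quality_string]
    if not scores:
        raise ValueError("empty quality string")
    return sum(scores) / len(scores)
```

The naive mean is never lower than the probability-based one, so a read can look like it is above Q7 with the naive calculation and still sit below the threshold.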
