I have been working with some PacBio RSII data in SMRT Portal version 2.3.0, and when looking through the report files for the P_Filter step of an assembly, I get a pre-filter read total of 150,292 (465.0 Mbp). However, upon uploading the data to SRA (https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR5811295), I see that the number of 'spots' is 163,482 (950.0 Mbp). As far as I can tell from various genome announcement articles, the number of spots should be equivalent to the number of raw reads, but I can find no mention of the number 163,482 anywhere in the files of the associated SMRT Portal job [note: see EDIT below]. Apologies if there's a simple answer to this that I'm overlooking, but can anybody please help me to figure out why this apparent discrepancy exists, and why the number of 'spots' reported at SRA appears nowhere in the SMRT Portal reports?
EDIT: One additional piece of information - the only place I can find the number 163,482 in relation to the job files is as the number of lines in the data/filtered_summary.csv file. Grepping for a 1 in the PassedFilter column gives the expected post-filter read number of 68,619; however, I can't currently figure out the criterion for getting to the 'pre-filter' read number of 150,292...
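The grep check described in the edit can be made a little more explicit by tallying the PassedFilter column directly. This is only a minimal sketch, assuming the summary file is comma-separated and that its first line is a header containing a column named exactly PassedFilter (the column name is taken from the post; the file layout is an assumption). The helper name count_passed_filter is made up for illustration. Note that wc -l also counts a header line if there is one, so the number of data rows may be one less than the line count.

```shell
# Hypothetical helper: tally the PassedFilter column of a filter summary CSV.
# Assumes comma-separated fields and a header row naming the columns.
count_passed_filter() {
    awk -F, '
        { sub(/\r$/, "") }                      # tolerate Windows line endings
        NR == 1 {
            # Locate the PassedFilter column by name in the header row.
            for (i = 1; i <= NF; i++) if ($i == "PassedFilter") col = i
            if (!col) { bad = 1; exit 1 }
            next
        }
        { total++; if ($col == "1") passed++ }  # every other line is one data row
        END {
            if (bad) { print "no PassedFilter column in header" > "/dev/stderr"; exit 1 }
            printf "rows=%d passed=%d failed=%d\n", total, passed, total - passed
        }
    ' "$1"
}
```

Calling count_passed_filter data/filtered_summary.csv prints one line of the form rows=N passed=P failed=F, where "failed" simply means any value other than 1 in that column.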
Just based on your post, I wonder if P_Filter is the number of reads passing filter (not sure what the criteria are if that is the case) whilst the number of spots is all reads. Did you upload a BAM file to the SRA or a FASTQ file and if so, can you type
cat input.fastq|paste - - - - |wc -l
to see how many total sequences you have in your FASTQ file that you uploaded to the SRA?
I uploaded .bax.h5 and .bas.h5 files to SRA, rather than a .fastq file.
Well, perhaps it still has to do with reads passing filter... not sure.
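For anyone who does have a FASTQ file to check, the one-liner above relies on every record occupying exactly four lines. A slightly more defensive sketch is shown here, assuming an uncompressed FASTQ with unwrapped sequence lines; count_fastq_records is a hypothetical helper name, not part of any PacBio or SRA tool.

```shell
# Hypothetical helper: count records in a strict four-line FASTQ file
# (header, sequence, "+", qualities), refusing input that does not fit.
count_fastq_records() {
    awk '
        # The first line of every record must start with "@".
        NR % 4 == 1 && substr($0, 1, 1) != "@" { bad = 1; exit 1 }
        END {
            if (bad || NR % 4 != 0) {
                print "input is not a four-line FASTQ" > "/dev/stderr"
                exit 1
            }
            printf "%d\n", NR / 4
        }
    ' "$1"
}
```

Running count_fastq_records input.fastq prints the record count, or exits non-zero if the file does not look like four-line FASTQ.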
I don't know whether the edit at the bottom of my original post is of any more help? It looks almost as if there's an additional 'pre-pre-filter', but I can't find any mention of it in the output, hence my confusion...