Blog: Want to use fastq-dump to download PacBio data? Read this first
6.6 years ago
pmarijon ▴ 140

TL;DR: I think you shouldn't; prefer dextractor.

Recently I downloaded the NCTC11131 data set stored at EBI with the fastq-dump tool. I wanted to assemble it with miniasm and minimap in order to reproduce results from the HINGE paper.

The resulting assembly was very fragmented, with 34 contigs, whereas the HINGE authors obtained only 2 contigs with the same dataset and software.

After discussion with the HINGE authors, we confirmed that we used the same versions of minimap and miniasm, but they had downloaded the bas.h5 files directly and then used dextractor to extract the reads.

When I followed the same procedure as the HINGE authors to get the reads, I obtained an assembly similar to theirs.

Thus the difference lies in how the reads are obtained.

With fastq-dump we have:

Total sequences: 162577
Total length: 1366.615824 Mb
Longest sequence: 80.922 kb
Shortest sequence: 3 b
Mean Length: 8.405 kb
Median Length: 4.314 kb
N10: 3615 sequences; L10: 32.743 kb
N50: 25330 sequences; L50: 20.172 kb
N90: 71907 sequences; L90: 5.781 kb

With dextractor we have:

Total sequences: 167025
Total length: 769.911448 Mb
Longest sequence: 32.006 kb
Shortest sequence: 500 b
Mean Length: 4.609 kb
Median Length: 3.497 kb
N10: 4379 sequences; L10: 14.311 kb
N50: 39953 sequences; L50: 5.903 kb
N90: 121202 sequences; L90: 2.48 kb
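
For anyone who wants to run the same kind of comparison on their own files, here is a minimal Python sketch of how such summary values can be computed from read lengths. It is not the tool I used to produce the numbers above; the function names `nx_stats` and `fastq_lengths` are made up for illustration, and the reader assumes a plain four-line-per-record FASTQ. Note that in the listings above the "N" value is a number of sequences and the "L" value is a length, so the function returns both as a pair.

```python
def nx_stats(lengths, fraction):
    """Return (count, length) for a coverage fraction such as 0.5.

    count  = how many of the longest reads are needed to reach
             `fraction` of the total number of bases
    length = length of the shortest read among those
    """
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return count, length
    return 0, 0  # empty input


def fastq_lengths(path):
    """Yield read lengths from a FASTQ file.

    Assumption: exactly four lines per record, sequence on the second line.
    """
    with open(path) as handle:
        for index, line in enumerate(handle):
            if index % 4 == 1:
                yield len(line.rstrip("\r\n"))


# Usage (hypothetical file name):
#   lengths = list(fastq_lengths("reads.fastq"))
#   print(nx_stats(lengths, 0.5))
```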

fastq-dump extracts more bases, and its reads are longer than those from dextractor.

When mapping dextractor reads against fastq-dump ones, we realized that dextractor reads are often contained in fastq-dump reads.

fastq-dump id: ERR972361.54 read length: 9565
    begin   end dextractor_id
    3216    7116    m150526_220338_00127_c1008…1823177310081531_s1_p0/53/3210_7121
    7171    8309    m150526_220338_00127_c1008…1823177310081531_s1_p0/53/7167_8310
fastq-dump id: ERR972361.161 read length: 18592
    begin   end dextractor_id
    0   11058   m150526_220338_00127_c1008…1823177310081531_s1_p0/160/0_11058
    3784    11030   m150526_220338_00127_c1008…1823177310081531_s1_p0/160/11102_18592
    11104   18589   m150526_220338_00127_c1008…1823177310081531_s1_p0/160/11102_18592
fastq-dump id: ERR972361.192 read length: 23763
    begin   end dextractor_id
    86  6226    m150526_220338_00127_c10080…1823177310081531_s1_p0/191/6351_12100
    1733    6303    m150526_220338_00127_c1008…1823177310081531_s1_p0/191/1729_6304
    6350    12097   m150526_220338_00127_c1008…1823177310081531_s1_p0/191/6351_12100
    6434    11362   m150526_220338_00127_c1008…1823177310081531_s1_p0/191/18486_23763
    12146   18443   m150526_220338_00127_c1008…1823177310081531_s1_p0/191/12145_18442
    18488   23762   m150526_220338_00127_c1008…1823177310081531_s1_p0/191/18486_23763
fastq-dump id: ERR972361.196 read length: 8567
    begin   end dextractor_id
    4380    5912    m150526_220338_00127_c1008…1823177310081531_s1_p0/195/4380_5915
    5963    8376    m150526_220338_00127_c1008…1823177310081531_s1_p0/195/5961_8376
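
The dextractor read names in this listing end in `/hole/start_end`, so the mapping coordinates can be compared with the interval encoded in each name. The Python sketch below does that for single rows of the listing. The helper names and the 10-base tolerance are my own choices for illustration, and the read names are copied exactly as shown above, truncation included.

```python
def parse_subread_id(read_id):
    """Split 'movie/hole/start_end' into (movie, hole, start, end)."""
    movie, hole, span = read_id.rsplit("/", 2)
    start, end = span.split("_")
    return movie, int(hole), int(start), int(end)


def matches_own_interval(map_begin, map_end, read_id, tolerance=10):
    """True if the mapped position on the fastq-dump read agrees (within
    `tolerance` bases, an arbitrary choice) with the interval in the name."""
    _, _, start, end = parse_subread_id(read_id)
    return abs(map_begin - start) <= tolerance and abs(map_end - end) <= tolerance


# First row of ERR972361.54: mapped at 3216-7116, name says 3210_7121.
first = "m150526_220338_00127_c1008…1823177310081531_s1_p0/53/3210_7121"
print(matches_own_interval(3216, 7116, first))   # True

# Second row of ERR972361.161: mapped at 3784-11030, name says 11102_18592.
second = "m150526_220338_00127_c1008…1823177310081531_s1_p0/160/11102_18592"
print(matches_own_interval(3784, 11030, second))  # False
```

Rows where the check is true are dextractor reads sitting at the position their name announces inside the fastq-dump read. Rows where it is false are dextractor reads that also map to a different part of the same fastq-dump read, which is what one would expect if that read holds several passes over the same molecule, although I have not verified this.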

So a question arises: does fastq-dump fail to extract PacBio subreads (or extract them badly) and return only raw reads?

I didn't find any information recommending against fastq-dump for PacBio data sets, but I may have missed something. Rob Edwards has already pointed out that fastq-dump is not well documented: https://edwards.sdsu.edu/research/fastq-dump/

In any case, I think it's safer not to use fastq-dump to extract PacBio reads, and I would recommend using dextractor.

Versions of the tools used:

  • minimap 0.2-r123
  • miniasm 0.2-r128
  • fastq-dump 2.8.2
  • dextractor 1.0p2

Hi pmarijon,

Although I suspect it's not intended as such, this post looks a lot like a question, partly because of the question in the title. Perhaps you could change that to make it clearer that you are reporting your findings rather than asking a question.

Cheers,
Wouter


Hi WouterDeCoster,

Thanks, I changed the title. I hope it's clearer now? (Now this is a question :) )

Pierre


I had something more like "Don't use fastq-dump for PacBio data" in mind, to be explicit and not look like a click-bait article, but okay :-)


"You won't believe what happened to this PhD student after he ran fastq-dump"


"You'd be surprised by the incredible story of this simple student when he uses the wrong tools to get reads"


Question being: how would one know that fastq-dump is the wrong tool in this case?


Hi,

Sorry, I don't understand your question.


I think Bastien wants to know why we think the output of fastq-dump is "wrong". I recall it has to do with how the raw .bax.h5 PacBio files get converted to FASTQ. There seem to be additional steps needed to split the raw reads and avoid chimeric reads. dextractor performs them, but fastq-dump, in this case, doesn't.
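
To make the splitting step concrete, here is a rough Python sketch of cutting one raw read into subreads from a list of intervals. It is only an illustration of the idea, not how dextractor is implemented: the function name is hypothetical, the intervals are assumed to be 0-based and end-exclusive, and the 500-base minimum is a guess taken from the shortest dextractor read reported in the post.

```python
def split_raw_read(raw_sequence, intervals, min_length=500):
    """Cut a raw read into subreads.

    intervals  : iterable of (start, end) pairs, assumed 0-based and
                 end-exclusive (an assumption, not a documented fact here)
    min_length : subreads shorter than this are dropped; 500 is a guess
                 based on the shortest dextractor read in the post
    """
    subreads = []
    for start, end in sorted(intervals):
        if end - start >= min_length:
            subreads.append(raw_sequence[start:end])
    return subreads


# Intervals taken from the two dextractor names listed for ERR972361.196;
# the sequence itself is a placeholder of the same length (8567).
raw = "A" * 8567
pieces = split_raw_read(raw, [(4380, 5915), (5961, 8376)])
print([len(p) for p in pieces])  # [1535, 2415]
```

Everything outside the intervals, such as the stretch between 5915 and 5961, is discarded. That is the part a tool returning the whole raw read would keep.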


Or "A comparison between fastq-dump and dextractor for downloading and processing PacBio data".

6.5 years ago
shengweima ▴ 60

Can you share the detailed dextractor command? How did you filter the raw files?


I just ran dextract on the bax.h5 files without any options.

dextract *.bax.h5

I didn't apply any filter to the raw files.
