Question

Blog: Want to use fastq-dump to download PacBio data? Read this first

8


7.2 years ago

pmarijon ▴ 140

TL;DR: I think not; prefer dextractor.

Recently I downloaded the NCTC11131 data set stored at EBI via the fastq-dump tool. I wanted to assemble it with miniasm and minimap in order to reproduce results from the HINGE paper.

The resulting assembly was very fragmented, with 34 contigs, whereas HINGE's authors obtained only 2 contigs with the same dataset and software.

After discussing with HINGE's authors, we confirmed that we used the same versions of minimap and miniasm, but they downloaded the bas.h5 files directly and then used dextractor to extract the reads.

When I used the same procedure as HINGE's authors to get the reads, I got an assembly similar to theirs.

Thus the difference lies in the way the reads are obtained.

With fastq-dump we have:

Total sequences: 162577
Total length: 1366.615824 Mb
Longest sequence: 80.922 kb
Shortest sequence: 3 b
Mean Length: 8.405 kb
Median Length: 4.314 kb
N10: 3615 sequences; L10: 32.743 kb
N50: 25330 sequences; L50: 20.172 kb
N90: 71907 sequences; L90: 5.781 kb

With dextractor we have:

Total sequences: 167025
Total length: 769.911448 Mb
Longest sequence: 32.006 kb
Shortest sequence: 500 b
Mean Length: 4.609 kb
Median Length: 3.497 kb
N10: 4379 sequences; L10: 14.311 kb
N50: 39953 sequences; L50: 5.903 kb
N90: 121202 sequences; L90: 2.48 kb

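fastq-dump extracts far more bases than dextractor (1366.6 Mb versus 769.9 Mb) and its reads are longer (mean 8.4 kb versus 4.6 kb), even though it outputs slightly fewer reads (162577 versus 167025).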
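For anyone who wants to run the same kind of comparison on their own reads, here is a minimal sketch of how such length statistics could be computed from a list of read lengths. The function names are made up for illustration, and this is not the tool that produced the numbers above. For a given fraction of the total bases it returns both the number of reads needed (longest first) and the length of the last read added, so you can label them N50/L50 whichever way your convention goes.

```python
from statistics import mean, median


def nx_stats(lengths, fraction):
    """Return (count, length) for the given fraction of total bases.

    Reads are sorted longest first and accumulated until they cover at
    least `fraction` of all bases. `count` is how many reads were needed
    and `length` is the length of the last read added.
    """
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return count, length
    raise ValueError("lengths must be non-empty")


def summarize(lengths):
    """Collect basic length statistics for a set of reads."""
    half_count, half_length = nx_stats(lengths, 0.5)
    return {
        "total_sequences": len(lengths),
        "total_length": sum(lengths),
        "longest": max(lengths),
        "shortest": min(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "half_count": half_count,
        "half_length": half_length,
    }


# Toy lengths, not taken from the data set discussed in this post.
toy = [10, 8, 6, 4, 2]
print(summarize(toy))
```

Running `summarize` on the read lengths from each extraction method gives two dictionaries that can be compared side by side.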

When mapping the dextractor reads against the fastq-dump reads, we found that dextractor reads are often contained within fastq-dump reads.

fastq-dump id: ERR972361.54 read length: 9565
    begin   end dextractor_id
    3216    7116    m150526_220338_00127_c1008…1823177310081531_s1_p0/53/3210_7121
    7171    8309    m150526_220338_00127_c1008…1823177310081531_s1_p0/53/7167_8310
fastq-dump id: ERR972361.161 read length: 18592
    begin   end dextractor_id
    0   11058   m150526_220338_00127_c1008…1823177310081531_s1_p0/160/0_11058
    3784    11030   m150526_220338_00127_c1008…1823177310081531_s1_p0/160/11102_18592
    11104   18589   m150526_220338_00127_c1008…1823177310081531_s1_p0/160/11102_18592
fastq-dump id: ERR972361.192 read length: 23763
    begin   end dextractor_id
    86  6226    m150526_220338_00127_c10080…1823177310081531_s1_p0/191/6351_12100
    1733    6303    m150526_220338_00127_c1008…1823177310081531_s1_p0/191/1729_6304
    6350    12097   m150526_220338_00127_c1008…1823177310081531_s1_p0/191/6351_12100
    6434    11362   m150526_220338_00127_c1008…1823177310081531_s1_p0/191/18486_23763
    12146   18443   m150526_220338_00127_c1008…1823177310081531_s1_p0/191/12145_18442
    18488   23762   m150526_220338_00127_c1008…1823177310081531_s1_p0/191/18486_23763
fastq-dump id: ERR972361.196 read length: 8567
    begin   end dextractor_id
    4380    5912    m150526_220338_00127_c1008…1823177310081531_s1_p0/195/4380_5915
    5963    8376    m150526_220338_00127_c1008…1823177310081531_s1_p0/195/5961_8376
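The dextractor IDs in the listing above end in hole/start_end, and the start and end values line up closely with where each read maps on the fastq-dump read. A small sketch like the following can measure the gaps between consecutive dextractor reads from the same hole. `MOVIE` is a placeholder because the real movie name is truncated in the listing, and the function names are hypothetical.

```python
def parse_subread_id(read_id):
    """Split 'movie/hole/start_end' into (movie, hole, start, end)."""
    movie, hole, span = read_id.rsplit("/", 2)
    start, end = span.split("_")
    return movie, int(hole), int(start), int(end)


def gaps_between_subreads(read_ids):
    """Return the gaps between consecutive subreads of one hole."""
    # Keep only (start, end) and order the subreads along the raw read.
    spans = sorted(parse_subread_id(r)[2:] for r in read_ids)
    return [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:])
    ]


# "MOVIE" is a placeholder: the real movie name is truncated in the post.
ids_hole_191 = [
    "MOVIE/191/1729_6304",
    "MOVIE/191/6351_12100",
    "MOVIE/191/12145_18442",
    "MOVIE/191/18486_23763",
]
print(gaps_between_subreads(ids_hole_191))  # [47, 45, 44]
```

For hole 191 the gaps are all about 45 bases. That would be consistent with a short sequence being cut out between the pieces, but this is an interpretation of the coordinates, not something checked here.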

So a question arises: does fastq-dump fail to extract PacBio subreads (or extract them badly) and output only raw reads?

I didn't find any information recommending against fastq-dump for PacBio data sets, but I may have missed something. Rob Edwards has already pointed out that fastq-dump is not well documented: https://edwards.sdsu.edu/research/fastq-dump/

In any case, I think it's safer not to use fastq-dump for extracting PacBio reads, and I would recommend using dextractor.

Versions of the tools used:

minimap 0.2-r123
miniasm 0.2-r128
fastq-dump 2.8.2
dextractor 1.0p2

fastq-dump pacbio • 6.3k views

ADD COMMENT • link updated 2.3 years ago by Ram 45k • written 7.2 years ago by pmarijon ▴ 140

0


Hi pmarijon,

Although I suspect it's not intended as such, this post looks a lot like a question, perhaps partly because of the question in the title. Could you change the title to make it clearer that you are reporting your findings rather than asking a question?

Cheers,
Wouter

ADD REPLY • link 7.2 years ago by WouterDeCoster 48k

0


Hi WouterDeCoster,

Thanks, I changed the title. I hope it's clearer now? (Now this is a question :) )

Pierre

ADD REPLY • link 7.2 years ago by pmarijon ▴ 140

4


I was thinking of something more like "Don't use fastq-dump for PacBio data", to be explicit and not look like a click-bait article, but okay :-)

ADD REPLY • link 7.2 years ago by WouterDeCoster 48k

6


"You won't believe what happened to this PhD student after he ran fastq-dump"

ADD REPLY • link 7.2 years ago by Rayan Chikhi ★ 1.6k

0


"You'd be surprised by the incredible story of this simple student when he uses the wrong tools to get reads"

ADD REPLY • link 7.2 years ago by pmarijon ▴ 140

0


Question being: how would one know that fastq-dump is the wrong tool in this case?

ADD REPLY • link 7.2 years ago by bastien.chevreux • 0

0


Hi,

Sorry, I don't understand your question.

ADD REPLY • link 7.1 years ago by pmarijon ▴ 140

1


I think Bastien wants to know why we think the output of fastq-dump is "wrong". I recall it has to do with how the raw .bax.h5 PacBio files get converted to FASTQ. There seem to be additional steps needed to split raw reads and avoid chimeric reads. dextractor performs them but fastq-dump, in that case, doesn't.
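To make the splitting step concrete, here is a rough sketch of cutting one raw read into pieces at known coordinates. It assumes 0-based, end-exclusive (start, end) spans are already available. The function name and the 500-base minimum length are assumptions for illustration, not a description of what dextractor actually does internally.

```python
def split_raw_read(sequence, spans, min_length=500):
    """Cut a raw read into subreads at the given (start, end) spans.

    Spans are assumed to be 0-based and end-exclusive. Pieces shorter
    than `min_length` are dropped; 500 is an assumed value, not a
    documented dextractor setting.
    """
    pieces = []
    for start, end in spans:
        piece = sequence[start:end]
        if len(piece) >= min_length:
            pieces.append((start, end, piece))
    return pieces


raw = "ACGT" * 500  # toy 2000-base raw read
spans = [(0, 900), (950, 1200), (1250, 2000)]
for start, end, piece in split_raw_read(raw, spans):
    print(start, end, len(piece))
```

With the toy input, the middle span is only 250 bases long and is dropped, so two pieces are printed. A read left unsplit would instead carry all three spans plus whatever sits between them as one long sequence.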

ADD REPLY • link 7.0 years ago by Rayan Chikhi ★ 1.6k

2


Or "Comparison between fastq-dump and dextractor for downloading and processing PacBio data".

ADD REPLY • link 7.2 years ago by cpad0112 21k

score 1 · Answer 1 · 2018-05-11

1


7.1 years ago

shengweima ▴ 60

Can you share the detailed dextractor command? How did you filter the raw file?


ADD COMMENT • link 7.1 years ago by shengweima ▴ 60

0


I just ran dextract on the bax.h5 files without any options.

dextract *.bax.h5

I didn't apply any filter to the raw files.

ADD REPLY • link 7.1 years ago by pmarijon ▴ 140