Question

How To Compare The Performance Of Bwa And Bowtie

1

12.9 years ago

Arpssss ▴ 40

I am experimenting with Bowtie vs. BWA and need to run experiments on reads of various lengths, so I am looking for datasets with read lengths of 180, 200, and 240 bp. Can anybody tell me where and how I can get such datasets? I have tried searching here but was unable to find any. Thanks in advance.

alignment bwa bowtie read genome • 5.0k views

Updated 12.9 years ago by Ashutosh Pandey 12k • written 12.9 years ago by Arpssss ▴ 40

score 5 · Answer 1 · 2012-08-16

5

12.9 years ago

Istvan Albert 102k

You may also want to experiment by generating reads from different genomes with a read simulator. It is quite an eye-opening experience to see perfect reads fail to map back to the genome.
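To make the simulation idea concrete, here is a minimal sketch of a toy simulator that samples error-free ("perfect") reads of a chosen length from a reference. It is only an illustration, not a substitute for a real read simulator: the function name `simulate_reads`, the toy reference, and the seed are all made up for this example, and it assumes a single-sequence FASTA and samples the forward strand only.

```shell
#!/bin/sh
# Toy "perfect read" simulator (illustrative sketch, not a real tool).
# Hypothetical names: simulate_reads, toy_ref.fa, the seed value.

simulate_reads() {
    # $1 = single-sequence FASTA, $2 = read length, $3 = number of reads, $4 = seed
    awk -v len="$2" -v n="$3" -v seed="$4" '
        /^>/ { next }                     # skip the FASTA header line
        { ref = ref $0 }                  # join the sequence lines into one string
        END {
            srand(seed)
            span = length(ref) - len + 1  # number of valid start positions
            if (span < 1) exit 1          # reference is shorter than one read
            for (i = 1; i <= n; i++) {
                start = int(rand() * span) + 1
                seq = substr(ref, start, len)
                qual = seq
                gsub(/./, "I", qual)      # constant placeholder quality string
                printf "@sim_%d_pos%d\n%s\n+\n%s\n", i, start, seq, qual
            }
        }' "$1"
}

# Demo on a made-up 63 bp reference: five reads of 20 bp each.
tmp=$(mktemp -d)
printf '>toy\nACGTACGTTAGCCGATAGGCTTAACGGATCCA\nGGATTACAGATTACAGGCATCGATCGGACTA\n' > "$tmp/toy_ref.fa"
simulate_reads "$tmp/toy_ref.fa" 20 5 42 | head -n 4
rm -rf "$tmp"
```

Changing the second argument to 180, 200, or 240 gives reads of exactly those lengths, which can then be aligned with both tools and checked against the start position recorded in each read name.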

0

Bowtie2 comes with a nice read simulator. Istvan is right. Understanding multiple hits and mapping oddities is a great exercise.

Reply • 12.9 years ago by Zev.Kronenberg 12k
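One rough way to look at multiple hits and mapping oddities is to summarise the SAM output of each aligner. The sketch below assumes standard SAM columns (FLAG in column 2, MAPQ in column 5) and counts primary records, unmapped records (FLAG bit 0x4), and mapped records with MAPQ 0, which usually signals an ambiguous placement. The function name `sam_summary` and the toy records are invented for illustration; MAPQ scales differ between aligners, so treat the last number as a hint rather than a like-for-like comparison.

```shell
#!/bin/sh
# Summarise a SAM file (illustrative sketch; sam_summary is a hypothetical name).

sam_summary() {
    # $1 = SAM file produced by either aligner
    awk -F'\t' '
        /^@/ { next }                                     # skip SAM header lines
        int($2 / 256) % 2 || int($2 / 2048) % 2 { next }  # skip secondary (0x100) and supplementary (0x800) records
        {
            total++
            if (int($2 / 4) % 2) unmapped++               # FLAG bit 0x4: record is unmapped
            else if ($5 + 0 == 0) mapq0++                 # mapped, but with MAPQ 0
        }
        END {
            printf "reads\t%d\nunmapped\t%d\nmapped_mapq0\t%d\n", total, unmapped, mapq0
        }' "$1"
}

# Demo on made-up records: one confident hit, one unmapped read,
# one MAPQ 0 hit, and a secondary alignment that is not counted.
tmp=$(mktemp -d)
{
    printf '@HD\tVN:1.6\n'
    printf 'r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n'
    printf 'r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n'
    printf 'r3\t16\tchr1\t200\t0\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n'
    printf 'r3\t272\tchr1\t300\t0\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n'
} > "$tmp/toy.sam"
sam_summary "$tmp/toy.sam"
rm -rf "$tmp"
```

Running the same summary on the BWA and Bowtie output for an identical read set gives a quick first impression of how differently the two tools treat hard-to-place reads.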

score 2 · Answer 2 · 2012-08-16

2

12.9 years ago

Zev.Kronenberg 12k

The Sequence Read Archive (SRA, formerly the Short Read Archive) will have what you want.

http://www.ncbi.nlm.nih.gov/sra/

0

But I have not found any dataset with a 240 bp read length. How can I find one?

Reply • 12.9 years ago by Arpssss ▴ 40

1

That's a really long read for Illumina. I don't know if anyone has sequenced reads that long, or if the Illumina machines are capable of that. I'm assuming you are interested in Illumina, and you want a 240bp read on each end?

If you follow my suggestion below, the longest you'll find from 1000 Genomes are 31 200-bp runs, 1156 202-bp runs, and 32 216-bp runs.

There are a lot of 454 runs with longer reads, but people typically don't use BWA or Bowtie for those.

Reply • 12.9 years ago by matted 7.8k

0

200 bp runs are OK for me. Where can I find them? I have not found any link for that.

Reply • 12.9 years ago by Arpssss ▴ 40

score 2 · Answer 3 · 2012-08-16

It is sometimes hard to find exactly what you want in the SRA. The suggestion to simulate is probably best, but it depends on what you want to do.

If you want to have more data than you can handle, check out the entire set of 1000 Genomes sequencing runs here (~39k sequencing runs total).

You'll have a row for each run like this (edited to come closer to fitting on the screen):

FASTQ_FILE                                          CENTER_NAME  SAMPLE_NAME  POPULATION  EXPERIMENT_ID  INSTRUMENT_PLATFORM  INSTRUMENT_MODEL             LIBRARY_LAYOUT  WITHDRAWN  READ_COUNT  BASE_COUNT
data/NA19238/sequence_read/ERR000018.filt.fastq.gz  BGI          NA19238      YRI         ERX000014      ILLUMINA             Illumina Genome Analyzer     SINGLE          0          9280498     334097928
data/NA19238/sequence_read/ERR000019.filt.fastq.gz  BGI          NA19238      YRI         ERX000014      ILLUMINA             Illumina Genome Analyzer     SINGLE          0          9571982     344591352
data/NA19240/sequence_read/ERR000020.filt.fastq.gz  BGI          NA19240      YRI         ERX000016      ILLUMINA             Illumina Genome Analyzer II  PAIRED          0          149044      5365584

You can look at the READ_COUNT and BASE_COUNT columns to calculate the read length, maybe like this:

$ cut -f 1,24,25 20120522.sequence.index | awk -F'\t' '$2 + 0 > 0 {print $1"\t"$3/$2}'
data/NA19238/sequence_read/ERR000018.filt.fastq.gz  36
data/NA19238/sequence_read/ERR000019.filt.fastq.gz  36
data/NA19240/sequence_read/ERR000020.filt.fastq.gz  36
data/NA19240/sequence_read/ERR000020_1.filt.fastq.gz    36
...

You can look through the results to find runs that meet your specifications, or runs that could meet them if you trimmed the reads.
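If a run has reads longer than you need, trimming them to a fixed length is straightforward. Here is a minimal sketch, assuming an uncompressed FASTQ file with exactly four lines per record; it keeps the first N bases of each read and drops reads shorter than N. The function name `trim_fastq` and the toy records are made up for illustration.

```shell
#!/bin/sh
# Trim every read to a fixed length (illustrative sketch; trim_fastq is a hypothetical name).

trim_fastq() {
    # $1 = uncompressed FASTQ file (4 lines per record), $2 = target read length
    awk -v len="$2" '
        NR % 4 == 1 { header = $0; next }   # read name line
        NR % 4 == 2 { seq = $0; next }      # bases
        NR % 4 == 3 { plus = $0; next }     # separator line
        {                                   # fourth line: quality string
            if (length(seq) >= len)         # reads shorter than the target are dropped
                printf "%s\n%s\n%s\n%s\n", header, substr(seq, 1, len), plus, substr($0, 1, len)
        }' "$1"
}

# Demo on two made-up reads (12 bp and 6 bp), trimmed to 8 bp:
# the first is shortened, the second is dropped.
tmp=$(mktemp -d)
printf '@r1\nACGTACGTACGT\n+\nIIIIIIIIIIII\n@r2\nACGTAC\n+\nIIIIII\n' > "$tmp/toy.fq"
trim_fastq "$tmp/toy.fq" 8
rm -rf "$tmp"
```

The downloaded `.fastq.gz` files would need to be decompressed first, for example with `gzip -dc`. Trimming from the 3' end is a choice made here because base quality usually drops toward the end of a read.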

score 1 · Answer 4 · 2012-08-16

Hi Arpssss,

"matted" is right that you won't find Illumina reads much longer than 200 bp. The 454 sequencer produces reads that can reach 500 bp, but you don't normally use Bowtie or BWA to align those reads (though both tools are now capable of aligning reads up to 1000 base pairs). I did what "matted" explained in the answer right above mine. Here is the link to the tab-delimited file: https://dl.dropbox.com/u/6854830/1000genome_Arpssss.txt . I have added a column called read length at the very end. The spreadsheet lists all runs from the 1000 Genomes Project with an average read length of 150 bp or more.
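The read-length column described above can be approximated directly from the index file. This is a minimal sketch, assuming the same column positions as the `cut -f 1,24,25` command earlier in the thread (file name in column 1, READ_COUNT in column 24, BASE_COUNT in column 25). The names `runs_by_length` and `make_row` and the `example_run_*` rows are made up for the demo and do not correspond to real runs.

```shell
#!/bin/sh
# List runs whose average read length falls in a range (illustrative sketch).
# Hypothetical names: runs_by_length, make_row, example_run_*.

runs_by_length() {
    # $1 = tab-delimited sequence index, $2 = minimum length, $3 = maximum length
    awk -F'\t' -v min="$2" -v max="$3" '
        $24 + 0 > 0 {                      # skips the header and rows without a read count
            len = $25 / $24                # average read length = BASE_COUNT / READ_COUNT
            if (len >= min && len <= max)
                printf "%s\t%.1f\n", $1, len
        }' "$1"
}

make_row() {
    # Demo helper: builds a 25-column row, padding columns 2-23 with "-".
    printf '%s' "$1"
    i=2
    while [ "$i" -le 23 ]; do printf '\t-'; i=$((i + 1)); done
    printf '\t%s\t%s\n' "$2" "$3"
}

# Demo on a tiny stand-in index; only the second data row (200 bp) is in range.
tmp=$(mktemp -d)
{
    make_row FASTQ_FILE READ_COUNT BASE_COUNT
    make_row data/NA19238/sequence_read/ERR000018.filt.fastq.gz 9280498 334097928
    make_row example_run_A.fastq.gz 1000 200000   # made-up row
    make_row example_run_B.fastq.gz 0 0           # made-up row with no reads
} > "$tmp/toy.index"
runs_by_length "$tmp/toy.index" 150 250
rm -rf "$tmp"
```

On the real index, something like `runs_by_length 20120522.sequence.index 180 240` would list candidate runs for the lengths asked about in the question, assuming the column layout holds for that file.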