Question

Comparing bowtie h_sapiens_asm with blastn -task blastn-short -db human_genomic

0

13.1 years ago

W Langdon ▴ 90

I suspect that the current pre-formatted blastn human_genomic database supplied by NCBI contains many duplicated copies of the same sequences. Is this correct? I think I only need the current (GRCh37.p5) release of the human genome, but blastn typically gives about a dozen identical matches for each short (36 nt) query. This must be a common problem. Has anyone built a version of the database which contains only GRCh37 sequences?

A related problem is that bowtie uses its own database format, and the version I have is slightly older than NCBI's current reference sequence. At least, that is my current guess as to why the output from bowtie and blastn does not agree. Perhaps this is also a common problem? Has anyone succeeded in building the blastn and bowtie databases from a common source? Perhaps this is already available but I have missed it?

Any help or suggestions would be most welcome.

Bill

ps: the font on this www page is too small:-(

blast bowtie • 3.4k views

ADD COMMENT • link 13.0 years ago by W Langdon ▴ 90

score 1 · Answer 1 · 2011-10-26

1

13.1 years ago

Sean Davis 27k

You can build a blast database using formatdb. The bowtie-build command (comes with bowtie) will build a bowtie index. If you use the same FASTA input to both, you will be as close as possible to having the same database. However, BLAST and bowtie are entirely different algorithms built to solve different but related problems, so there is little chance that the results using each will be identical. Assuming that your data are from next-gen sequencing, I'm not sure that using blast is going to be your best bet, but only you know your application.
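As a rough sketch of the "same FASTA input to both" idea, the Python below assembles one BLAST command and one bowtie command from a single reference file. It assumes BLAST+ is installed, where `makeblastdb` is the counterpart of the older `formatdb` (the question's `-task blastn-short` is BLAST+ syntax). The file name `GRCh37.p5.fa` and the index name `grch37_p5` are placeholders, not files mentioned in this thread. By default it only prints the commands.

```python
import shlex
import subprocess


def build_commands(fasta, name, bowtie_builder="bowtie-build"):
    """Return (blast_cmd, bowtie_cmd) that index the same FASTA file."""
    # makeblastdb: nucleotide database written under the prefix `name`
    blast_cmd = ["makeblastdb", "-in", fasta, "-dbtype", "nucl", "-out", name]
    # bowtie-build (or bowtie2-build): <reference FASTA> <index basename>
    bowtie_cmd = [bowtie_builder, fasta, name]
    return blast_cmd, bowtie_cmd


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder file and index names, for illustration only.
    run_all(build_commands("GRCh37.p5.fa", "grch37_p5"))
```

Passing `bowtie_builder="bowtie2-build"` would target bowtie 2 instead. Even with identical input, the two tools may still report different hits, for the reasons given above.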

ADD COMMENT • link 13.1 years ago by Sean Davis 27k

0

Thanks Sean. This was pretty much what I feared. (If some clever person has already done this, it would be better to re-use it). I will have another look at NCBI and see if I can easily filter out the non-GRCh37 sequences.

Do you know why the NCBI blastn database contains so many duplicates (am I right in thinking of them as duplicates)?
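One way to explore both points (filtering to GRCh37 and checking for duplicates) is to work on a FASTA dump of the database; with BLAST+, `blastdbcmd -db human_genomic -entry all` should produce one. The sketch below is only an illustration: it assumes that GRCh37 records can be recognised by the text "GRCh37" in their header line, which should be checked against the real headers, and the demo headers are made up. The duplicate check finds only records whose sequences are exactly identical; a 36 nt query can also match near-identical copies of a locus in otherwise different sequences, which this would not flag.

```python
import hashlib
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_assembly(records, label="GRCh37"):
    """Keep records whose header contains `label` (assumed convention)."""
    return [(h, s) for h, s in records if label in h]


def duplicate_groups(records):
    """Return lists of headers that share an identical sequence (case-insensitive)."""
    groups = defaultdict(list)
    for header, seq in records:
        digest = hashlib.sha256(seq.upper().encode()).hexdigest()
        groups[digest].append(header)
    return [headers for headers in groups.values() if len(headers) > 1]


if __name__ == "__main__":
    # Made-up headers and sequences, for illustration only.
    demo = [
        ">seq1 Homo sapiens chromosome 1, GRCh37.p5 Primary Assembly",
        "ACGTACGT",
        ">seq2 Homo sapiens chromosome 1, some other assembly",
        "ACGTACGT",
    ]
    records = list(read_fasta(demo))
    print(len(keep_assembly(records)), "record(s) kept")
    print(duplicate_groups(records))
```

For a whole-genome dump this holds every kept sequence in memory, so a streaming variant would be more practical; the structure would stay the same.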

ADD REPLY • link 13.1 years ago by W Langdon ▴ 90

0

ps. It looks like bowtie now has a prebuilt v37 database.

ADD REPLY • link 13.1 years ago by W Langdon ▴ 90

0

BTW: bowtie 2's MANUAL says:

"Bowtie 2's .bt2 index format is different from Bowtie 1's .ebwt format, and they are not compatible with each other."

Pity I did not know that before I downloaded h_sapiens_37_asm.ebwt.zip.
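A small check like the following could catch the mismatch before a long download is put to use. It is a sketch that relies only on the two suffixes quoted from the manual (`.ebwt` and `.bt2`); the function name and the example file names are hypothetical.

```python
from pathlib import Path

# Suffixes taken from the manual text quoted above.
INDEX_SUFFIXES = {
    ".ebwt": "bowtie1",
    ".bt2": "bowtie2",
}


def index_kinds(filenames):
    """Return the set of index formats implied by the file suffixes."""
    kinds = set()
    for name in filenames:
        suffix = Path(name).suffix  # last suffix only, e.g. ".ebwt"
        if suffix in INDEX_SUFFIXES:
            kinds.add(INDEX_SUFFIXES[suffix])
    return kinds


if __name__ == "__main__":
    # Hypothetical names of files unpacked from an index archive.
    files = ["h_sapiens_37_asm.1.ebwt", "h_sapiens_37_asm.rev.1.ebwt"]
    print(index_kinds(files))
```

If the result does not contain the format expected by the aligner in use, the index would need to be rebuilt from FASTA with the matching build tool.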

ADD REPLY • link 13.1 years ago by W Langdon ▴ 90

score 0 · Answer 2 · 2011-10-28

0

13.1 years ago

W Langdon ▴ 90

I have created a new version of the bowtie2 database build script, make_h_sapiens_ncbi37.p5 (my thanks to Ben Langmead). Run time is approximately 3.5 hours.

Bill

See http://www.cs.ucl.ac.uk/staff/W.Langdon/installing_bowtie2

ADD COMMENT • link 13.1 years ago by W Langdon ▴ 90