I suspect that the pre-formatted blastn human_genomic database currently supplied by NCBI contains many duplicate copies of the same sequences. Is this correct? I think I only need the current (GRCh37.p5) release of the human genome, but blastn typically gives about a dozen identical matches for each short (36 nt) query. This must be a common problem. Has anyone built a version of the database which contains only GRCh37 sequences?
A related problem is that bowtie uses its own database format and the version I have is slightly older than NCBI's current reference sequence. At least that is my current thought about why output from bowtie and blastn do not tie up. Perhaps this is also a common problem? Has anyone succeeded in building the blastn and bowtie databases from a common source? Perhaps this is already available but I have missed it?
Any help or suggestions would be most welcome
Bill
ps: the font on this www page is too small:-(
Thanks Sean. This was pretty much what I feared. (If some clever person has already done this, it would be better to re-use it). I will have another look at NCBI and see if I can easily filter out the non-GRCh37 sequences.
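For what it is worth, the filtering step could be sketched in Python roughly as below. This is only a sketch under an assumption I have not checked against the real database: that the header (defline) of every wanted sequence contains the text "GRCh37" and the headers of the unwanted copies do not. The function name `filter_fasta` and the example headers are made up for illustration.

```python
def filter_fasta(lines, keyword="GRCh37"):
    """Yield the lines of every FASTA record whose header contains keyword.

    ASSUMPTION: the assembly name appears in the header line of the
    sequences we want to keep. Check a few real headers before relying on it.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            # A new record starts here; decide whether to keep it.
            keep = keyword in line
        if keep:
            yield line


# Small in-memory demonstration with invented headers and sequences.
example = [
    ">seq1 Homo sapiens chromosome 1, GRCh37.p5 Primary Assembly\n",
    "ACGTACGT\n",
    ">seq2 Homo sapiens chromosome 1, alternate assembly\n",
    "ACGTACGT\n",
]
kept = list(filter_fasta(example))
print("".join(kept), end="")
```

For a real file the same function can be fed an open file handle, e.g. `filter_fasta(open("in.fa"))`, writing the yielded lines to a new FASTA file, since it only ever holds one line in memory.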
Do you know why the NCBI blastn database contains so many duplicates (am I right in thinking of them as duplicates)?
ps. It looks like bowtie now has a prebuilt v37 database.
BTW: bowtie 2's MANUAL says "Bowtie 2's .bt2 index format is different from Bowtie 1's .ebwt format, and they are not compatible with each other."
Pity I did not know that before I downloaded h_sapiens_37_asm.ebwt.zip
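On the earlier question of building the blastn and bowtie databases from a common source: one possible approach is to build every index from the same FASTA file, so that all tools see exactly the same sequences. The sketch below only assembles the command lines. It assumes BLAST+ (`makeblastdb`), Bowtie 1 (`bowtie-build`) and Bowtie 2 (`bowtie2-build`) are installed, and the file name `GRCh37_filtered.fa` and index name `GRCh37` are placeholders, not real downloads.

```python
import shlex


def build_commands(fasta, name):
    """Return the index-building commands for one shared FASTA file.

    ASSUMPTION: BLAST+, Bowtie 1 and Bowtie 2 are on the PATH.
    The three tools write files with different extensions, so the
    same base name can be reused for all of them.
    """
    return [
        # Nucleotide BLAST database for blastn.
        ["makeblastdb", "-in", fasta, "-dbtype", "nucl", "-out", name],
        # Bowtie 1 index (.ebwt files).
        ["bowtie-build", fasta, name],
        # Bowtie 2 index (.bt2 files), not interchangeable with Bowtie 1.
        ["bowtie2-build", fasta, name],
    ]


# Print the commands; to actually run one, pass the list to
# subprocess.run(cmd, check=True).
for cmd in build_commands("GRCh37_filtered.fa", "GRCh37"):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building from one file also means any difference between blastn and bowtie output can no longer be blamed on mismatched reference versions.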