6.4 years ago
nkinney06
I recently installed BLAST and downloaded the precomputed human_genomic.*tar.gz database available here:
ftp://ftp.ncbi.nlm.nih.gov/blast/db/
I tested my installation with the following FASTA file:
cat test_query.fa
>chr13:83987454-83987503
GCTGGGTGGTCAGCGCTGGTTCCATGGGCAGTAATGATTTCCTCTGTTTT
When I BLAST against my local database, I see the primary assembly but also many additional hits:
>NC_000013.11 Homo sapiens chromosome 13, GRCh38.p7 Primary Assembly <- matches my test query
Query 1 GCTGGGTGGTCAGCGCTGGTTCCATGGGCAGTAATGATTTCCTCTGTTTT 50
Sbjct 83987454 GCTGGGTGGTCAGCGCTGGTTCCATGGGCAGTAATGATTTCCTCTGTTTT 83987503
>NT_024524.15 Homo sapiens chromosome 13 genomic scaffold, GRCh38.p7 Primary
Query 1 GCTGGGTGGTCAGCGCTGGTTCCATGGGCAGTAATGATTTCCTCTGTTTT 50
Sbjct 65579348 GCTGGGTGGTCAGCGCTGGTTCCATGGGCAGTAATGATTTCCTCTGTTTT 65579397
>GL583019.1 Homo sapiens unplaced genomic scaffold scaffold_39, whole genome
Query 1 GCTGGGTGGTCAGCGCTGGTTCCATGGGCAGTAATGATTTCCTCTGTTTT 50
Sbjct 731735 GCTGGGTGGTCAGCGCTGGTTCCATGGGCAGTAATGATTTCCTCTGTTTT 731686
>Lots more results...
My question is: what is the source of all the additional sequences in this BLAST database?
I have looked at the README (available at ftp://ftp.ncbi.nlm.nih.gov/blast/db/README) but the information there is not very thorough. Is there a complete list of what's in this database? Thanks!
This is better than the README file, but when I use BLAST it says
Perhaps the database also includes some older assemblies and unplaced contigs?
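A quick way to check that guess is to tally the hits by accession prefix. The sketch below is only an illustration: it assumes the BLAST text output has been saved to a string, relies on the NCBI RefSeq prefix conventions (NC_ for complete molecules such as chromosomes, NT_ and NW_ for contigs or scaffolds), and lumps every other accession into a catch-all bucket. The names `classify` and `tally_hits` are made up for this example.

```python
import re
from collections import Counter

# RefSeq accession prefixes as documented by NCBI.
# Anything else is lumped into "GenBank/other" (a simplification).
REFSEQ_PREFIXES = {
    "NC_": "RefSeq complete molecule (e.g. chromosome)",
    "NT_": "RefSeq contig or scaffold",
    "NW_": "RefSeq contig or scaffold (primarily WGS)",
}


def classify(accession):
    """Return a coarse label for an accession based on its prefix."""
    for prefix, label in REFSEQ_PREFIXES.items():
        if accession.startswith(prefix):
            return label
    return "GenBank/other"


def tally_hits(blast_text):
    """Count subject sequences per category from '>' definition lines."""
    counts = Counter()
    for line in blast_text.splitlines():
        match = re.match(r">(\S+)", line)
        if match:
            counts[classify(match.group(1))] += 1
    return counts


sample = """\
>NC_000013.11 Homo sapiens chromosome 13, GRCh38.p7 Primary Assembly
>NT_024524.15 Homo sapiens chromosome 13 genomic scaffold, GRCh38.p7 Primary
>GL583019.1 Homo sapiens unplaced genomic scaffold scaffold_39, whole genome
"""
for label, count in tally_hits(sample).items():
    print(count, label)
```

On the three hits shown in the question, this gives one entry in each of three categories, which fits the idea that the same region appears once as a chromosome record, once as a scaffold record, and once in a separate GenBank scaffold.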
Take a look to see what is included using this command:
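For instance, with the BLAST+ tools installed, `blastdbcmd` can report what a database contains. The Python sketch below is one possible way to do this and is not necessarily the command originally meant here. It assumes the database is named `human_genomic` and can be found in the current directory or via the `BLASTDB` environment variable. The helper names `blastdbcmd_args` and `list_db` are made up for illustration.

```python
import shutil
import subprocess


def blastdbcmd_args(db="human_genomic", info_only=False):
    """Build a blastdbcmd argument list for inspecting a BLAST database."""
    if info_only:
        # -info prints a summary of the database (title, sequence count, etc.).
        return ["blastdbcmd", "-db", db, "-info"]
    # -entry all dumps every sequence in the database;
    # %a = accession, %l = sequence length, %t = sequence title.
    return ["blastdbcmd", "-db", db, "-entry", "all", "-outfmt", "%a %l %t"]


def list_db(db="human_genomic", info_only=False):
    """Run blastdbcmd and return its standard output as text."""
    if shutil.which("blastdbcmd") is None:
        raise RuntimeError("blastdbcmd not found on PATH (is BLAST+ installed?)")
    result = subprocess.run(
        blastdbcmd_args(db, info_only),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Equivalent shell command: blastdbcmd -db human_genomic -info
    print(" ".join(blastdbcmd_args(info_only=True)))
```

Calling `list_db()` without `info_only` returns one line per sequence, so the output can be searched for the accessions that show up as unexpected hits.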
That said, NCBI offers something different on their human genome BLAST page, which is where I captured the above screenshot.