I am trying to use the human_genomic_transcript database. For some reason it is not on the NCBI FTP site, although according to this it should be. So instead I went to https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39, clicked "Download Assembly" at the top right of the page, and downloaded the "RNA FASTA", "RNA from genomic FASTA", and "CDS from genomic FASTA" files.
The database builds run successfully, with no errors given. However, if I look in the directory there is an *.ndb-lock file. I have no idea why it is there; it is not there for the other databases I have built from FASTA files, so I am wondering if it is the cause of the error in the title.
Yes, it happens when doing the "blastn" search, specifically with the dc-megablast task. However, the error clearly states "BLAST Database error" :) so I don't think it's my query sequences (besides, they were taken from NCBI and simply pasted into a text file).
BUT BUT BUT BUT!
If I build my database WITHOUT the "-parse_seqids" parameter, IT WORKS! I also don't know exactly what -parse_seqids does. The .fna file has the ">sequence info here" setup, and when doing the blastn search, that sequence info shows up in the results, so I am not sure why that parameter is required?
Do you have spaces in the FASTA header descriptors? You possibly do. -parse_seqids is probably dropping the text after the first space, making the IDs non-unique. You can remove the spaces in the FASTA headers (replace them with _) and try again to see if the error goes away.
There are spaces, yes. This is the default from the NCBI assembly though.
Here's the output now, when using "makeblastdb" this is the error:
BLAST Database creation error: Near line 1, the local id is too long. Its length is 183 but the maximum allowed local id length is 50. Please find and correct all local ids that are too long.
So I am guessing it was stopping after the first space.
Here is the first sequence in the file I am trying to make into a DB:
I really need all the info, so I can't shorten the first line. Also, I remember making a database out of the human genomic FASTA I downloaded without any issues, and that had spaces and such?
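To make the length problem concrete, here is a minimal sketch that compares the ID as written (the first token after ">") with the ID you get once every space is replaced by "_". The two-record file and its headers are made up for illustration, and the assumption that -parse_seqids treats only the first whitespace-delimited token as the ID is my understanding, not something confirmed in this thread.

```shell
# Hypothetical two-record FASTA; the headers are invented for illustration.
workdir=$(mktemp -d)
cat > "$workdir/sample.fna" <<'EOF'
>NM_000000.1 Homo sapiens example gene (EXMPL), transcript variant 1, mRNA
ACGTACGTACGT
>NR_000000.1 Homo sapiens example non-coding RNA (EXNC), long non-coding RNA
TTGACCA
EOF

# ID as written: the first whitespace-delimited token after ">", plus its length.
ids=$(awk '/^>/ { id = substr($1, 2); print id, length(id) }' "$workdir/sample.fna")
echo "$ids"

# ID length if every space is replaced with "_": the whole header becomes one token.
joined=$(awk '/^>/ { h = substr($0, 2); gsub(/ /, "_", h); print length(h) }' "$workdir/sample.fna")
echo "$joined"
```

With the spaces left in place the IDs are short accessions; with underscores the whole header counts as the ID, which is how a 183-character "local id" can appear.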
I see an *.ndb-lock file in the directory: "xmc_cds_database.ndb-lock".
Although the command worked on a computing platform, it failed on my PC (Ubuntu 18.04 LTS on Windows 10),
and when I delete the "-parse_seqids" option, it works there too.
```
[ws@BIG /mnt/g/project/sipanchong/NCBI_genome/xianmaochong]$ makeblastdb -dbtype nucl -in /mnt/g/project/sipanchong/NCBI_genome/xianmaochong/xianmaochong_cds.fa -title xmc_cds -out xmc_cds_database -parse_seqids

Building a new DB, current time: 05/18/2020 22:59:03
New DB name: /mnt/g/project/sipanchong/NCBI_genome/xianmaochong/xmc_cds_database
New DB title: xmc_cds
Sequence type: Nucleotide
Deleted existing Nucleotide BLAST database named /mnt/g/project/sipanchong/NCBI_genome/xianmaochong/xmc_cds_database
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 8056 sequences in 0.231933 seconds.
terminate called after throwing an instance of 'lmdb::corrupted_error'
what(): mdb_dbi_open: MDB_CORRUPTED: Located page was wrong type
Aborted (core dumped)
```
So I didn't include -parse_seqids. And when I type the command: blastdbcmd -db RNA_HUMAN -entry NR_118908.1 I get: Error: [blastdbcmd] DB contains no accession info.
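For anyone who wants to reproduce this on something small, here is a rough sketch that builds a tiny database with -parse_seqids and then looks up one record by ID. The file, the record names (seq1, seq2) and the database name tiny_db are all made up, and the block simply skips the BLAST+ calls if the tools are not on the PATH.

```shell
# Hypothetical two-record FASTA; names and sequences are invented for illustration.
workdir=$(mktemp -d)
cat > "$workdir/tiny.fna" <<'EOF'
>seq1 hypothetical example record
ACGTACGTACGTACGTACGT
>seq2 another hypothetical record
TTGACCATTGACCATTGACC
EOF

if command -v makeblastdb >/dev/null 2>&1 && command -v blastdbcmd >/dev/null 2>&1; then
    # Build with -parse_seqids so the IDs are indexed for lookup.
    makeblastdb -in "$workdir/tiny.fna" -dbtype nucl -parse_seqids -out "$workdir/tiny_db"
    # Fetch one record by its ID; this is the call that fails without accession info.
    blastdbcmd -db "$workdir/tiny_db" -entry seq1
    result="ran"
else
    result="skipped: BLAST+ not on PATH"
    echo "$result"
fi
```

Building the same file without -parse_seqids and repeating the blastdbcmd call should be a quick way to check whether you hit the same "DB contains no accession info" message.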
Out of curiosity I downloaded v2.9.0+, and with that version's makeblastdb everything works, including -parse_seqids. I can search and everything is as expected. I even reinstalled my original version and it still didn't work.
So for now, I am going to use v2.9.0+. I may not in the future, but I was right that -parse_seqids works from the first line of each sequence, the one identified by ">". This really isn't the best solution, so I am going to contact NCBI to see if they have any info for me.
EDIT WITH SOLUTION: So, the real problem is due to: BLASTDB_LMDB_MAP_SIZE=1000000
This environment variable is necessary for the makeblastdb command, BUT the larger your files, the larger that size needs to be. If I set it to 1GB, everything runs fine and the results are exactly what is expected.
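A minimal sketch of that fix, assuming the value is interpreted in bytes (so 1GB is written as 1000000000). The input file name rna.fna is a placeholder; RNA_HUMAN is the database name from the blastdbcmd command above. The makeblastdb call only runs if the tool and the input file are actually present.

```shell
# Assumption: the value is in bytes, so roughly 1 GB is 1000000000.
export BLASTDB_LMDB_MAP_SIZE=1000000000

# Placeholder input name; substitute the FASTA file you downloaded.
input=rna.fna
if command -v makeblastdb >/dev/null 2>&1 && [ -f "$input" ]; then
    makeblastdb -dbtype nucl -in "$input" -out RNA_HUMAN -parse_seqids
else
    echo "makeblastdb or $input not found; BLASTDB_LMDB_MAP_SIZE is $BLASTDB_LMDB_MAP_SIZE"
fi
```

The export has to happen in the same shell session that runs makeblastdb, or be added to your shell profile, otherwise the variable will not be seen.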
Thanks all for the help and maybe this will help someone else
BLAST Database creation error: Near line 1, the local id is too long.
Its length is 183 but the maximum allowed local id length is 50.
Please find and correct all local ids that are too long.
Since the error message is clear, you don't have any option but to stay within that limit. The -parse_seqids flag is needed for associating the results with NCBI's taxonomic database. I don't think you need to do that, so you could omit that option.
If you are trying to align NGS data to a transcriptome, you could look into aligners like GMAP/GSNAP as an alternative.
Ahhhhh. I completely read that wrong. I thought you needed -parse_seqids if the sequences had a local id. But -parse_seqids is only needed to pair the local id with a supplied taxid file. I see now!
Thank you, thank you. So that fixes my error, but it doesn't help me understand why the .ndb-lock file would exist once the process is completed. That may just be an "under the hood" kind of thing.
Do you get the error while running a blastn search? Maybe the problem is with your query sequences rather than with your database ;)
I see an *.ndb-lock file in the directory: "xmc_cds_database.ndb-lock". Although the command worked on a computing platform, it failed on my PC (Ubuntu 18.04 LTS on Windows 10),
and when I delete the "-parse_seqids" option, it works there too.
What's wrong with my Ubuntu?