I am trying to create a local BLAST database for a crop with a genome size of 13 Gb. The genome sequence is unpublished; it is currently being assembled in my group. I have the FASTA file for the genome, and the crop has six chromosomes of about 2 Gb each. The current version of ncbi-blast+ is installed on the Linux server I am working on, and I have successfully created the database using the "makeblastdb" command.
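For reference, the database was built with a command along these lines (file names and the database name below are placeholders, not the actual paths):

# build a nucleotide BLAST database from the assembled genome FASTA
# (example file names; adjust to the real paths)
makeblastdb -in genome.fasta -dbtype nucl -parse_seqids -out crop_genome_db -title "crop genome"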
There are thousands of sequences that I would like to align to the local database, and I have formatted the input file accordingly. However, when I run the "blastn" command for the alignment, the process gets "KILLED" on the server. I am fairly sure the problem is not with the script itself, though.
I tried running this on my Windows laptop using Git Bash; it ran for over 45 minutes, producing a little output, but then crashed the system after using up all the memory. A colleague suggested that I run the task as an "sbatch" script on the server, which I did, requesting plenty of memory (up to 500 GB) for the run. Unfortunately, after some time it returned an error file saying "Segmentation fault (core dumped) due to out-of-memory event....". Colleagues of mine have worked with genomes of a similar size and did not run into this problem, but not with chromosomes as large as these. Could this be related to the chromosome size? Please, can anyone suggest how I can handle this?
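The sbatch submission looked roughly like the sketch below (the resource values and file names here are illustrative, not my exact script):

#!/bin/bash
#SBATCH --job-name=blastn_crop     # name shown in the queue
#SBATCH --mem=500G                 # memory requested for the run
#SBATCH --cpus-per-task=16         # threads handed to blastn
#SBATCH --time=48:00:00            # wall-time limit

# align the query sequences against the local database
blastn -query queries.fasta \
       -db crop_genome_db \
       -num_threads 16 \
       -outfmt 6 \
       -out blastn_results.tsv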
Sorry, this isn't a solution, but I am guessing it is a configuration issue, because 500 GB of memory for a 13 GB database sounds more than sufficient.
I would focus in on the "Segmentation fault" error. Two things come to mind: run "ulimit -f -n" in Git Bash and consider changing the file size limit to "unlimited". The open files limit can also be an issue, but I'm guessing it is set to a number higher than the number of DB files for your 13 GB database. There's a bit more info on this in the "Memory Usage" section of the BLAST help.

Good luck!
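For example, something like the following in Git Bash (the values here are only illustrative):

# show the current limits for file size (-f) and open files (-n)
ulimit -f -n

# lift the file-size limit for this shell session
ulimit -f unlimited

# raise the open-files limit if it looks too low (may be capped by the system's hard limit)
ulimit -n 4096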
Thank you for your suggestion. I thought the 500 GB of memory would be sufficient too, but I now think the problem might be associated with the reference genome. I tried a few different things to figure out why it was not working.
First, I took a single chromosome out of the reference genome file, created a database with it, and aligned the sequences to this database. I ran into the same problem as before, so I thought the issue was probably the chromosome size. I then wrote a script to split each chromosome into two sequences (a & b) and concatenate them into a single FASTA file for creating a database. In this case, the chromosomes in the reference file are named chr1a, chr1b, chr2a, chr2b, chr3a, chr3b, chr4a, chr4b, chr5a, chr5b, chr6a and chr6b. With this new reference file, I created a database and aligned the query sequences. This time there was no error, and I got an output. However, when I checked the output file, I discovered that some of the query sequences were trimmed (shorter than in the query file) and, in some cases, reported 100% identity hits when the trimmed region included the SNP.
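For illustration, the split can be done along the lines of the sketch below (this uses samtools faidx with made-up coordinates and file names as an example, assuming samtools is available; it is not the exact script):

# index the reference so regions can be extracted by coordinate
samtools faidx reference.fasta

# extract the two halves of chr1 (the split point is only illustrative)
samtools faidx reference.fasta chr1:1-1000000000          > chr1a.fa
samtools faidx reference.fasta chr1:1000000001-2000000000 > chr1b.fa

# rename the headers so the halves show up as chr1a / chr1b
sed -i '1s/.*/>chr1a/' chr1a.fa
sed -i '1s/.*/>chr1b/' chr1b.fa

# repeat for the remaining chromosomes, then combine into one reference
cat chr*.fa > reference_split.fasta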
I was not sure why this was happening, so instead I tried to work with the original reference genome file. I have about 60K sequences in the query file. After trying several query files (reducing the number of query sequences each time), fewer than 50 query sequences per file seems to work. Unfortunately, the query sequences were trimmed in this output too, some down to 28 nucleotides instead of the full 71.
Any idea why so or how to fix this?
PS: the query sequences are from the same species.
I'm wondering if you have solved the memory problem by reducing the reference file size, but, by the nature of the query/reference, you are simply not getting back that many results.

For the reduced split reference files, i.e. chr1a, chr1b, ... how big are the files for each chromosome, by the way? As a reference point, the preformatted BLAST DB files for ref_euk_rep_genomes are ~2.5-3 GB each. Are the chr1 files larger than this and the chr1a/chr1b files smaller?

One final thing to check on the reference: although it's been reported to be a bit inconsistent, the downloaded blast+ package includes a script that "checks" the reference DB being used. Maybe try that to see if it reports anything of concern about the reference?

./ncbi-blast-2.12.0+/bin/blastdbcheck

good luck
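It is usually pointed at the database name given to makeblastdb, something like the line below (the database name is a placeholder, and the options are from memory, so double-check against the tool's built-in help):

# run the check against the nucleotide database built with makeblastdb
./ncbi-blast-2.12.0+/bin/blastdbcheck -db crop_genome_db -dbtype nucl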
Reducing the reference size solved the problem of the run hitting a memory limit (the oom-kill event). I think this was because the chromosome sequences were so large that the alignment required too much memory. Using fewer query sequences (< 50) also avoided this error with the original reference file. The challenge I am facing now is the poor alignment.
All the query sequences work, but I have to run them in batches when working with the original reference; otherwise it hits an oom-kill error. With the reduced reference file, however, I can run all 60K query sequences at once.
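The batching itself can be scripted; a rough sketch is below (the batch size, file names and output format are examples, not necessarily what I used):

# split the query FASTA into batches of 50 sequences each
# (assumes the file starts with a ">" header line)
awk -v size=50 '
    /^>/ { if (n % size == 0) { if (out) close(out); out = sprintf("batch_%04d.fa", n / size) } n++ }
    { print > out }
' queries.fasta

# run blastn on each batch against the original database
for f in batch_*.fa; do
    blastn -query "$f" -db crop_genome_db -outfmt 6 -out "${f%.fa}.tsv"
done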
Yes, the regions excluded contain the SNP and the coverage is not 100%.
Each chromosome file is ~1.9 GB.
I tried this and got the report below:

Writing messages to <stdout> at verbosity (Summary)
ISAM testing is ENABLED.
Legacy testing is DISABLED.
TaxID testing is DISABLED.
By default, testing 200 randomly sampled OIDs.
Testing 4 volume(s).
 /../ MetaData: [ERROR] caught exception.
 /../ MetaData: [ERROR] caught exception.
 /../ MetaData: [ERROR] caught exception.
 /../ MetaData: [ERROR] caught exception.
Result=FAILURE. 4 errors reported in 4 volume(s).
Testing 1 alias(es).
Result=SUCCESS. No errors reported for 1 alias(es).
Total errors: 4
Is there an issue with the database?
I'm not sure if there's an issue with the database or if the script just hit an unexpected exception. The BLAST support team might appreciate a bug report, though. Maybe reach out to them with the issue and attach the relevant error output files. They provide an email here.
Okay, thank you for the useful tips.
no problem, good luck!