FASTA-Reader: First data line in seq is about 100% ambiguous nucleotides (shouldn't be over 40%)
10.2 years ago
sfcarroll ▴ 80

I am not sure if this is a problem or if in fact the process is correct. Any help is much appreciated.

I am trying to make BLAST databases from assembly FASTA files, and I am seeing the above error. makeblastdb still generated the database files, but how do I know they are correct?

I followed these steps:

1) Downloaded assembly fasta file archive

site

http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips

file

http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/chromFa.tar.gz

2) Unpacked the file

tar zxvf chromFa.tar.gz

3) Ran makeblastdb

/home/sean/blast/ncbi-blast-2.2.29+/bin/makeblastdb -dbtype nucl -title chr1.fa.blast -in ../chr1.fa -parse_seqids

4) Received an error

Building a new DB, current time: 09/04/2014 13:18:53
New DB name:   ../chr1.fa
New DB title:  chr1.fa.blast
Sequence type: Nucleotide
Deleted existing BLAST database with identical name.
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: First data line in seq is about 100% ambiguous nucleotides (shouldn't be over 40%)
Adding sequences from FASTA; added 1 sequences in 20.1816 seconds.

5) Output files generated

-rw-rw-r-- 1 sean sean  62359693 Sep  4 13:19 chr1.fa.nsq
-rw-rw-r-- 1 sean sean        59 Sep  4 13:19 chr1.fa.nsi
-rw-rw-r-- 1 sean sean        18 Sep  4 13:19 chr1.fa.nsd
-rw-rw-r-- 1 sean sean        36 Sep  4 13:19 chr1.fa.nog
-rw-rw-r-- 1 sean sean        96 Sep  4 13:19 chr1.fa.nin
-rw-rw-r-- 1 sean sean        43 Sep  4 13:19 chr1.fa.nhr

6) The start of the assembly file does contain a lot of N's

➜  hg19  head chr1.fa
>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
10.2 years ago

This is a warning with "Error" prepended rather than an actual error (see the source code here). For something like the human genome, you're going to get this warning simply because the telomeres are hard masked, so the first lines of each chromosome are all N's. The resulting files should work regardless.
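
To check that the database is complete despite the warning, one option is to compare the sequence lengths in the FASTA file with the lengths the database reports. This is a minimal sketch, assuming `blastdbcmd` from the same BLAST+ install is on the PATH and the database is named `../chr1.fa` as in the question. `fasta_lengths` is a hypothetical helper name, not part of BLAST+.

```shell
# Hypothetical helper: print "<id> <length>" for each record in a FASTA file.
fasta_lengths() {
    awk '
        /^>/ { if (id != "") print id, len     # flush the previous record
               id = substr($1, 2); len = 0; next }
        { gsub(/[ \t\r]/, ""); len += length($0) }   # count residues, ignoring whitespace
        END  { if (id != "") print id, len }
    ' "$1"
}

# Only run the comparison when the file and BLAST+ are actually available.
if [ -f ../chr1.fa ] && command -v blastdbcmd >/dev/null 2>&1; then
    echo "Lengths from the FASTA file:"
    fasta_lengths ../chr1.fa
    echo "Lengths from the BLAST database:"
    blastdbcmd -db ../chr1.fa -entry all -outfmt "%a %l"
fi
```

The identifiers may be formatted differently in the two listings; the lengths are what should match. `blastdbcmd -db ../chr1.fa -info` also prints a summary with the number of sequences and total bases.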


Thanks, I thought so, just wanted to sanity check my process. I know the BLAST databases can be downloaded from the NIH, but I am just trying to own the process.


Hi Devon! I have the same issue as sfcarroll, except my sequences don't have a single "n" in them. Should I be concerned about this error?
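
One way to look into a case like this is to measure what share of the first data line of each record is something other than A, C, G, T or U. The sketch below only approximates the idea behind the warning and is not NCBI's implementation; `first_line_ambiguity` and the file name `seqs.fa` are hypothetical. A high value with no N's present might point to other IUPAC ambiguity codes, or to protein sequences being loaded with `-dbtype nucl`.

```shell
# Hypothetical helper: print "<id> <percent>", where percent is the share of
# characters on the first non-empty data line of each record that are not
# A, C, G, T or U (case-insensitive).
first_line_ambiguity() {
    awk '
        /^>/ { id = substr($1, 2); want = 1; next }
        want && NF {
            line = toupper($0)
            gsub(/[ \t\r]/, "", line)            # drop whitespace and carriage returns
            total = length(line)
            unamb = gsub(/[ACGTU]/, "", line)    # gsub returns the number of replacements
            if (total > 0) {
                printf "%s %.0f\n", id, 100 * (total - unamb) / total
                want = 0                         # only the first data line is inspected
            }
        }
    ' "$1"
}

# Example call; seqs.fa is a placeholder for your own file.
if [ -f seqs.fa ]; then
    first_line_ambiguity seqs.fa
fi
```

For the `chr1.fa` file in the question, this would report 100 for `chr1`, which matches the "about 100%" in the warning.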
