Question

FASTA-Reader: First data line in seq is about 100% ambiguous nucleotides (shouldn't be over 40%)

10.6 years ago

sfcarroll ▴ 80

I am not sure if this is a problem or if in fact the process is correct. Any help is much appreciated.

I am trying to make BLAST databases from assembly FASTA files and am seeing the above error. It generated BLAST database files, but how do I know they are correct?

I followed these steps:

1) Downloaded assembly fasta file archive

site

http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips

file

http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/chromFa.tar.gz

2) Unpacked the file

tar zxvf chromFa.tar.gz

3) Ran `makeblastdb`

/home/sean/blast/ncbi-blast-2.2.29+/bin/makeblastdb -dbtype nucl -title chr1.fa.blast -in ../chr1.fa -parse_seqids

4) Received an error

Building a new DB, current time: 09/04/2014 13:18:53
New DB name:   ../chr1.fa
New DB title:  chr1.fa.blast
Sequence type: Nucleotide
Deleted existing BLAST database with identical name.
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: First data line in seq is about 100% ambiguous nucleotides (shouldn't be over 40%)
Adding sequences from FASTA; added 1 sequences in 20.1816 seconds.

5) Output files generated

-rw-rw-r-- 1 sean sean  62359693 Sep  4 13:19 chr1.fa.nsq
-rw-rw-r-- 1 sean sean        59 Sep  4 13:19 chr1.fa.nsi
-rw-rw-r-- 1 sean sean        18 Sep  4 13:19 chr1.fa.nsd
-rw-rw-r-- 1 sean sean        36 Sep  4 13:19 chr1.fa.nog
-rw-rw-r-- 1 sean sean        96 Sep  4 13:19 chr1.fa.nin
-rw-rw-r-- 1 sean sean        43 Sep  4 13:19 chr1.fa.nhr

6) The start of the assembly file does contain a lot of N's

➜  hg19  head chr1.fa
>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
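
As a rough way to see what the reader is complaining about, the sketch below computes the percentage of non-ACGT characters on the first data line of a FASTA file. The function name `first_line_ambiguity` is hypothetical, and treating every character other than A, C, G, or T as ambiguous is an assumption; the actual check inside BLAST may count differently.

```shell
# Hypothetical helper: print the percentage (truncated to an integer) of
# characters on the first non-header, non-empty line that are not A/C/G/T.
# Assumption: anything outside ACGT (either case) counts as ambiguous.
first_line_ambiguity() {
  awk '!/^>/ && length($0) > 0 {
         total = length($0)                 # line length before editing
         acgt  = gsub(/[ACGTacgt]/, "")     # gsub returns how many it replaced
         printf "%d\n", (total - acgt) * 100 / total
         exit                               # only the first data line matters
       }' "$1"
}

# Example call (path taken from the question):
#   first_line_ambiguity ../chr1.fa
```

A first data line made up entirely of N's, like the one shown above, gives 100, which is consistent with the "about 100%" in the message.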

updated 3.3 years ago by Ram 45k • written 10.6 years ago by sfcarroll ▴ 80

Ram · Accepted Answer · 2014-09-04

10.6 years ago

Devon Ryan 105k

This is really a warning with "Error" prepended rather than an actual error (see the source code here). For something like the human genome, you will get this warning simply because the telomeres are hard-masked, so the sequence starts with a long run of N's. The resulting files should work regardless.
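
One way to sanity-check the result, sketched here under stated assumptions, is to count the sequences and bases in the input FASTA and compare them with what the database reports. The helper names `fasta_seq_count` and `fasta_total_length` are hypothetical, and the sketch assumes Unix line endings. `blastdbcmd` ships with BLAST+, and its `-info` option prints a summary that includes the number of sequences and total bases.

```shell
# Hypothetical helpers for summarising the input FASTA.

# Number of sequences = number of header lines starting with ">".
fasta_seq_count() {
  awk '/^>/ { n++ } END { print n + 0 }' "$1"
}

# Total residues = summed length of all non-header lines.
# Assumption: Unix line endings (a trailing \r would be counted).
fasta_total_length() {
  awk '!/^>/ { n += length($0) } END { print n + 0 }' "$1"
}

# Example comparison (paths taken from the question; requires BLAST+ on PATH):
#   fasta_seq_count ../chr1.fa
#   fasta_total_length ../chr1.fa
#   blastdbcmd -db ../chr1.fa -info
```

If the counts from the FASTA match the summary from `blastdbcmd`, and the sequence count agrees with the "added 1 sequences" line in the `makeblastdb` log, the database very likely contains what was intended.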

Thanks, I thought so, just wanted to sanity check my process. I know the BLAST databases can be downloaded from the NIH, but I am just trying to own the process.

updated 3.3 years ago by Ram 45k • written 10.6 years ago by sfcarroll ▴ 80

Hi Devon! I have the same issue as sfcarroll, except my sequences don't have a single "n" in them. Should I be concerned about this error?

8.3 years ago by catarina.fa • 0
