Question

Makeblastdb Error

3

Entering edit mode

11.0 years ago

mafrecastigris ▴ 30

I've download a cds (coding sequence) databse from Ensembl in order to make a BLASTN and I am now trying to use makeblastdb of this database using the following command:

makeblastdb -in Ory_cun_cds_5.fasta -input_type fasta -dbtype nucl

And I've been obtaining the following:

Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: First data line in seq is about 63% ambiguous nucleotides (shouldn't be over 40%)

I believe that this is due to the amount of N's in the sequences that I can find by making a quick search, butI I am not sure. Even if this is the case, should I delete this sequences? Does anyone know how to solve this?

makeblastdb blastn error ensembl • 9.7k views

ADD COMMENT • link updated 10.8 years ago by bioinforupesh2009.au ▴ 150 • written 11.0 years ago by mafrecastigris ▴ 30

score 3 · Answer 1 · 2014-03-05

There are two possibilities here:

You transferred the file from a mac to a unix system, or a windows to a unix system or similar, and that screwed up the symbols for line endings - in that case, makeblastdb is errroneously loading your sequence names as actual sequences and thinks that the H, B, D, X, N etc. in the name are ambiguous IUPAC codes (as listed here). In that case, you should fix your line-endings; check if you have "\n", "\r", "\n\r" at the end of your lines, I don't know on what OS you are so I don't know how, using Python I would just do .rstrip("\r\n") on all lines and add a "\n". Linux also has commands like mac2unix in case you know where your file comes from. Also check if you don't have any accidental linebreaks inside your sequence names.
There are actually too many ambiguous codes in one of your sequences, in which case I would remove the sequence in question. If it's so ambiguous, you can't align anything to it - why would you?

But I never had that error so I just tried it with this file and blast 2.2.26:

>first
XBDHXBHDXBHDBNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNnHBHXBHBHD
>second
ATGCNNNNNNNNNNNNNNNNNNNNNNNATGC
>third
ATCGTCGATCGATGCTAGTAGCTAGCATGCTAGCA

with the command makeblastdb -in test.fasta -dbtype nucl -input_type fasta and that worked fine... which makes me think the first possibility is more likely.

Output:

Building a new DB, current time: 03/06/2014 10:35:12
New DB name:   test.fasta
New DB title:  test.fasta
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 3 sequences in 0.123005 seconds.

Ram · Answer 2 · 2014-05-16

I blogged about this problem http://www.acgt.me/blog/2014/5/15/fun-with-an-error-message-from-ncbi-blast

I think the point of this message is to warn you that if your sequence is highly enriched for ambiguous characters (X for proteins, N, R, S etc. for DNA) then you probably won't be able to make much of a useful BLAST database from it. However, the warning gets prefixed with 'Error' to make it sound worse than it is.

It only check the first line of a sequence. So if your sequence has >40% Xs on the first line and then no more ambiguous characters for the rest of the sequence, you will still see this error. The code behind it triggers as long as there is 5 characters on the first line. So if the first line of your sequence just starts with 'XXXSW', you will see the error.

More importantly, you should be independently checking your sequences to see how many ambiguous characters there are. Too many people send sequences to all manner of bioinformatics tools, even if they won't be able to do anything useful because of the high content of Xs or Ns.

score 0 · Answer 3 · 2014-03-06

Second possibility: Now I'm puzzled because if I had to guess the reason why this happens it would be your second possibility! I've checked manually and I do find some sequences why a large amount of N's! I guess I'll try to cp them into a new file and try to makeblastdb with only them and see if it works.

First possibility: I also thought about your first guess but I had excluded it because of two things:

I have another data set of peptides that I downloaded from Ensembl and another from NCBI and that I transferred to Unix in the same way I transferred the cds data set (Download to Windows and transferred to Unix via WindSCP) and they both worked fine.
When I have this problem with scripts that I create in Windows and then transfer to Unix, I open a new file in Unix and cp the content of the script I created in Windows into this files and it usually works fine. I did this with the data set and it doesn't work.

I don't really know Python (very new at this) but I guess I could ask for help to use your suggestion with .rstrip. I don't find any symbols like "\n", "\r", "\n\r" in the end of lines. I do have some extra empty lines between some sequences (between the end of one and the beginning (the > symbol) of the next) but this also happens in the peptide data set so I'm guessing that this isn't the problem. Would add an asterisk also mark the end of the sequence? There is an asterisk in every end of the peptide sequence in my peptide reference data set. I excluded this because I tried to makeblastdb with only the first three sequences of the cds data set and I could do it just fine. .

What I'm thinking: I was thinking about deleting the sequences with 40% or more ambiguous bases (that is, N's) to solve this problem but I will try your suggestions first and see if they work

Thanks a lot for your answers!

Ram · Answer 4 · 2014-06-11

0

Entering edit mode

10.8 years ago

bioinforupesh2009.au ▴ 150

Dear all,

I am still stuck with same problem and not getting any idea to resolve it

I checked back to my seq and i dont have any ambiguous nucleotides such as N, Y, W etc

Second possibility I try to remove \r\n (might be by conversion problem dos2unix or mac2unix)

by dos2unix program

dos2unix -c mac input.fasa #(but not worked same error ??)
dos2unix -n input.fasta output.fasta

also used

tr '\r' '\n' <input> output #(but not worked same error ??)

but the problem is remain same

Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: First data line in seq is about 42% ambiguous nucleotides (shouldn't be over 40%))?????

Please its humble request you all, suggest me something that how to over come this problem.

The seq I downloaded from miRBase mature.

ADD COMMENT • link updated 5.2 years ago by Ram 45k • written 10.8 years ago by bioinforupesh2009.au ▴ 150

1

Entering edit mode

Hello folks,

I got it. it was problem of RNA, its only take T (i.e DNA) instead U (RNA). i just simply replaced U into T and its work !!!

ADD REPLY • link 10.8 years ago by bioinforupesh2009.au ▴ 150

0

Entering edit mode

I encountered the same problem, but with DNA sequences. Error or Warning message is always distracting, but it seems that my file has been masked correctly.

ADD REPLY • link 10.7 years ago by boczniak767 ▴ 870