Wow, my first post on BioStars. I have been browsing through the archives for a while now, but the time has come where I can't find the answer I need and I have to actually ASK. Thanks in advance for all your advice!
So I am trying to compare 4 amino acid fasta files (.faa) to find the common set of proteins. I have a local database with all the protein sequences which I use to retrieve the sequences and convert the BLAST output into fasta format. Here's what I have so far:
Step 1.BLAST File_1 and File_2, outputting to a new file "hits1"
blastp -query file_2.faa -subject file_1.faa -outfmt "6 sseqid" -evalue 1e-20 >hits1
Step 2. I then take the output file, which contains only the IDs and convert it to fasta format.
blastdbcmd -db local_db -dbtype prot -entry_batch hits1 -outfmt %f -out results1.faa
Step 3. Then I can take this file and compare it to file_3. Repeat steps 2 and 3 until all the files are compared...
blastp -query file_2.faa -subject results1.fasta -outfmt "6 sseqid" -evalue 1e-20 >hits1
The problem I am having is that with each round of BLAST, duplicates of each ID are created. For example in file_1 (and in the database) the gene ID "003375649" is only present 1x. But the hits1 file (after running the first round of BLAST) contains this ID 2x. Then on the second round of BLAST (hits2) the output contains this ID 4x. This means that although the original file contains only 3000 protein sequences, the hits 2 file already contains >42,000 sequences and the BLAST takes for.ev.er to run! I want to know how to solve this problem but more importantly I am interested in knowing why the program is creating duplicates for my general knowledge.
Here's what I have tried: a. making more stringent e-values (1e-30), this does not help b. creating local databases from the .faa files themselves rather than one larger one and running blast like so:
makeblastdb -in file_1.faa -dbtype prot -out file_1_db -parse_seqids
blastp -query file_2.faa -db file_1_db -outfmt "6 sseqid" -evalue 1e-20 >results1.faa
this gives the exact same results as this first method. Soooo why the duplicates?
Have you tried adding "-max_target_seqs 1" to the blastp command?
A general remark on this: Do not use Blast tabular format as a primary output format of blast but use outfmt 11 (ASN.1) for long running jobs. This might be subject to debate whether this is such a strict rule, but at least for long running blast jobs, tabular format is an absolute NoGo. Tabular is lossy and you can convert ASN.1 into any other format using blast_formatter including tabular if you need.