Hi everyone :)
I did a genome-guided transcriptome assembly of my human cell lab strain. As "genome" I used the human transcriptome (160987 sequences), which is provided by NCBI. Next, I wanted to blast my resulting lab strain transriptome fasta file (259991 sequences) against the very same NCBI human transcriptome fasta, i used for guiding the transcriptome assembly. Thus, I created a blast database of the NCBI fasta, using makeblastdb of NCBIs blast-2.9.0+. I did two independant blastn rounds, in which I used the option -num_alignments to have a small output, containing only one and three top hits per sequence, respectively.
blastn -query $Path$Query -db $Database -outfmt "6 sallacc" \
-out resultstab.txt -word_size 20 -evalue 0.000000000000001 \
-num_threads 8 -num_alignments 1
Subsequently, I used the "uniq" option, to remove all sequences from the output file, which appear twice or more times.
uniq resultstab.txt >resultstab2.txt
And I extracted the sequences:
blastdbcmd -db $Database -dbtype nucl -entry_batch resultstab2.txt \
-outfmt "%f" -out hitcontigs.fas -line_length 1000000
Now, my output fasta for the 3-top-hits-approach contains 472994 and the fasta for the 1-top-hit-approach contains 145562 fasta sequences. How can the output files contain more sequences than the reference file, I blasted against? What am I overseeing?
Thanks a lot in advance :)
because some (all?) of the input transcripts has more than one hit?
(remember that
-num_alignments
does not mean it only reports one hit)moreover
uniq
will only report as redundant if the complete line is identical, not only the IDs or such) (if you want unique query-hit pairs you will first need to extract column 1 & 2 from the output)Thanks a lot for your fast and helpful answer :) Now as you write it, I remember that I read something about -num_alignments not reporting one hit before I started the run.
So what you´re saying is that I should extract the sequence-ID columns of the "resultstab.txt" table before running uniq, because other values of the hit (e.g. alignment length, score etc.) might differ? And then run the uniq and finally the sequence extraction on that? That makes sense :) Would you use awk for that? Sorry I´m pretty new in the whole field...
BLAST is a local alignment program so it is going to report high-scoring pairs (HSPs), segments of sequences that align well. You may have multiple HSP's being reported to multiple hits so trying to summarize the hits is going to be tricky at best within the constrains (e.g.
num_alignments
) you are using.indeed, and yes,
awk
is an option but you can also docut -f1,2 <blast output file> | uniq
to get the number of unique query-hit pairsKeep in mind that
uniq
works only on previously sorted columns. If a number or string is repeated many times but not-consecutively (not sorted), all occurrences will be kept.absolutely true but the blast output is sorted so that's fine in this case.
but to be absolutely sure your indeed better run
cut -f1,2 <blast output file> | sort | uniq