Puzzling NCBI BLAST+ output after transcriptome assembly
1
0
Entering edit mode
3.8 years ago
ella • 0

Hi everyone :)

I did a genome-guided transcriptome assembly of my human cell lab strain. As "genome" I used the human transcriptome (160987 sequences), which is provided by NCBI. Next, I wanted to blast my resulting lab strain transriptome fasta file (259991 sequences) against the very same NCBI human transcriptome fasta, i used for guiding the transcriptome assembly. Thus, I created a blast database of the NCBI fasta, using makeblastdb of NCBIs blast-2.9.0+. I did two independant blastn rounds, in which I used the option -num_alignments to have a small output, containing only one and three top hits per sequence, respectively.

blastn -query $Path$Query -db $Database -outfmt "6 sallacc" \
  -out resultstab.txt -word_size 20 -evalue 0.000000000000001 \
  -num_threads 8 -num_alignments 1

Subsequently, I used the "uniq" option, to remove all sequences from the output file, which appear twice or more times.

uniq resultstab.txt >resultstab2.txt

And I extracted the sequences:

blastdbcmd -db $Database -dbtype nucl -entry_batch resultstab2.txt \
  -outfmt "%f" -out hitcontigs.fas -line_length 1000000

Now, my output fasta for the 3-top-hits-approach contains 472994 and the fasta for the 1-top-hit-approach contains 145562 fasta sequences. How can the output files contain more sequences than the reference file, I blasted against? What am I overseeing?

Thanks a lot in advance :)

blast ncbi fasta transcriptome trinity • 1.1k views
ADD COMMENT
2
Entering edit mode

because some (all?) of the input transcripts has more than one hit?

(remember that -num_alignments does not mean it only reports one hit)

moreover uniq will only report as redundant if the complete line is identical, not only the IDs or such) (if you want unique query-hit pairs you will first need to extract column 1 & 2 from the output)

ADD REPLY
0
Entering edit mode

Thanks a lot for your fast and helpful answer :) Now as you write it, I remember that I read something about -num_alignments not reporting one hit before I started the run.

So what you´re saying is that I should extract the sequence-ID columns of the "resultstab.txt" table before running uniq, because other values of the hit (e.g. alignment length, score etc.) might differ? And then run the uniq and finally the sequence extraction on that? That makes sense :) Would you use awk for that? Sorry I´m pretty new in the whole field...

ADD REPLY
0
Entering edit mode

BLAST is a local alignment program so it is going to report high-scoring pairs (HSPs), segments of sequences that align well. You may have multiple HSP's being reported to multiple hits so trying to summarize the hits is going to be tricky at best within the constrains (e.g. num_alignments) you are using.

ADD REPLY
0
Entering edit mode

indeed, and yes, awk is an option but you can also do cut -f1,2 <blast output file> | uniq to get the number of unique query-hit pairs

ADD REPLY
0
Entering edit mode

Keep in mind that uniq works only on previously sorted columns. If a number or string is repeated many times but not-consecutively (not sorted), all occurrences will be kept.

ADD REPLY
0
Entering edit mode

absolutely true but the blast output is sorted so that's fine in this case.

but to be absolutely sure your indeed better run cut -f1,2 <blast output file> | sort | uniq

ADD REPLY
1
Entering edit mode
3.8 years ago
GenoMax 148k

Assuming your assembly has worked well you may want to try CD-HIT-2D (LINK) to assess redundancy between your transcript assembly and the one that you downloaded from NCBI). You may find that you have a lot of duplicates in your assembly.

ADD COMMENT
0
Entering edit mode

Cool, thanks for pointing to that! I will definitely try it :)

ADD REPLY

Login before adding your answer.

Traffic: 2483 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6