I have to generate multiple sequence alignment from the BLAST hit amino acid sequences. I did tBLASTn and extracted BLAST hit sequences by using the following commands,
tblastn -query xac.fasta -db t_db -outfmt '6 sseqid qseqid qseq evalue' -out msa.xls
awk 'BEGIN { OFS = "\n"} {print ">" $1 $2, $3}' msa.xls > msa1.fasta
But, the output file msa.xls printed special characters (------- *) in between (BLAST hits) amino acid sequences. If it DNA sequence, I could consider * as N, and go ahead of multiple sequence alignment. But in terms of amino acid sequences, N meant for Asparagine (single letter amino acid code). Therefore, how could I carryout multiple sequence alignment for these ( ---) special characters interrupted amino acid sequences. To get rid of this, I have printed only BLAST hit coordinates of the concerned query in BLAST output by using the following command,
tblastn -query xac.fasta -db t_db -outfmt 6 -max_target_seqs 1 -out type2.xls
But, here I have to extract coordinate based BLAST hits. Samtools can be done for single coordinate extraction, but I have to extract the sequences for more than 50 - 60 coordinates. Please help me to solve the same. Thank you in advance.
Dear herve, Please find few lines of msa.xls here,
PDRGLVMVG-ADMATFDMYLLRSEV-VLLCTGRPEDITLDGWFRLEP*RQQRYATRAELIASTQEVPHAARP*SH
Please use 'code sample' fields while pasting pieces of code to make your post more readable.
Could you had some lines of
msa.xls
in your post pleaseYou do understand
qseq
is for aligned part of query sequence. The hit would besseq
, i.e. aligned part of the subject sequence.So anyway if that's what you want, you can do like this to remove "-" and "*"
If i remove those - and * special characters from the amino acid sequences, it may create negative effects on my further analysis. My intention is to know, what is meant for * special character, Since I know - meant for gaps. Removing those special characters only is the solution, then its not the right one to do. Could you please suggest me any tool or script to extract the multiple BLAST hit coordinates from the query contigs.
You don't need a script, just check from
blastn -help
which fields you want in your output, I'm guessing it's some of these:Dear 5heikki, You have specified BLAST output headers. But, I need to extract the BLAST hit query sequences, based on qstart and qend in the BLAST output, I could use the samtools commands as given below,
But, in samtools, only one set of coordinates can be extracted(qstart to qend). My intention is to extract the query sequences based on the BLAST hit output coordinates (qstart to qend) of multiple BLAST results. Because I have been doing BLAST analysis for single query against multiple subject databases, for example, virulence db, enzymes db etc. Please help me to do the same.
Thank you so much @5heikki. It works perfectly and this is what I exactly need to do. Once again thank you.
Dear 5heikki, I have to specify few parameters/cut-off in the BLAST command line such as,
Is it possible to add the same in the BLAST command line.
Check from
blastn -help
the fields you need and pipe the blast output intoawk
and use it for filteringawk 'BEGIN{OFS=FS="\t"}{if($x>y && $z>w){print $a, $b, $c}}'
where the letters correspond to field/column numbers and your desired cutoff values..