Hi, I am trying to find similarities between DNA sequences against a chromosomal sequence with the following command line (version 2.6.0+):
blastn -subject chr7.fasta -query seq.fasta -out out.txt -outfmt "6 qseqid sseqid pident qlen length qstart qend sstart send"
An example of output (after pretreatment):
Accnum Allele Label %Identity Sstart Send
Accnum1 Allele1 LL 100.000 99928 100090
Accnum1 Allele1 LP 100.000 99928 99976
Accnum1 Allele1 LVR 99.796 99928 100416
Accnum1 Allele1 VE 99.661 100083 100377
Accnum1 Allele1 VGU 99.701 100083 100416
Accnum1 Allele1 L 99.661 100083 100090
Accnum1 Allele1 VR 99.652 100091 100377
Accnum1 Allele1 R 100.000 100378 100416
Now, I would like to retrieve the subject sequence for each match, let say between 97 and 100% of identity for VE and VR labels, the %identity of the others must be equal to 100%. I added sseq as option in my outfmt and figure out that it is the aligned sequence as explained in blastn doc (with plenty '-' though).
1) Is there a possible way to get the subject sequence with the blastn command line at the same time?
2) With a command line, is there a way to blast my sequences against a specific chromosome without having it in local?
3) In this process, I used 'awk' and 'cut' to get the output (above), I removed and picked the information I need in the sequences fasta head. After that, I would like to check the Sstart (subject start position) and Send (subject end position) according to specific criteria for labels. For example:
LL(Sstart) = LP(Sstart) = LVR(Sstart)
LVR(Send) = VGU(Send) = R(Send)
L(Send) = LL(Send) = VR(Sstart - 1)
VR(Send) = VE(Send) = R(Sstart + 1) etc.
What is the best way to? Is there a faster way to do that in awk or consider to write a bash script? What about samtools?
Any help and advice will be highly appreciated.
Thanks in advance.
Non that I am aware of, but you can easily get the subject sequence ID so that you can write a script to extract the sequence of that ID from your set of sequences. See
sseqid
.Again, not that I am aware of, but making a database on some downloaded set of sequences in FASTA format is really quick, so you could give that a try first! Even if I assume you already tried and you're asking because of a problem.
I would personally import the table in R, do the transposed matrix of the one you show, with
%Identity Sstart Send
as row names andLL, LP, LVR, ...
as column names, so you can sum up numbers within a single row for each calculation you need.Thank you for your answer.
2)
In fact, I had the chromosomal sequence in local and had no specific problem for that. I was thinking about a semi-automatic bash script for these main steps:
a. retrieve the chromosomal sequence (or any from NCBI)
b. do a blasnt against my fasta sequences
c. filter the results according to my needs
3)
The matrix you mention could also be built in awk by selecting cols 3,4,5 and 6. In fact, some sums concern different rows. For example:
LL(Send) = VR(Sstart - 1).
LL/VR: For LL it is col 6 and VR col 5 (100090 - 1000091 = -1).
In my case, I have to keep col 1 and 2 in order not to loose the accession numbers and genes information if the same gene has different accession numbers. Did you have a specific idea when you suggested R?
What I usually do is downloading the fasta file with all the sequences that fit my query. If you find a way to automatize a query to NCBI nucleotide (or protein)... I don't know one right now but I believe one exists.
No, I'm just more comfortable with R when working with rows and columns.
If you only want the aligned portion, then add
sseq
to your outfmt.Thank you for your answer.
Yes, I wanted the aligned portion. This is why I meantionned "sseq" above.
Using sseq in outfmt is not the solution to my issue: you get the sequence with non-aligned nucleotids and not the complete sequence.
Illustrative example:
atgatgatgat--gatag-c instead of let say atgatgatgatGCgatagTc
I don't understand your example...
What you want looks like the
sseq
to me... How is that different?I gave above a nucleotid sequence example to clarify my need.
Using sseq, you obtain:
The subject sequence expected:
You do not obtain the whole subject sequence aligned, there are '-' chars in its sequence
Ok. Given that perhaps Blast won't put the whole sequence in the output, can't you just integrate it yourself with a line of code using the
sseqid
to read back your sequence from the original file?Well there are '-' chars in the subject sequence whenever there's an insert in the query seq... if you want the full sseq then just remove the '-' chars. You could also take qseq (but then you will have the gap chars whenever there is an insert in the subject seq). Where do you expect your "GC" and "T" to come from?
Or are the different parts separated by '-' supposed to be individual hsps? Then you're right, you wont have the parts aligned to a gap, but then your only option is to chain together the hsps and missing portions with a script.
If I don't remember wrongly, "-" in the
sseq
output field represent both indels and mismatches. Removing them won't work.You remember wrongly. '-' are just indels. Mismatches are not shown in qseq/sseq.
@cschu1981: I thought that "GC" and T" are the nucleotids replacing the "-" chars. I could just give another nt I mean It was just an illustrative example.
OK. So if I understand the real subject sequence length we are expecting to obtain with sseq (in my example above) will be 16 nt after removing the '-' chars, so simply by doing:
Is it right?