Question

how to extract blastxed sequences' corresponding amino acid sequences

0

Entering edit mode

9.7 years ago

Kurban ▴ 230

Hello everyone,

I have two fasta files, one of them with a hundred thousand of nucleotide sequences in it, and other one with the same number of amino acid sequences in it.

I made the fasta file with amino acid sequences as a database and blastxed other file.

Here what I want to do is to extract the blastxed nucleotide sequences' corresponding translated amino acid sequences form blastx result file.

So could u suggest me some ways to do that?!

blastx sequence blast • 3.9k views

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Kurban ▴ 230

0

Entering edit mode

Well you could use grep -A 100 URSEQID input.fa function where -A refers to the number of lines after the matched sequence ID or perl call:

perl -lne 'if(/>URSEQID/){$p=1; print $_}elsif($p==1){print $_}elsif(/>/ && p==1){last}' input.fa

or an awk command of you can import everything into a database and use sql queries. there are many ways to do this. The question is which OS are you running on, have your sequence ID's remained the same and could you provide a quick example of your files

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by mxs ▴ 530

0

Entering edit mode

Hi @mxs,

My OS is ubuntu. The code I used is:

blastx -db Triboliumcastaneum_TFs.fasta -query gene.fa -out Triboliumcastaneum_TFs.fasta_blastx_1 -evalue 1e-5 -num_threads 3 -num_alignments 1 -outfmt 1

And the resultant file is:

BLASTX 2.2.28+

Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402.


Database: Triboliumcastaneum_TFs.fasta
           621 sequences; 327,131 total letters


Query= comp80_c0_seq1 len=592 path=[0:0-591]

Length=592

***** No hits found *****


Lambda      K        H        a         alpha
   0.318    0.134    0.401    0.792     4.96

Gapped
Lambda      K        H        a         alpha    sigma
   0.267   0.0410    0.140     1.90     42.6     43.6

Effective search space used: 35663040

Query= comp904_c0_seq1 len=4444 path=[0:0-131 132:132-135 136:136-158
159:159-198 7129:199-199 200:200-223 224:224-925 926:926-949
950:950-973 974:974-2445 2447:2446-2467 2469:2468-4006
4008:4007-4007 4009:4008-4088 4091:4089-4300 4303:4301-4443]

Length=4444
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

  Triboliumcastaneum_TF256                                            1492    0.0  


Query_2  1125  NVSGNPIIKTESIKLPQQRKSARNVAKKVSYVELDSPKPSQKDGEELPKDVPVVGSAEPP  1304
255      4     DMRAK.CASAK.VLPSIT...Q.TAS.Q...T.V...SD.P.-----.D...A..NP..L  58

Query_2  1305  SVTLEDIYYISRENSAGCWIKARLKGVLPPGTLYNGVPSTASLYKVIELKQHKERIVDGK  1484
255      59    V.RIDEM........S.S.......R..Y...V..DN.C.TT...MV.....V..V....  118

Query_2  1485  NIASMKHSRIRLETGTRVIAIFKEIGPTGEIRKTRHDPYYPGIVAEPPVAQNKYRYLIFF  1664
255      119   ........T...........V.........A..I.YE.F...V.................  178

Query_2  1665  DIGYSQYVNYTSVHVVFEASKNVWEDMHVNSRPFIKKYLESPDRPMVKLSKGQTVRVEYN  1844
255      179   ...........N.............................A..........H.......  238

Query_2  1845  GKWVLCTIMNVDCSLVLVFFDSLNRMEWIYRGSTRLSPMFKEEQNAIKRHTGGRTSNRKT  2024
255      239   ......K..E........Y.E.........................T............Q  298

Query_2  2025  DPSESVTIVQYTKNSDFEKPAETIRTVRAVARKSTARPESSSQGGFGFVPFLPPVNNNVL  2204
255      299   G....A................NS...............NTTS...--.....Q...PPV  357
                                                                        \  
                                                                        |  
                                                                        I

Query_2  2205  AMEPTSKIMYFTPRDQVITQRLYRPHTCSPACKDYVRHNLSKLRGHNPLSKPLLCGWARI  2384
255      358   FS....................................S.G..................M  417

Query_2  2385  TQKVKGRKEIVYKAPCSRILRNMAEVHRYLRMTKSEMTVDLFDFNHMVRCLAEFNVECTA  2564
255      418   .L...............Q.....K...Q...V......................S...NP  477

Query_2  2565  DPKDLSKGLEQVPVPVINGINNEMLDFCNYATKRVPMEDVPLNTDPEFLCGCDCEDDCAD  2744
255      478   .............I...................................I....T...S.  537

Query_2  2745  KMKCQCWRLTLEGAKYMGKNVDPNSIGYIYRRLHEQVLTGIYECNARCKCGPTCLNRVVQ  2924
255      538   ....A..Q....................V................S....AA........  597

Query_2  2925  NPMSIKLQVFRTHNRGWGIRCVNDVPQGTFICIYAGTIHTEQMANEDGVTYGDEYFAELD  3104
255      598   ............................................................  657

Query_2  3105  YIENCEKHKEDYESDVVEPDSPRSRSITpeddeeeekekkpdGRRKRYKEVEDNDFCPSY  3284
255      658   ................N.........A.---P.D..E.QR.....R.WRHPQ....T...  715
                                                               \           
                                                               |           
                                                               Q

Query_2  3285  PFVSSDFENSMKLrrrreedrdkerekekeekearikkdekNSVMDGITTEIETVTISDD  3464
255      716   .L..G......R....K..EPKV.EKFF.K.EKIP---------LESV............  766

Query_2  3465  EDDPKEVLSFNPKSGSDLEENDKPKHVSVRTFFGEEEPYIMDTKNAGNIGRFLNHSCSPN  3644
255      767   ..................DD.-..SQ....S.............................  825

Query_2  3645  VFVQNVFVDTHDLRFPWVAFFCSQFIRAGTELTWNYNYDIGSVPGRVLYCHCGSLECKGR  3824
255      826   ............................................................  885

Query_2  3825  LL  3830
255      886   ..  887

Lambda      K        H        a         alpha
   0.318    0.134    0.401    0.792     4.96

Gapped
Lambda      K        H        a         alpha    sigma
   0.267   0.0410    0.140     1.90     42.6     43.6

Effective search space used: 380706976

Query= comp904_c0_seq2 len=1452 path=[4675:0-87 7837:88-89 7839:90-96
6352:97-97 6299:98-1223 5917:1224-1227 5921:1228-1451]

Length=1452

***** No hits found *****


Lambda      K        H        a         alpha
   0.318    0.134    0.401    0.792     4.96

Gapped
Lambda      K        H        a         alpha    sigma
   0.267   0.0410    0.140     1.90     42.6     43.6

Effective search space used: 112619160
 and want I want to extract is  query nucleotide sequences' blastxed corresponding translated amino acid sequences.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Kurban ▴ 230

0

Entering edit mode

Aha. well, that is not what I understood. So you want

Query_2  1125  NVSGNPIIKTESIKLPQQRKSARNVAKKVSYVELDSPKPSQKDGEELPKDVPVVGSAEPP...

in nucleotide mode. and you want exactly the aligned region(1125-3824), right?

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by mxs ▴ 530

0

Entering edit mode

If we could extract aligned region would be good too, but query nucleotide sequences' blastxed corresponding whole translated amino acid sequences could be extracted even better (for the exemplary sequences, not just 1125-3824 part, but whole 1-4444).

Could that be feasible?

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Kurban ▴ 230

Ram · Answer 1 · 2015-03-23

1

Entering edit mode

9.7 years ago

mxs ▴ 530

Ok,

First of all it would be much easier if tabular output format is used. That way only queries with hits would be reported. Next, you just execute this Perl one-liner and this should be it :

perl -lne 'if (/>(.*?)\s+/){(qx(grep "^$1\t" bxout))?($y=1):($y=0)} if($y==1){print $_}' fasta.input.fa > out.file.fa

This is similar to what suggested before, however this extracts all the hits.

fasta.input.fa is a nucleotide fasta file
bxout is a blast output in tabular format
out.file.fa is your output file

What the code does is it extracts fasta entries from your nucleotide file that are reported to have a hit with blast

cheers
mxs

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by mxs ▴ 530

0

Entering edit mode

Thank you @mxs,

Seems like I have bot been clear, I am sorry about that. but I wanted to extract is that those sequences (out.file.fa) corresponding translated amino acid sequences, which are aligned to the sequences in database file.

Like you said before, in this:

Query_2  1125  NVSGNPIIKTESIKLPQQRKSARNVAKKVSYVELDSPKPSQKDGEELPKDVPVVGSAEPP...
...
...
...
Query_2  3645  VFVQNVFVDTHDLRFPWVAFFCSQFIRAGTELTWNYNYDIGSVPGRVLYCHCGSLECKGR  3824
Query_2  3825  LL  3830

not the aligned region(1125-3830), but it's translated amino acid sequences:

1125  NVSGNPIIKTESIKLPQQRKSARNVAKKVSYVELDSPKPSQKDGEELPKDVPVVGSAEPP...
.
.
.
VFVQNVFVDTHDLRFPWVAFFCSQFIRAGTELTWNYNYDIGSVPGRVLYCHCGSLECKGR
LL  3830

If they are in fasta file format, that would be great.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Kurban ▴ 230