Extract motif positions from a FASTA file
6.7 years ago
Jason ▴ 10

Use shell command or python

Suppose I have two files. The first file contains more than 100 FASTA sequences. The second file contains a list of motifs, one per sequence ID. I want to find the positions that each motif occupies in its sequence and save them in a third text file (File 3).

For example:

File 1:

>sp|P26140|3BHS2_MOUSE 3 beta-hydroxysteroid dehydrogenase/Delta 5-->4-isomerase type 2 OS=Mus musculus GN=Hsd3b2 PE=1 SV=4

MPGWSCLVTGAGGFLGQRIIQLLVQEEDLEEIRVLDKVFRPETRKEFFNLETSIKVTVLE

GDILDTQYLRRACQGISVVIHTAAIIDVTGVIPRQTILDVNLKGTQNLLEACIQASVPAF

IFSSSVDVAGPNSYKEIVLNGHEEECHESTWSDPYPYSKKMAEKAVLAANGSMLKNGGTL

QTCALRPMCIYGERSPLISNIIIMALKHKGILRSFGKFNTANPVYVGNVAWAHILAARGL

RDPKKSPNIQGEFYYISDDTPHQSFDDISYTLSKEWGFCLDSSWSLPVPLLYWLAFLLET

VSFLLSPIYRYIPPFNRHLVTLSGSTFTFSYKKAQRDLGYEPLVSWEEAKQKTSEWIGTL

VEQHRETLDTKSQ

>sp|P35730|ODBB_RAT 2-oxoisovalerate dehydrogenase subunit beta, mitochondrial OS=Rattus norvegicus GN=Bckdhb PE=1 SV=3

MAAVAARAGGLLRLGAAGAERRRRGLRCAALVQGFLQPAVDDASQKRRVAHFTFQPDPES

LQYGQTQKMNLFQSITSALDNSLAKDPTAVIFGEDVAFGGVFRCTVGLRDKYGKDRVFNT

PLCEQGIVGFGIGIAVTGATAIAEIQFADYIFPAFDQIVNEAAKYRYRSGDLFNCGSLTI

RAPWGCVGHGALYHSQSPEAFFAHCPGIKVVIPRSPFQAKGLLLSCIEDKNPCIFFEPKI

LYRAAVEQVPVEPYKIPLSQAEVIQEGSDVTLVAWGTQVHVIREVASMAQEKLGVSCEVI

DLRTIVPWDVDTVCKSVIKTGRLLISHEAPLTGGFASEISSTVQEECFLNLEAPISRVCG

YDTPFPHIFEPFYIPDKWKCYDALRKMINY

File 2:

Motif

P26140      MPGWSC

P35730     AERRRRGLRCAAL

File 3

Result:

P26140       1,2,3,4,5,6

P35730      19,20,21,22,23,24,25,26,27,28,29,30,31

If this is not an assignment, you can use fuzzpro from EMBOSS. The output below is for the second example above.

########################################
# Program: fuzzpro
# Rundate: Tue 13 Mar 2018 00:02:50
# Commandline: fuzzpro
#    -auto
#    -sequence /var/lib/emboss-explorer/output/289888/.sequence
#    -pattern AERRRRGLRCAAL
#    -outfile outfile
#    -rformat2 seqtable
# Report_format: seqtable
# Report_file: outfile
########################################

#=======================================
#
# Sequence: ODBB_RAT     from: 1   to: 390
# HitCount: 1
#
# Pattern_name Mismatch Pattern
# pattern             0 AERRRRGLRCAAL
#
#=======================================

  Start     End Pattern               Mismatch Sequence
     19      31 pattern:AERRRRGLRCAAL        . AERRRRGLRCAAL

#---------------------------------------
#---------------------------------------

#---------------------------------------
# Total_sequences: 1
# Total_length: 390
# Reported_sequences: 1
# Reported_hitcount: 1
#---------------------------------------

I want to search many patterns against a list of sequences. This software reads only one pattern per run.

Thank you.


If you want python, look into regex.findall()
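A minimal Python sketch of the regex approach suggested above. It uses re.finditer() from the standard library rather than findall(), since match objects carry the start offset while findall() returns only the matched text. The function names (read_fasta, accession, motif_positions, locate) are illustrative. It assumes UniProt-style headers (sp|ACCESSION|NAME) and a motif file with whitespace-separated "ID motif" lines, as in the question.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def accession(header):
    """Return the accession from an 'sp|P26140|3BHS2_MOUSE ...' style header."""
    parts = header.split("|")
    return parts[1] if len(parts) >= 3 else header.split()[0]


def motif_positions(seq, motif):
    """Return sorted 1-based positions covered by any occurrence of motif."""
    positions = set()
    # A lookahead lets finditer report overlapping occurrences too.
    pattern = "(?=" + re.escape(motif) + ")"
    for match in re.finditer(pattern, seq):
        start = match.start() + 1  # convert 0-based offset to 1-based
        positions.update(range(start, start + len(motif)))
    return sorted(positions)


def locate(fasta_lines, motif_lines):
    """Return a list of (accession, motif, positions) tuples."""
    seqs = {accession(h): s for h, s in read_fasta(fasta_lines)}
    results = []
    for line in motif_lines:
        fields = line.split()
        # Skip the "Motif" header, blank lines and IDs missing from the FASTA.
        if len(fields) != 2 or fields[0] not in seqs:
            continue
        acc, motif = fields
        results.append((acc, motif, motif_positions(seqs[acc], motif)))
    return results
```

To produce File 3, one could open both input files, pass the file objects to locate(), and write each result as the accession, a tab, and ",".join(map(str, positions)). The motif is escaped with re.escape(), so it is matched literally; drop the escape only if the motifs are meant to be regular expressions.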


This may be a good time to learn regex, my friend. If this is for an assignment then I think you will learn the most that way: https://regexr.com/

6.7 years ago

Motifs (tab separated):

$ cat pat.txt 
P26140  MPGWSC
P35730  AERRRRGLRCAAL

Output using seqkit and csvtk (csvtk is used only to format the output as an aligned table):

$ seqkit tab2fx pat.txt | seqkit locate -f - test.fa | csvtk -t pretty
seqID                   patternName   pattern         strand   start   end   matched
sp|P26140|3BHS2_MOUSE   P26140        MPGWSC          +        1       6     MPGWSC
sp|P35730|ODBB_RAT      P35730        AERRRRGLRCAAL   +        19      31    AERRRRGLRCAAL

Input:

$ cat test.fa 
>sp|P26140|3BHS2_MOUSE 3 beta-hydroxysteroid dehydrogenase/Delta 5-->4-isomerase type 2 OS=Mus musculus GN=Hsd3b2 PE=1 SV=4
MPGWSCLVTGAGGFLGQRIIQLLVQEEDLEEIRVLDKVFRPETRKEFFNLETSIKVTVLE
GDILDTQYLRRACQGISVVIHTAAIIDVTGVIPRQTILDVNLKGTQNLLEACIQASVPAF
IFSSSVDVAGPNSYKEIVLNGHEEECHESTWSDPYPYSKKMAEKAVLAANGSMLKNGGTL
QTCALRPMCIYGERSPLISNIIIMALKHKGILRSFGKFNTANPVYVGNVAWAHILAARGL
RDPKKSPNIQGEFYYISDDTPHQSFDDISYTLSKEWGFCLDSSWSLPVPLLYWLAFLLET
VSFLLSPIYRYIPPFNRHLVTLSGSTFTFSYKKAQRDLGYEPLVSWEEAKQKTSEWIGTL
VEQHRETLDTKSQ
>sp|P35730|ODBB_RAT 2-oxoisovalerate dehydrogenase subunit beta, mitochondrial OS=Rattus norvegicus GN=Bckdhb PE=1 SV=3
MAAVAARAGGLLRLGAAGAERRRRGLRCAALVQGFLQPAVDDASQKRRVAHFTFQPDPES
LQYGQTQKMNLFQSITSALDNSLAKDPTAVIFGEDVAFGGVFRCTVGLRDKYGKDRVFNT
PLCEQGIVGFGIGIAVTGATAIAEIQFADYIFPAFDQIVNEAAKYRYRSGDLFNCGSLTI
RAPWGCVGHGALYHSQSPEAFFAHCPGIKVVIPRSPFQAKGLLLSCIEDKNPCIFFEPKI
LYRAAVEQVPVEPYKIPLSQAEVIQEGSDVTLVAWGTQVHVIREVASMAQEKLGVSCEVI
DLRTIVPWDVDTVCKSVIKTGRLLISHEAPLTGGFASEISSTVQEECFLNLEAPISRVCG
YDTPFPHIFEPFYIPDKWKCYDALRKMINY
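The seqkit table reports only the start and end of each hit, while the question asks for every position in between. A small Python sketch that expands each hit into a comma-separated list is shown here. It assumes the raw tab-separated seqkit locate output (without the csvtk pretty step) with the column names shown in the table above; the function name expand_hits and the file name hits.tsv are hypothetical.

```python
import csv
import io


def expand_hits(tsv_text):
    """Turn seqkit locate TSV rows into 'patternName<TAB>pos,pos,...' lines."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lines = []
    for row in reader:
        start, end = int(row["start"]), int(row["end"])
        # start and end are 1-based and inclusive in the table above
        positions = ",".join(str(i) for i in range(start, end + 1))
        lines.append(f"{row['patternName']}\t{positions}")
    return lines


# Example use, assuming the hits were saved first with:
#   seqkit tab2fx pat.txt | seqkit locate -f - test.fa > hits.tsv
#
# with open("hits.tsv") as fh:
#     print("\n".join(expand_hits(fh.read())))
```

If a motif occurs more than once in a sequence, this prints one line per hit rather than merging them.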