Extract FASTA if two matches
2
0
Entering edit mode
7.1 years ago
samuel ▴ 260

Hi, I have a fasta file and I want to only extract the sequences and the part of the header before and after the '|' (eg P00533 EGFR_HUMAN Epidermal growth factor receptor) if OS=Homo Sapiens. Is this possible quth a quick awk one liner??

    >sp|P00533|EGFR_HUMAN Epidermal growth factor receptor OS=Homo sapiens GN=EGFR PE=1 SV=2
    MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEV
    VLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALA
    VLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDF
    QNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGC
    TGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYV
    VTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFK
    NCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAF
    >sp|P31749|AKT1_HUMAN RAC-alpha serine/threonine-protein kinase OS=Homo sapiens GN=AKT1 PE=1
    MSDVAIVKEGWLHKRGEYIKTWRPRYFLLKNDGTFIGYKERPQDVDQREAPLNNFSVAQC
    QLMKTERPRPNTFIIRCLQWTTVIERTFHVETPEEREEWTTAIQTVADGLKKQEEEEMDF
    RSGSPSDNSGAEEMEVSLAKPKHRVTMNEFEYLKLLGKGTFGKVILVKEKATGRYYAMKI
    >tr|P91634|P91634_DROME PI-3 kinase OS=Drosophila melanogaster GN=Pi3K92E PE=1 SV=1
    MNMMDNRALAYVAHQPKYETPPEEAEPPCMRFSVNLWKNEMLNWVDLICLLPNGFLLELR
    VNPANTIQVIKVEMVNQAKQMPLGYVIKEACEYQVYGISTFNIEPYTDETKRLSEVQPYF
    GILSLGERTDTTSFSSDYELTKMVNGMIGTTFDHNRTHGSPEIDDFRLYMTQTCDNIELE
sequence • 1.8k views
ADD COMMENT
0
Entering edit mode

Thank you very much Kevin. I'm new to bash, I understand from your code how you separate the line and only print either side of the '|' with the 'split(a[1], b, "|"); print ">"b[2]" "b[3]}'. Can you explain what the bprint is doing??

ADD REPLY
0
Entering edit mode

Hey Zoe, could you repost this as a comment to my answer? Just to maintain fluidity of the thread. You can delete this and then re-post above. I have answered your comment already there!

ADD REPLY
0
Entering edit mode

Awesome, thank you Kevin!

ADD REPLY
0
Entering edit mode

Please use ADD COMMENT or ADD REPLY to answer to previous reactions, as such this thread remains logically structured and easy to follow. I have now moved your post but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted. Upvote|Bookmark|Accept

ADD REPLY
0
Entering edit mode
7.1 years ago

You can try this (assumes your data is in 'test.fasta'):

awk '{if (/^>/ && /OS=Homo sapiens/) {bPrint=1; split($0, a, " OS="); split(a[1], b, "|"); print ">"b[2]" "b[3]}; if (/^>/ && !/OS=Homo sapiens/) {bPrint=0}; if (!/^>/ && bPrint==1) print $0}' test.fasta

>P00533 EGFR_HUMAN Epidermal growth factor receptor
MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEV
VLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALA
VLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDF
QNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGC
TGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYV
VTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFK
NCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAF
>P31749 AKT1_HUMAN RAC-alpha serine/threonine-protein kinase
MSDVAIVKEGWLHKRGEYIKTWRPRYFLLKNDGTFIGYKERPQDVDQREAPLNNFSVAQC
QLMKTERPRPNTFIIRCLQWTTVIERTFHVETPEEREEWTTAIQTVADGLKKQEEEEMDF
RSGSPSDNSGAEEMEVSLAKPKHRVTMNEFEYLKLLGKGTFGKVILVKEKATGRYYAMKI
ADD COMMENT
0
Entering edit mode

Hey Zoe, bPrint is just a boolean flag that I created myself.

I'll break the command into pseudo code:


Matching header

if (/^>/ && /OS=Homo sapiens/) {bPrint=1; split($0, a, " OS="); split(a[1], b, "|"); print ">"b[2]" "b[3]};

If the line begins with '>' AND contains 'OS=Homo sapiens', set bPrint to 'TRUE' and then split/print the header out


Non-matching header

if (/^>/ && !/OS=Homo sapiens/) {bPrint=0};

If the line begins with '>' AND does NOT contain 'OS=Homo sapiens', set bPrint to 'FALSE'


Line of sequence following a matching header

if (!/^>/ && bPrint==1) print $0}'

If the line does NOT begin with '>' AND bPrint is 'TRUE', print the line


If there are empty lines, it will just print them.

ADD REPLY
0
Entering edit mode
7.1 years ago

with seqkit and sed:

$ seqkit grep -nrp  "OS=Homo sapiens" test.fa | sed '/^>/ s/.*|\(P.*\)OS.*/>\1/g'

>P00533|EGFR_HUMAN Epidermal growth factor receptor 
MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEV
VLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALA
VLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDF
QNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGC
TGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYV
VTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFK
NCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAF
>P31749|AKT1_HUMAN RAC-alpha serine/threonine-protein kinase 
MSDVAIVKEGWLHKRGEYIKTWRPRYFLLKNDGTFIGYKERPQDVDQREAPLNNFSVAQC
QLMKTERPRPNTFIIRCLQWTTVIERTFHVETPEEREEWTTAIQTVADGLKKQEEEEMDF
RSGSPSDNSGAEEMEVSLAKPKHRVTMNEFEYLKLLGKGTFGKVILVKEKATGRYYAMKI
ADD COMMENT

Login before adding your answer.

Traffic: 1946 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6