Hi everyone,
This is my first question on Biostars, and I hope I can get some help with this issue.
I have two files:
File A: a FASTA file of protein sequences
Example for File A:
>SEN4356A-thrL-missing_gene_synonym_qualifer-CAR35910.1-threonine operon leader peptide (artificial fragment)-1:4685848 Forward
MNRISTTTITTITITTGNGAG
>SEN0001-thrA-missing_gene_synonym_qualifer-CAR31592.1-aspartokinase I/homoserine dehydrogenase I-101:2563 Forward
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTIGGQDA
LPNISDAERIFSDLLAGLASAQPGFPLARLKMVVEQEFAQIKHVLHGISLLGQCPDSINA
ALICRGEKMSIAIMAGLLEARGHRVTVIDPVEKLLAVGHYLESTVDIAESTRRIAASQIP
ADHMILMAGFTAGNEKGELVVLGRNGSDYSAAVLAACLRADCCEIWTDVDGVYTCDPRQV
PDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQFQIPCLIKNTGNPQAPGTLIGASSD
DDNLPVKGISNLNNMAMFSVSGPGMKGMIGMAARVFAAMSRAGISVVLITQSSSEYSISF
CVPQSDCARARRAMQDEFYLELKEGLLEPLAVTERLAIISVVGDGMRTLRGISAKFFAAL
ARANINIVAIAQGSSERSISVVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGVGGAL
LEQLKRQQTWLKNKHIDLRVCGVANSKALLTNVHGLNLDNWQAELAQANAPFNLGRLIRL
VKEYHLLNPVIVDCTSSQAVADQYADFLREGFHVVTPNKKANTSSMDYYHQLRFAAAQSR
RKFLYDTNVGAGLPVIENLQNLLNAGDELQKFSGILSGSLSFIFGKLEEGMSLSQATALA
REMGYTEPDPRDDLSGMDVARKLLILARETGRELELSDIVIEPVLPDEFDASGDVTAFMA
HLPQLDDAFAARVAKARDEGKVLRYVGNIEEDGVCRVKIAEVDGNDPLFKVKNGENALAF
YSHYYQPLPLVLRGYGAGNDVTAAGVFADLLRTLSWKLGV
>SEN0002-thrB-missing_gene_synonym_qualifer-CAR31593.1-homoserine kinase-2565:3494 Forward
MVKVYAPASSANMSVGFDVLGAAVTPVDGTLLGDVVSVEAADHFRLHNLGRFADKLPPEP
RENIVYQCWERFCQALGKTIPVAMTLEKNMPIGSGLGSSACSVVAALVAMNEHCGKPLND
TRLLALMGELEGRISGSIHYDNVAPCFLGGMQLMIEENGIISQQVPGFDEWLWVLAYPGI
KVSTAEARAILPAQYRRQDCIAHGRHLAGFIHACYSRQPQLAAALMKDVIAEPYRARLLP
GFSQARQAVSEIGALASGISGSGPTLFALCDKPETAQRVADWLSKHYLQNQEGFVHICRL
DTAGARVVG
>SEN0003-thrC-missing_gene_synonym_qualifer-CAR31594.1-threonine synthase-3498:4784 Forward
MKLYNLKDHNEQVSFAQAVTQGLGKQQGLFFPHELPEFSLTEIDEMLNQDFVSRSAKILS
AFIGDEIPQQILEERVRAAFAFPAPVAQVESDVGCLELFHGPTLAFKDFGGRFMAQMLTH
ISGDKPVTILTATSGDTGAAVAHAFYGLENVRVVILYPRGKISPLQEKLFCTLGGNIETV
AIDGDFDACQALVKQAFDDEELKTALGLNSANSINISRLLAQICYYFEAVAQLPQGARNQ
LVISVPSGNFGDLTAGLLAKSLGLPVKRFIAATNINDTVPRFLQDGKWAPKATQATLSNA
MDVSQPNNWPRVEELFRRKIWRLTELGYAAVDDTTTQQTMRELKAKGYISEPHAAVAYRA
LRDQLNPGEYGLFLGTAHPAKFKESVESILGETLALPEALAERADLPLLSHHLPADFAAL
RKLMMTRQ
>SEN0004-yaaA-missing_gene_synonym_qualifer-CAR31595.1-conserved hypothetical protein-4878:5651 Reverse
MLILISPAKTLDYQSPLATTRYTQPELLDHSQQLIQQARQLSAPQISRLMGISDKLADLN
ATRFHDWQPHFTPDNARQAILAFKGDVYTGLQAETFNDADFDFAQQHLRMLSGLYGVLRP
LDLMQPYRLEMGIRLENPRGKDLYQFWGDIITDKLNEALEAQGDRVVVNLASEEYFKSVK
PKKLNAELIKPVFLDEKNGKFKVVSFYAKKARGLMSRFIIENRLTKPEQLTAFDREGYFF
DEETSTQDELVFKRYEQ
>SEN0005-yaaJ-missing_gene_synonym_qualifer-CAR31596.1-putative amino-acid transport protein-5730:7160 Reverse
MPEFFSFINEILWGSVMIYLLLGAGCWFTWRTGFIQFRYIRQFSRSLKGSLSPQPGGLTS
FQALCTSLAARIGSGNLAGVALAIAAGGPGAVFWMWVSAIIGMATSFAECSLAQLYKERD
PTGQFRGGPAWYMARGLGMRWMGVVFALFLLVAYGLIFNSVQANAVSRALHFAFNIPPLI
SGIALAFCALLIIIRGIKGVARLMQWLIPIIALLWVAGSVFICLWHIEQMPGVIASIVKS
AFGWQEAAAGAAGYTLTQAITSGFQRGMFSNEAGMGSTPNAAAAATSYPPHPVAQGIVQM
IGVFSDTIIICTASAMIILLAGNHASHSSTEGIQLLQHAMVSLTGEWGASFVALIVILFA
FSSIVANYIYAENNLFFLRLHNAKAIWLLRLATLGMVIAGTLISFPLIWQLADMIMACMA
ITNLTAILLLSPVVYTLAGDYLRQRKLGVRPQFDPRRFPDIEPQLAPDTWDAASRD
>SEN0006-talB-missing_gene_synonym_qualifer-CAR31597.1-transaldolase B-7429:8382 Forward
MTDKLTSLRQFTTVVADTGDIAAMKLYQPQDATTNPSLILNAAQIPEYRKLIDDAVAWAK
QQSSDRAQQVVDATDKLAVNIGLEILKLVPGRISTEVDARLSYDTEASIAKAKRIIKLYN
DAGISNDRILIKLASTWQGIRAAEQLEKEGINCNLTLLFSFAQARACAEAGVYLISPFVG
RILDWYKANTDKKDYAPAEDPGVVSVTEIYEYYKQHGYETVVMGASFRNVGEILELAGCD
RLTIAPALLKELAESEGAIERKLSFSGEVKARPERITEAEFLWQHHQDPMAVDKLADGIR
KFAVDQEKLEKMIGDLL
>SEN0007-mog-missing_gene_synonym_qualifer-CAR31598.1-molybdopterin biosynthesis Mog protein-8493:9083 Forward
MDTLRIGLVSISDRASSGVYQDKGIPALEEWLASALTTPFEVQRRLIPDEQEIIEQTLCE
LVDEMSCHLVLTTGGTGPARRDVTPDATLAIADREMPGFGEQMRQISLRFVPTAILSRQV
GVIRKQALILNLPGQPKSIKETLEGVKADDGSVSVPGIFASVPYCIQLLDGPYVETAPEV
VAAFRPKSARRENMSS
>SEN0008-yaaH-missing_gene_synonym_qualifer-CAR31599.1-integral membrane protein-9140:9706 Reverse
MGNTKLANPAPLGLMGFGMTTILLNLHNAGFFALDGIILAMGIFYGGIAQIFAGLLEYKK
GNTFGLTAFTSYGSFWLTLVAILLMPKMGLTDAPDAQLLGAYLGLWGVFTLFMFFGTLKA
ARALQFVFLSLTVLFALLAVGNITGNEAIIHIAGWVGLVCGASAIYLAMGEVLNEQFGRT
ILPIGEAH
>SEN0009-htgA-missing_gene_synonym_qualifer-CAR31600.1-conserved hypothetical protein-9856:10569 Reverse
MNVTYLHDEDLDFLQHCSEEQLADFARLLTHNEKGKARLSSVLSHNELFKAMEGHPEQHR
RNWQLIAGEFQHYGGDSIANKLRGHGKQYRAILLDVAKRLKLKADKSMSTFEIEQQLLEH
FLRHTWQKMDAAHKQEFLQAVDAKVSELEELLPLLMKDRSLAKGVSHLLSTQLTRILRTH
AAMSILGHGLLRGAGLGGPVGAALNGVKAMSGSAYRVTIPAVLQIACLRRMMAAVQA
>SEN0010-yaaI-missing_gene_synonym_qualifer-CAR31601.1-possible exported protein-10605:11009 Reverse
MRSVLTISAGLLFGLALSSVAHANDHKILGVIAMPRNETNDLALKIPVCRIVKRIQLTAD
HGDIELSGASVYFKTARSASQSLNVPSSIKEGQTTGWININSDNDNKRCVSKITFSGHTV
NSSDMARLKVIGDD
File B: a list of protein IDs (without their sequences)
Example for File B:
Protein IDs
SEN0002-thrB-missing_gene_synonym_qualifer-CAR31593.1-homoserine kinase-2565:3494 Forward
SEN0003-thrC-missing_gene_synonym_qualifer-CAR31594.1-threonine synthase-3498:4784 Forward
SEN0004-yaaA-missing_gene_synonym_qualifer-CAR31595.1-conserved hypothetical protein-4878:5651 Reverse
SEN0006-talB-missing_gene_synonym_qualifer-CAR31597.1-transaldolase B-7429:8382 Forward
SEN0007-mog-missing_gene_synonym_qualifer-CAR31598.1-molybdopterin biosynthesis Mog protein-8493:9083 Forward
SEN0011-dnaK-missing_gene_synonym_qualifer-CAR31602.1-DnaK protein (heat shock protein 70)-11358:13274 Forward
SEN0012-dnaJ-missing_gene_synonym_qualifer-CAR31603.1-DnaJ protein-13360:14499 Forward
SEN0043-rpsT-missing_gene_synonym_qualifer-CAR31634.1-30S ribosomal protein S20-52034:52297 Reverse
SEN0046-ileS-missing_gene_synonym_qualifer-CAR31637.1-isoleucyl-tRNA synthetase-53609:56443 Forward
SEN0048-slpA-missing_gene_synonym_qualifer-CAR31639.1-probable FkbB-type 16 kD peptidyl-prolyl cis-trans isomerase-57098:57547 Forward
SEN0065-dapB-missing_gene_synonym_qualifer-CAR31655.1-dihydrodipicolinate reductase-73766:74587 Forward
SEN0066-carA-missing_gene_synonym_qualifer-CAR31656.1-carbamoyl-phosphate synthase small chain-75449:76597 Forward
SEN0067-carB-missing_gene_synonym_qualifer-CAR31657.1-carbamoyl-phosphate synthase large chain-76616:79843 Forward
SEN0089-folA-missing_gene_synonym_qualifer-CAR31676.1-dihydrofolate reductase type I-100408:100887 Forward
SEN0092-ksgA-missing_gene_synonym_qualifer-CAR31679.1-dimethyladenosine transferase-102232:103053 Reverse
SEN0094-surA-missing_gene_synonym_qualifer-CAR31681.1-survival protein SurA precursor-104039:105325 Reverse
SEN0113-leuB-missing_gene_synonym_qualifer-CAR31702.1-3-isopropylmalate dehydrogenase-130762:131853 Reverse
SEN0124-murE-missing_gene_synonym_qualifer-CAR31713.1-UDP-N-acetylmuramoylalanyl-D-glutamate--2,6-dia minopim ligase-143165:144652 Forward
SEN0125-murF-missing_gene_synonym_qualifer-CAR31714.1-UDP-N-acetylmuramoylalanyl-D-glutamyl-2,6-diami nopimelate--D-alan alanyl ligase-144649:146007 Forward
My question is: how can I use these two files to extract, from File A, the sequence of each protein listed in File B? (In this case there are only 20 proteins, but I also have cases with 1000 proteins!)
I started an RStudio course last week. Is there a script I could use for this task?
Thanks a lot in advance!
Best!
Solasol
Welcome to Biostars. What have you tried? Text processing like this is much simpler in Perl, Python, or with Linux command-line tools.
Dear Vari,
I tried in R but did not manage to write a script!
I have barely used R, so all of this is a black box to me :)
Cheers
If not R, you can look at this
If you don't get this sorted out today, just reply to this comment and I will post something in Python that you can use to accomplish the task easily.
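For reference, here is a minimal Python sketch of the kind of script that could do this, using only the standard library. It assumes that each line of File B matches a File A header exactly (everything after the ">"), once surrounding whitespace is trimmed, and that File B starts with the "Protein IDs" title line shown in the question. The function names and the file names in the usage comment are made up for illustration.

```python
import sys

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def read_ids(lines):
    """Return the set of wanted IDs, skipping blank lines and the title line."""
    ids = set()
    for line in lines:
        line = line.strip()
        if line and line != "Protein IDs":  # title line assumed from the example
            ids.add(line)
    return ids

def extract(fasta_lines, id_lines, width=60):
    """Return (FASTA text of the matching records, sorted list of IDs not found)."""
    wanted = read_ids(id_lines)
    found, out = set(), []
    for header, seq in read_fasta(fasta_lines):
        if header in wanted:  # exact match on the full header (assumption)
            found.add(header)
            out.append(">" + header)
            # re-wrap the sequence at `width` residues per line
            out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    text = "\n".join(out) + ("\n" if out else "")
    return text, sorted(wanted - found)

if __name__ == "__main__" and len(sys.argv) == 4:
    # Hypothetical usage: python extract_by_id.py fileA.fasta fileB.txt out.fasta
    with open(sys.argv[1]) as fa, open(sys.argv[2]) as ids:
        text, missing = extract(fa, ids)
    with open(sys.argv[3], "w") as out:
        out.write(text)
    for name in missing:
        print("not found in FASTA:", name, file=sys.stderr)
```

IDs that have no matching header are reported on stderr rather than silently dropped, which should help spot entries such as SEN0011 that are not in the FASTA excerpt shown above. If the headers differ slightly between the two files, matching on the locus tag alone (the part before the first "-") might be more robust, but that is a design choice to check against the real data.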