Hi everyone,
This is my first question on Biostars, and I hope I can get some help with this issue.
I have two files:
File A: a FASTA file of protein sequences
Example for File A
File B: a list of X proteins (IDs only, without their sequences)
Example for File B:
Protein IDs
SEN0002-thrB-missing_gene_synonym_qualifer-CAR31593.1-homoserine kinase-2565:3494 Forward
SEN0003-thrC-missing_gene_synonym_qualifer-CAR31594.1-threonine synthase-3498:4784 Forward
SEN0004-yaaA-missing_gene_synonym_qualifer-CAR31595.1-conserved hypothetical protein-4878:5651 Reverse
SEN0006-talB-missing_gene_synonym_qualifer-CAR31597.1-transaldolase B-7429:8382 Forward
SEN0007-mog-missing_gene_synonym_qualifer-CAR31598.1-molybdopterin biosynthesis Mog protein-8493:9083 Forward
SEN0011-dnaK-missing_gene_synonym_qualifer-CAR31602.1-DnaK protein (heat shock protein 70)-11358:13274 Forward
SEN0012-dnaJ-missing_gene_synonym_qualifer-CAR31603.1-DnaJ protein-13360:14499 Forward
SEN0043-rpsT-missing_gene_synonym_qualifer-CAR31634.1-30S ribosomal protein S20-52034:52297 Reverse
SEN0046-ileS-missing_gene_synonym_qualifer-CAR31637.1-isoleucyl-tRNA synthetase-53609:56443 Forward
SEN0048-slpA-missing_gene_synonym_qualifer-CAR31639.1-probable FkbB-type 16 kD peptidyl-prolyl cis-trans isomerase-57098:57547 Forward
SEN0065-dapB-missing_gene_synonym_qualifer-CAR31655.1-dihydrodipicolinate reductase-73766:74587 Forward
SEN0066-carA-missing_gene_synonym_qualifer-CAR31656.1-carbamoyl-phosphate synthase small chain-75449:76597 Forward
SEN0067-carB-missing_gene_synonym_qualifer-CAR31657.1-carbamoyl-phosphate synthase large chain-76616:79843 Forward
SEN0089-folA-missing_gene_synonym_qualifer-CAR31676.1-dihydrofolate reductase type I-100408:100887 Forward
SEN0092-ksgA-missing_gene_synonym_qualifer-CAR31679.1-dimethyladenosine transferase-102232:103053 Reverse
SEN0094-surA-missing_gene_synonym_qualifer-CAR31681.1-survival protein SurA precursor-104039:105325 Reverse
SEN0113-leuB-missing_gene_synonym_qualifer-CAR31702.1-3-isopropylmalate dehydrogenase-130762:131853 Reverse
SEN0124-murE-missing_gene_synonym_qualifer-CAR31713.1-UDP-N-acetylmuramoylalanyl-D-glutamate--2,6-dia minopim ligase-143165:144652 Forward
SEN0125-murF-missing_gene_synonym_qualifer-CAR31714.1-UDP-N-acetylmuramoylalanyl-D-glutamyl-2,6-diami nopimelate--D-alan alanyl ligase-144649:146007 Forward
My question is: how can I use these two files to extract the sequence of each protein listed in File B from File A? (In this case there are only 20 proteins, but I also have cases with 1000 proteins!)
I started an RStudio course last week. Is there a script I could use for this task?
Thanks a lot in advance!
Best!
Solasol
Welcome to Biostars. What have you tried? Text processing like this is much simpler in Perl, Python, or with Linux command-line tools.
Dear Vari
I tried in R but did not manage to write a script!
I have barely used R, so all of this is a black box to me :)
Cheers
If not R, you can look at this
If you don't get this sorted out today, just reply to this comment and I will post something in Python that you can use to accomplish the task easily.
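For reference, here is a minimal Python sketch of one way to do this. It is not tested against your actual files. It assumes that each FASTA header in File A (the text after ">") is exactly the same string as the corresponding line in File B, and that File B starts with the "Protein IDs" header line shown above. If your headers only share part of the ID (for example just "SEN0002"), the matching step would need to be adapted. The function names and file names here are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None and line:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_ids(lines, skip_header="Protein IDs"):
    """Return the set of IDs in File B, ignoring blanks and the header line."""
    ids = set()
    for line in lines:
        name = line.strip()
        if name and name != skip_header:
            ids.add(name)
    return ids


def extract(fasta_lines, id_lines):
    """Return (found records, IDs without a match).

    Assumption: a record matches when its full header equals an ID line.
    """
    wanted = read_ids(id_lines)
    found = [(h, s) for h, s in read_fasta(fasta_lines) if h in wanted]
    missing = wanted - {h for h, _ in found}
    return found, missing


def write_subset(fasta_path, ids_path, out_path):
    """Write the matching records to out_path and report unmatched IDs."""
    with open(fasta_path) as fasta, open(ids_path) as ids:
        found, missing = extract(fasta, ids)
    with open(out_path, "w") as out:
        for header, seq in found:
            out.write(f">{header}\n{seq}\n")
    for name in sorted(missing):
        print("not found in FASTA:", name)
```

You would call it as something like write_subset("fileA.fasta", "fileB.txt", "subset.fasta"), with your own file names. Because the IDs are kept in a set, it should handle 1000 proteins as easily as 20, and any ID from File B that has no match in File A is printed so you can check it by hand.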