Forum: Merge header by extracting protein sequence
8.8 years ago
solasol • 0

Hi everyone,

This is my first question on Biostars, and I hope I can get some help with this issue.

I have two files:

File A: a FASTA sequence file (protein format)

Example for File A

File B: a list of X proteins (without their sequences)

Example for File B:

Protein IDs
SEN0002-thrB-missing_gene_synonym_qualifer-CAR31593.1-homoserine kinase-2565:3494 Forward
SEN0003-thrC-missing_gene_synonym_qualifer-CAR31594.1-threonine synthase-3498:4784 Forward
SEN0004-yaaA-missing_gene_synonym_qualifer-CAR31595.1-conserved hypothetical protein-4878:5651 Reverse
SEN0006-talB-missing_gene_synonym_qualifer-CAR31597.1-transaldolase B-7429:8382 Forward
SEN0007-mog-missing_gene_synonym_qualifer-CAR31598.1-molybdopterin biosynthesis Mog protein-8493:9083 Forward
SEN0011-dnaK-missing_gene_synonym_qualifer-CAR31602.1-DnaK protein (heat shock protein 70)-11358:13274 Forward
SEN0012-dnaJ-missing_gene_synonym_qualifer-CAR31603.1-DnaJ protein-13360:14499 Forward
SEN0043-rpsT-missing_gene_synonym_qualifer-CAR31634.1-30S ribosomal protein S20-52034:52297 Reverse
SEN0046-ileS-missing_gene_synonym_qualifer-CAR31637.1-isoleucyl-tRNA synthetase-53609:56443 Forward
SEN0048-slpA-missing_gene_synonym_qualifer-CAR31639.1-probable FkbB-type 16 kD peptidyl-prolyl cis-trans isomerase-57098:57547 Forward
SEN0065-dapB-missing_gene_synonym_qualifer-CAR31655.1-dihydrodipicolinate reductase-73766:74587 Forward
SEN0066-carA-missing_gene_synonym_qualifer-CAR31656.1-carbamoyl-phosphate synthase small chain-75449:76597 Forward
SEN0067-carB-missing_gene_synonym_qualifer-CAR31657.1-carbamoyl-phosphate synthase large chain-76616:79843 Forward
SEN0089-folA-missing_gene_synonym_qualifer-CAR31676.1-dihydrofolate reductase type I-100408:100887 Forward
SEN0092-ksgA-missing_gene_synonym_qualifer-CAR31679.1-dimethyladenosine transferase-102232:103053 Reverse
SEN0094-surA-missing_gene_synonym_qualifer-CAR31681.1-survival protein SurA precursor-104039:105325 Reverse
SEN0113-leuB-missing_gene_synonym_qualifer-CAR31702.1-3-isopropylmalate dehydrogenase-130762:131853 Reverse
SEN0124-murE-missing_gene_synonym_qualifer-CAR31713.1-UDP-N-acetylmuramoylalanyl-D-glutamate--2,6-dia minopim ligase-143165:144652 Forward
SEN0125-murF-missing_gene_synonym_qualifer-CAR31714.1-UDP-N-acetylmuramoylalanyl-D-glutamyl-2,6-diami nopimelate--D-alan alanyl ligase-144649:146007 Forward

My question is: how can I use these two files to extract the sequence of each protein listed in file B from file A? (In this case there are only 20 proteins, but I also have cases with 1000 proteins!)

I started a course in RStudio last week. Is there an R script I could use for this task?

Thank you a lot in advance!

Best!

Solasol

R sequence • 2.6k views

Welcome to Biostars. What have you tried? Text processing like this is much simpler in Perl, Python, or with Linux command-line tools.


Dear Vari

I tried in R but did not manage to write a script!

I have barely used R, so all of this is a black box to me :)

Cheers


If not R, you can look at this


If you don't get this sorted out today, just reply to this comment and I will post something in Python that you can use to accomplish the task easily.
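For readers who want to see the general idea in Python, here is a minimal sketch, not the script promised above. It assumes that each line of file B is identical to the header text (after the `>`) of the corresponding record in file A; the thread does not show file A, so that assumption may not hold for the real data. The function names `read_fasta`, `extract_records` and `format_fasta` are made up for this illustration.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; header is the text after '>'."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def extract_records(fasta_lines: Iterable[str],
                    wanted_ids: Iterable[str]) -> List[Tuple[str, str]]:
    """Keep records whose full header exactly matches a wanted ID (assumption)."""
    wanted = {w.strip() for w in wanted_ids if w.strip()}
    return [(h, s) for h, s in read_fasta(fasta_lines) if h in wanted]


def format_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Render records as FASTA text, wrapping sequences at `width` characters."""
    out: List[str] = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in out)


# Possible usage (file names are placeholders, not from the thread):
#   with open("fileA.fasta") as fa, open("fileB.txt") as ids:
#       records = extract_records(fa, ids)
#   with open("selected.fasta", "w") as out:
#       out.write(format_fasta(records))
```

Because the lookup uses a set, the same code should cope with 20 or 1000 IDs without changes. If the FASTA headers only contain part of each ID (for example just the accession), the matching line in `extract_records` would need to be adapted.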

8.8 years ago
GenoMax 147k

Step 1: Get the faSomeRecords utility from Jim Kent at UCSC (Linux link; OS X or source also available).

Step 2: Make the file executable

$ chmod u+x faSomeRecords

Step 3: Run faSomeRecords

$ ./faSomeRecords
faSomeRecords - Extract multiple fa records
usage:
   faSomeRecords in.fa listFile out.fa
  • in.fa = Your sequence file
  • listFile = file with sequence names
  • out.fa = file to store the result
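One practical point when preparing the list file: file B as posted starts with a `Protein IDs` line, which is a column header and not a sequence name. The thread also does not say how names in the list are compared with the FASTA headers. The Python sketch below removes the header line and blank lines from the list, then reports which IDs have no FASTA header with exactly the same text. Exact whole-line matching is an assumption here, and the function names `clean_id_list`, `fasta_names` and `missing_ids` are hypothetical.

```python
from typing import Iterable, List, Set


def clean_id_list(lines: Iterable[str], header: str = "Protein IDs") -> List[str]:
    """Return the IDs from a list file, dropping blank lines and the column header."""
    ids: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line or line == header:
            continue
        ids.append(line)
    return ids


def fasta_names(lines: Iterable[str]) -> Set[str]:
    """Collect the header text (after '>') of every record in a FASTA file."""
    return {line[1:].strip() for line in lines if line.startswith(">")}


def missing_ids(ids: Iterable[str], fasta_lines: Iterable[str]) -> List[str]:
    """List the IDs that do not exactly match any FASTA header (assumed rule)."""
    present = fasta_names(fasta_lines)
    return [i for i in ids if i not in present]
```

An empty result from `missing_ids` suggests every requested protein has a matching header. A non-empty result usually points to a formatting difference between the two files, such as extra text in the header or a differently spelled ID.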

Dear genomax2,

Sorry, this might sound like a naive question, but since I really have no experience with these scripts and software, I am a little confused. I clicked on the link you sent; with which program should I open this file? And where should I run faSomeRecords?

Thanks in advance!


You should save the file linked above (right-click on the link and choose "Save as", or use wget to download it directly) to a Linux machine (this file is meant for use with Linux and will not work on Windows). The linked file is an executable program, and you run it as I showed above.

Do you have access to a Linux server/computer? What OS are you using?


I am using Windows 7. I don't have access to a Linux server, so I think in this case I should download Linux.


Downloading Linux would be best, but if you don't feel comfortable installing a second OS, you could also download Cygwin to get a Linux-like environment directly on your Windows computer.


If you are going to work with bioinformatics programs, it is highly advisable to familiarize yourself with the command line (Linux). Here is a nice online resource you can use.

For simple tasks like this, you can also run Linux in a virtual machine on Windows.
