10.3 years ago
sw.arker
Hello,
I want to extract specific strings from a CSV file, using Excel, R, or Python.
For example, as shown below: I want to find the string from column A in column B and return, in column C, the window with 5 amino acids before and after N. Thanks!!
A =
INETTDFR
C =
IVGKINETTDF
B (wrapped over the following lines) =
MHRFLLMLLFPFSDNRPMMFFRSFIVFFFLIFFASNVSSRKQTYVIHT
VTTSTKHIVTSLFNSLQTENINDDDFSLPEIHYIYENAMSGFSATLTDDQLDT
VKNTKGFISAYPDELLSLHTTYSHEFLGLEFGIGLWNETSLSSDVIIGLVDTG
ISPEHVSFRDTHMTPVPSRWRGSCDEGTNFSSSECNKKIIGASAFYKGYE
SIVGKINETTDFRSTRDAQGHGTHTASTAAGDIVPKANYFGQAKGLASGM
RFTSRIAAYKACWALGCASTDVIAAIDRAILDGVDVISLSLGGSSRPFYVDP
IAIAGFGAMQKNIFVSCSAGNSGPTASTVSNGAPWLMTVAASYTDRTFPAIV
RIGNRKSLVGSSLYKGKSLKNLPLAFNRTAGEESGAVFCIRDSLKRELVEGK
IVICLRGASGRTAKGEEVKRSGGAAMLLVSTEAEGEELLADPHVLPAVSLGF
SDGKTLLNYLAGAANATASVRFRGTAYGATAPMVAAFSSRGPSVAGPEIAKP
DIAAPGLNILAGWSPFSSPSLLRSDPRRVQFNIISGTSMACPHISGIAALIKSV
HGDWSPAMIKSAIMTTARITDNRNRPIGDRGAAGAESAATAFAFGAGNVDPT
RAVDPGLVYDTSTVDYLNYLCSLNYTSERILLFSGTNYTCASNAVVLSPGDLN
YPSFAVNLVNGANLKTVRYKRTVTNVGSPTCEYMVHVEEPKGVKVRVEPKVL
KFQKARERLSYTVTYDAEASRNSSSSSFGVLVWICDKYNVRSPIAVTWE
OK, so what have you tried? This should be pretty straightforward in Python or R (no clue about Excel).
The provided example doesn't seem to make sense:
A =
INETTDFR
B =
...IGASAFYKGYESIVGKINETTDFRSTRDAQGHGTHTAST...
C =
IVGKINETTDF
The goal is to extract a string from B (the whole protein sequence) according to A (the identified peptide sequence), and output it as C, taking 5 additional amino acids before and after N.
The comment was due to the example not including the 5 amino acids following the matched string (in fact, it didn't even include the entire matched string).
The output does not include the entire string A, that's true, because the aim is to get a sequence window with "N" in the middle, 5 aa in front of it and 5 aa after it. The originally identified peptide string A does not provide a uniform sequence window with "N" in the middle. That's what I am trying to get by pulling out the specific sequence window from the original protein sequence.
Ah, both zx8754 and I misread then. I think we both took N as a variable needle.
Sorry for the confusion. The N (asparagine) is the potentially glycosylated residue, and I am trying to get a uniform peptide window with the N in the middle. That makes the data easier to handle and more suitable for pattern analysis.
I modified my script to anchor on the residue (or residues) of interest within A. I think it should work but you probably would want to test it, first.
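The script itself is not reproduced in this thread, so the following is only a minimal sketch of the anchoring idea, not the actual script. The function name `windows_around` and its parameters are made up for illustration. It assumes that only the first occurrence of A in B matters and that a window is simply cut short when N sits close to either end of the protein.

```python
def windows_around(peptide, protein, residue="N", flank=5):
    """Return one window per `residue` found in `peptide`, cut from `protein`.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    start = protein.find(peptide)  # first occurrence of A in B
    if start == -1:
        return []  # peptide not present in this protein
    windows = []
    for offset, aa in enumerate(peptide):
        if aa != residue:
            continue
        pos = start + offset        # index of the anchor residue in the protein
        left = max(0, pos - flank)  # clip at the start of the protein
        windows.append(protein[left:pos + flank + 1])
    return windows


# Fragment of B as quoted earlier in the thread (ellipses dropped).
protein = "IGASAFYKGYESIVGKINETTDFRSTRDAQGHGTHTAST"
print(windows_around("INETTDFR", protein))  # ['IVGKINETTDF']
```

If A contains more than one N, the list holds one window per N, in the order they appear in A.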
Thanks, Alex! I will definitely try it. By the way, I just found another, more complicated way to get it using Excel and a regular expression tool, just as my backup ;-)
This is a basic programming task; check out the csv module in Python.
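As a rough illustration of that suggestion, here is a sketch that reads columns A and B with `csv.DictReader` and writes A, B and a computed C with `csv.DictWriter`. The column names follow the question; the helpers `n_window` and `add_window_column` are hypothetical, and the sketch assumes one peptide and one protein per row and anchors on the first N in A only.

```python
import csv
import io


def n_window(peptide, protein, flank=5):
    """Window around the first N of `peptide`, located in `protein` ('' if not found)."""
    start = protein.find(peptide)
    offset = peptide.find("N")
    if start == -1 or offset == -1:
        return ""
    pos = start + offset
    return protein[max(0, pos - flank):pos + flank + 1]


def add_window_column(src, dst):
    """Read rows with columns A and B from `src`; write A, B, C to `dst`."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["A", "B", "C"])
    writer.writeheader()
    for row in reader:
        writer.writerow({"A": row["A"], "B": row["B"],
                         "C": n_window(row["A"], row["B"])})


# In-memory stand-ins for the input and output files.
src = io.StringIO("A,B\nINETTDFR,IGASAFYKGYESIVGKINETTDFRSTRDAQGHGTHTAST\n")
dst = io.StringIO()
add_window_column(src, dst)
print(dst.getvalue())
```

With real files, the same function can be given handles opened with `open(path, newline="")`, which is what the csv documentation recommends.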
Thanks Alex!!
I will try it and modify it accordingly.
Cheers!