Specific string extraction using Excel, R or Python
10.3 years ago
sw.arker ▴ 70

Hello,

I want to extract some specific strings from a CSV file, using either Excel, R or Python.

For example, as shown below: I want to find the string from column A in column B and return, in column C, the 5 amino acids before and after the N. Thanks!

A           B                                                       C
INETTDFR    MHRFLLMLLFPFSDNRPMMFFRSFIVFFFLIFFASNVSSRKQTYVIHT        IVGKINETTDF
            VTTSTKHIVTSLFNSLQTENINDDDFSLPEIHYIYENAMSGFSATLTDDQLDT
            VKNTKGFISAYPDELLSLHTTYSHEFLGLEFGIGLWNETSLSSDVIIGLVDTG
            ISPEHVSFRDTHMTPVPSRWRGSCDEGTNFSSSECNKKIIGASAFYKGYE
            SIVGKINETTDFRSTRDAQGHGTHTASTAAGDIVPKANYFGQAKGLASGM
            RFTSRIAAYKACWALGCASTDVIAAIDRAILDGVDVISLSLGGSSRPFYVDP
            IAIAGFGAMQKNIFVSCSAGNSGPTASTVSNGAPWLMTVAASYTDRTFPAIV
            RIGNRKSLVGSSLYKGKSLKNLPLAFNRTAGEESGAVFCIRDSLKRELVEGK
            IVICLRGASGRTAKGEEVKRSGGAAMLLVSTEAEGEELLADPHVLPAVSLGF
            SDGKTLLNYLAGAANATASVRFRGTAYGATAPMVAAFSSRGPSVAGPEIAKP
            DIAAPGLNILAGWSPFSSPSLLRSDPRRVQFNIISGTSMACPHISGIAALIKSV
            HGDWSPAMIKSAIMTTARITDNRNRPIGDRGAAGAESAATAFAFGAGNVDPT
            RAVDPGLVYDTSTVDYLNYLCSLNYTSERILLFSGTNYTCASNAVVLSPGDLN
            YPSFAVNLVNGANLKTVRYKRTVTNVGSPTCEYMVHVEEPKGVKVRVEPKVL
            KFQKARERLSYTVTYDAEASRNSSSSSFGVLVWICDKYNVRSPIAVTWE
python R excel • 5.2k views

OK, so what have you tried? This should be pretty straightforward in Python or R (no clue about Excel).


Provided example doesn't seem to make sense:

A = INETTDFR

B = ...IGASAFYKGYESIVGKINETTDFRSTRDAQGHGTHTAST...

C = IVGKINETTDF


The goal is to extract a string from B (the whole protein sequence) according to A (the identified peptide sequence window) and output it as C, taking an additional 5 amino acids before and after the N.


The comment was due to the example not including the 5 amino acids following the matched string (in fact, it didn't even include the entire matched string).


The output does not include the entire string A, that's true, because the aim is to get the sequence window with "N" in the middle, with 5 aa in front and 5 aa after. The originally identified peptide string A does not provide a uniform peptide sequence window with "N" in the middle. That's what I am trying to get by pulling the specific sequence window out of the original protein sequence.
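The window described here could be sketched in Python roughly as follows. This is only a sketch under assumptions: the peptide from A occurs in B, the first N in A is the residue of interest, and windows near the sequence start are simply clipped. The names `n_window` and `flank` are illustrative and do not come from the thread; the protein string is the fragment quoted earlier, without the elided ends.

```python
def n_window(peptide, protein, flank=5):
    """Return `flank` residues on each side of the first N in `peptide`,
    cut from `protein`; None if the peptide or the N cannot be located."""
    start = protein.find(peptide)   # 0-based position of A within B
    offset = peptide.find("N")      # 0-based position of N within A
    if start == -1 or offset == -1:
        return None
    pos = start + offset            # position of the N within B
    lo = max(0, pos - flank)        # clip if N is close to the sequence start
    return protein[lo:pos + flank + 1]

peptide = "INETTDFR"
protein = "IGASAFYKGYESIVGKINETTDFRSTRDAQGHGTHTAST"
print(n_window(peptide, protein))   # IVGKINETTDF
```

For the example in the question this gives the same string as column C.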


Ah, both zx8754 and I misread then. I think we both took N as a variable needle.


Sorry for the confusion. The N (asparagine) is potentially glycosylated, and I am trying to get a uniform peptide window with the N in the middle. That makes it easier and more suitable for pattern analysis.


I modified my script to anchor on the residue (or residues) of interest within A. I think it should work but you probably would want to test it, first.


Thanks, Alex! I will definitely try it. By the way, I just found another, more complicated way to do it using Excel and a regular expression tool, as my backup ;-)


This is a basic programming task; check out the csv module in Python.
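As a rough illustration of the csv-module route, the sketch below reads rows with columns A and B and writes them back out with a filled-in column C. It assumes a header row named `A,B`, one complete protein sequence per cell, and that the first N in A is the site of interest. The helper names (`first_n_window`, `add_window_column`) are hypothetical, and the demo uses in-memory text instead of real files.

```python
import csv
import io

def first_n_window(peptide, protein, flank=5):
    """Window of `flank` residues around the first N of `peptide` in `protein`."""
    start = protein.find(peptide)
    offset = peptide.find("N")
    if start == -1 or offset == -1:
        return ""                       # leave C empty when nothing is found
    pos = start + offset
    return protein[max(0, pos - flank):pos + flank + 1]

def add_window_column(infile, outfile):
    """Copy rows from infile to outfile, adding the window as column C."""
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=["A", "B", "C"],
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow({"A": row["A"], "B": row["B"],
                         "C": first_n_window(row["A"], row["B"])})

# In-memory demo; for real files pass objects opened with newline="".
src = io.StringIO("A,B\nINETTDFR,IGASAFYKGYESIVGKINETTDFRSTRDAQGHGTHTAST\n")
dst = io.StringIO()
add_window_column(src, dst)
print(dst.getvalue())
```

If the protein sequence is wrapped over several lines in the file, as in the question, it would need to be joined into one string before this step.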


Thanks Alex!!

I will try it and modify it accordingly.

Cheers!

10.3 years ago
zx8754 12k

Using Excel*:

=MID(B1,FIND(A1,B1)+FIND("N",A1)-6,5) & "N" & MID(B1,FIND(A1,B1)+FIND("N",A1),5)

*Don't use Excel :)
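The formula uses the first N found in A. If a peptide carries more than one N, a Python sketch along these lines would return one window per N. It assumes A occurs in B and uses only the first occurrence; `all_n_windows` and `flank` are made-up names for illustration, and windows near the sequence ends are clipped, not padded.

```python
def all_n_windows(peptide, protein, flank=5):
    """Return one window per N in `peptide`, cut from `protein`.

    Each window holds up to `flank` residues on either side of the N.
    An empty list is returned when `peptide` is not found in `protein`.
    """
    start = protein.find(peptide)       # first occurrence of A in B
    if start == -1:
        return []
    windows = []
    for offset, residue in enumerate(peptide):
        if residue == "N":
            pos = start + offset        # position of this N within B
            windows.append(protein[max(0, pos - flank):pos + flank + 1])
    return windows

protein = "IGASAFYKGYESIVGKINETTDFRSTRDAQGHGTHTAST"
print(all_n_windows("INETTDFR", protein))   # ['IVGKINETTDF']
```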


Super!! Thanks zx8754, it works. Actually, I was using a small part of the same formula in combination with a regular expression web tool to accomplish the job ;-)


I am a rookie at coding (mostly dependent on Excel), but more and more I find that for bigger and more complicated tasks, coding makes things easier and more productive!

10.3 years ago
Guangchuang Yu ★ 2.6k

Using R:

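# Return the match of A in B together with the N characters preceding it.
# Note: N here is a count of preceding characters, not the asparagine residue.
getPreceding <- function(A, B, N = 4) {
  x <- regexpr(A, B, fixed = TRUE)  # 1-based start of A in B, -1 if absent
  if (x == -1) return(NA_character_)
  substring(B, x - N, x + attr(x, "match.length") - 1)
}

A <- "INETTDFR"
B <- "...IGASAFYKGYESIVGKINETTDFRSTRDAQGHGTHTAST..."
getPreceding(A, B)
# [1] "IVGKINETTDFR"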