Question

Specific string extraction using Excel, R or Python

1

10.5 years ago

sw.arker ▴ 70

Hello,

I want to extract some specific strings from a CSV file, using either Excel, R or Python.

For example, as shown below: I want to find the string from column A in column B, and return in column C the N together with the 5 amino acids before and after it. Thanks!

A           B                                                       C
INETTDFR    MHRFLLMLLFPFSDNRPMMFFRSFIVFFFLIFFASNVSSRKQTYVIHT        IVGKINETTDF
            VTTSTKHIVTSLFNSLQTENINDDDFSLPEIHYIYENAMSGFSATLTDDQLDT
            VKNTKGFISAYPDELLSLHTTYSHEFLGLEFGIGLWNETSLSSDVIIGLVDTG
            ISPEHVSFRDTHMTPVPSRWRGSCDEGTNFSSSECNKKIIGASAFYKGYE
            SIVGKINETTDFRSTRDAQGHGTHTASTAAGDIVPKANYFGQAKGLASGM
            RFTSRIAAYKACWALGCASTDVIAAIDRAILDGVDVISLSLGGSSRPFYVDP
            IAIAGFGAMQKNIFVSCSAGNSGPTASTVSNGAPWLMTVAASYTDRTFPAIV
            RIGNRKSLVGSSLYKGKSLKNLPLAFNRTAGEESGAVFCIRDSLKRELVEGK
            IVICLRGASGRTAKGEEVKRSGGAAMLLVSTEAEGEELLADPHVLPAVSLGF
            SDGKTLLNYLAGAANATASVRFRGTAYGATAPMVAAFSSRGPSVAGPEIAKP
            DIAAPGLNILAGWSPFSSPSLLRSDPRRVQFNIISGTSMACPHISGIAALIKSV
            HGDWSPAMIKSAIMTTARITDNRNRPIGDRGAAGAESAATAFAFGAGNVDPT
            RAVDPGLVYDTSTVDYLNYLCSLNYTSERILLFSGTNYTCASNAVVLSPGDLN
            YPSFAVNLVNGANLKTVRYKRTVTNVGSPTCEYMVHVEEPKGVKVRVEPKVL
            KFQKARERLSYTVTYDAEASRNSSSSSFGVLVWICDKYNVRSPIAVTWE

python R excel • 5.4k views

ADD COMMENT • link updated 3.2 years ago by Ram 44k • written 10.5 years ago by sw.arker ▴ 70

2

OK, so what have you tried? This should be pretty straightforward in Python or R (no clue about Excel).

ADD REPLY • link 10.5 years ago by Devon Ryan 105k

1

The provided example doesn't seem to make sense:

A = INETTDFR

B = ...IGASAFYKGYESIVGKINETTDFRSTRDAQGHGTHTAST...

C = IVGKINETTDF

ADD REPLY • link updated 3.2 years ago by Ram 44k • written 10.5 years ago by zx8754 12k

0

The goal is to extract the string from B (the whole protein sequence) according to A (the identified peptide sequence window), and output it as C, taking an additional 5 amino acids before and after N.

ADD REPLY • link updated 3.2 years ago by Ram 44k • written 10.5 years ago by sw.arker ▴ 70

1

The comment was due to the example not including the 5 amino acids following the matched string (in fact, it didn't even include the entire matched string).

ADD REPLY • link updated 3.2 years ago by Ram 44k • written 10.5 years ago by Devon Ryan 105k

0

The output does not include the entire string A, that's true, because the aim is to get the sequence window with "N" in the middle, 5 aa in front and 5 aa after. The originally identified peptide string A does not provide a uniform peptide sequence window with "N" in the middle. That's what I am trying to get by pulling out the specific sequence window from the original protein sequence.

ADD REPLY • link 10.5 years ago by sw.arker ▴ 70

1

Ah, both zx8754 and I misread then. I think we both took N as a variable needle.

ADD REPLY • link 10.5 years ago by Devon Ryan 105k

1

Sorry for the confusion. The N (asparagine) is potentially glycosylated, and I am trying to get a uniform peptide window with the N in the middle. That makes it easier and more suitable for pattern analysis.
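
To make the target concrete, here is a minimal Python sketch of that window, assuming the peptide occurs once in the protein and that only the first N in the peptide matters. The function name `n_window` and the `flank` parameter are illustrative choices, not something from the thread; the protein string is the fragment quoted earlier in the discussion.

```python
def n_window(peptide, protein, flank=5, residue="N"):
    """Return the window centred on the first `residue` of `peptide` inside `protein`.

    Returns None if the peptide is not found or contains no `residue`.
    """
    start = protein.find(peptide)    # 0-based offset of the peptide in the protein
    offset = peptide.find(residue)   # 0-based offset of the residue in the peptide
    if start == -1 or offset == -1:
        return None
    centre = start + offset          # 0-based position of the residue in the protein
    left = max(0, centre - flank)    # clip at the start of the protein
    return protein[left:centre + flank + 1]


protein = "IGASAFYKGYESIVGKINETTDFRSTRDAQGHGTHTAST"
print(n_window("INETTDFR", protein))  # IVGKINETTDF
```

Windows that run into either end of the protein come back shorter than 11 residues; padding them would be a separate decision.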

ADD REPLY • link updated 3.2 years ago by Ram 44k • written 10.5 years ago by sw.arker ▴ 70

1

I modified my script to anchor on the residue (or residues) of interest within A. I think it should work, but you would probably want to test it first.
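
For illustration only (this is not the script mentioned above), anchoring on every residue of interest within A could look roughly like the following Python sketch. The function name `all_n_windows`, the example peptide `DNRNRPIGDR` and the protein fragment are assumptions chosen to show a peptide with two N residues.

```python
def all_n_windows(peptide, protein, flank=5, residue="N"):
    """Return one window per `residue` in `peptide`, taken from `protein`.

    Returns an empty list if the peptide is not found in the protein.
    """
    start = protein.find(peptide)
    if start == -1:
        return []
    windows = []
    for offset, aa in enumerate(peptide):
        if aa == residue:
            centre = start + offset  # position of this residue in the protein
            windows.append(protein[max(0, centre - flank):centre + flank + 1])
    return windows


# Hypothetical peptide containing two asparagines
print(all_n_windows("DNRNRPIGDR", "TTARITDNRNRPIGDRGAAG"))
# ['ARITDNRNRPI', 'ITDNRNRPIGD']
```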

ADD REPLY • link 10.5 years ago by Alex Reynolds 36k

0

Thanks, Alex! I will definitely try it. By the way, I just found another, more complicated way to get it using Excel and a regular expression tool, just as my backup ;-)

ADD REPLY • link 10.5 years ago by sw.arker ▴ 70

0

This is a basic programming task; check out the csv module in Python.
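
As a rough sketch of what that might look like, the following reads rows with `csv.DictReader` and writes them back with an added column. It assumes the file has a header row with columns named `A` and `B`, which the question does not confirm; `n_window` and `add_window_column` are illustrative names, and the demo uses in-memory text instead of a real file.

```python
import csv
import io


def n_window(peptide, protein, flank=5, residue="N"):
    """Window centred on the first `residue` of `peptide` in `protein` ('' if absent)."""
    start = protein.find(peptide)
    offset = peptide.find(residue)
    if start == -1 or offset == -1:
        return ""
    centre = start + offset
    return protein[max(0, centre - flank):centre + flank + 1]


def add_window_column(src, dst):
    """Copy CSV rows from `src` to `dst`, appending a column C with the window."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=list(reader.fieldnames) + ["C"])
    writer.writeheader()
    for row in reader:
        row["C"] = n_window(row["A"], row["B"])
        writer.writerow(row)


# In-memory demo with assumed column names A and B
src = io.StringIO("A,B\nINETTDFR,IGASAFYKGYESIVGKINETTDFRSTRDAQGHGTHTAST\n")
dst = io.StringIO()
add_window_column(src, dst)
print(dst.getvalue())
```

With real files, open both with `newline=""` as the csv module documentation recommends, and pass the file objects in place of the `StringIO` buffers.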

ADD REPLY • link 10.5 years ago by pld 5.1k

0

Thanks Alex!!

I will try it and modify it accordingly.

Cheers!

ADD REPLY • link updated 3.2 years ago by Ram 44k • written 10.5 years ago by sw.arker ▴ 70

Ram · Accepted Answer · 2014-08-14

7

10.5 years ago

zx8754 12k

Using Excel*:

=MID(B1,FIND(A1,B1)+FIND("N",A1)-6,5) & "N" & MID(B1,FIND(A1,B1)+FIND("N",A1),5)

*Don't use Excel :)
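
To check the index arithmetic in the formula, here is a small Python sketch that mimics Excel's 1-based `FIND` and `MID` and evaluates the same expression. The helper names `find` and `mid` are stand-ins written for this example, and the cell values are the fragment quoted earlier in the thread.

```python
def find(needle, haystack):
    """Excel-style FIND: 1-based, case-sensitive position of needle in haystack."""
    pos = haystack.find(needle)
    if pos == -1:
        raise ValueError("#VALUE!")  # Excel returns #VALUE! when nothing is found
    return pos + 1


def mid(text, start, length):
    """Excel-style MID: `length` characters of text from 1-based `start`."""
    if start < 1:
        raise ValueError("#VALUE!")  # Excel returns #VALUE! for start_num < 1
    return text[start - 1:start - 1 + length]


a1 = "INETTDFR"
b1 = "IGASAFYKGYESIVGKINETTDFRSTRDAQGHGTHTAST"

# FIND(A1,B1)+FIND("N",A1)-1 is the 1-based position of N in B1,
# so -6 is five residues before it and +0 is one residue after it.
c1 = (mid(b1, find(a1, b1) + find("N", a1) - 6, 5)
      + "N"
      + mid(b1, find(a1, b1) + find("N", a1), 5))
print(c1)  # IVGKINETTDF
```

The `start < 1` check shows where the formula breaks down: if the N sits within the first five residues of the protein, Excel's `MID` returns an error instead of a shortened window.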

ADD COMMENT • link updated 3.2 years ago by Ram 44k • written 10.5 years ago by zx8754 12k

0

Super!! Thanks zx8754, it works. Actually, I was using a small part of the same code in combination with a regular expression web tool to accomplish the job ;-)

ADD REPLY • link 10.5 years ago by sw.arker ▴ 70

1

I am a rookie at coding (mostly depending on Excel), but more and more I am finding that for bigger and more complicated tasks, coding makes things easier and more productive!!

ADD REPLY • link updated 3.2 years ago by Ram 44k • written 10.5 years ago by sw.arker ▴ 70

zx8754 · Accepted Answer · 2014-08-14

6

10.5 years ago

Guangchuang Yu ★ 2.6k

Using R:

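# Return the match of A in B together with the N characters preceding it.
# Note: here N is a count of preceding residues, not the asparagine.
getPreceding <- function(A, B, N = 4) {
  # Position of the first match of A in B (A is treated as a regular expression)
  x <- regexpr(A, B)
  # From N characters before the match to the last character of the match
  substring(B, x - N, x + attr(x, "match.length") - 1)
}

A <- "INETTDFR"
B <- "...IGASAFYKGYESIVGKINETTDFRSTRDAQGHGTHTAST..."
getPreceding(A, B)
# [1] "IVGKINETTDFR"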

ADD COMMENT • link updated 6.9 years ago by zx8754 12k • written 10.5 years ago by Guangchuang Yu ★ 2.6k