Question

Deleting certain rows of a DataFrame using regular expressions and pandas

2.4 years ago

matt81rd ▴ 10

Hi All,

I am sure this is probably very simple, but I am getting confused about how to define the pattern I want to delete from my dataframe.

In my dataframe I have a column called 'INFO:Variant', and within this column are SNPs (reference amino acid, position and alternative amino acid), e.g. 'V179G'. However, the dataframe also contains some SNPs where the reference and alternative amino acids are the same (e.g. 'V317V'), and I need to remove all of these. I'm sure I need to do something like the code below, but I'm not sure what the pattern should be, as the ref and alt amino acids vary throughout. Is there some way to define the pattern as a pair of the same letter of the alphabet flanking numbers?

    patternDel = r"[A-Z]\d[A-Z]"
    mask = df1['INFO:Variant'].str.contains(patternDel)
    df2 = df1[~mask]

I have just selected a random sample of rows from my table here, but I need to remove rows 3, 8, 13 and 17:

CHROM  POS      ID  REF  ALT  QUAL  FILTER  INFO:Gene  INFO:Variant
ChrI   3987274  .   t    c    .     .       ddn        I144T
ChrI   2715280  .   a    c    .     .       eis        L18R
ChrI   4244183  .   g    c    .     .       embA       V317V
ChrI   4247553  .   g    c    .     .       embB       S347T
ChrI   4326938  .   a    c    .     .       ethA       V179G
ChrI   1674048  .   g    a    .     .       fabG1      g-154a
ChrI   4408102  .   c    t    .     .       gid        G34E
ChrI   8898     .   c    t    .     .       gyrA       L533L
ChrI   5886     .   a    g    .     .       gyrB       D216G
ChrI   1674772  .   g    a    .     .       inhA       A191T
ChrI   2154503  .   t    c    .     .       katG       K537E
ChrI   2288914  .   c    a    .     .       pncA       D110Y
ChrI   801351   .   g    a    .     .       rplC       L181L
ChrI   760402   .   t    c    .     .       rpoB       I199T
ChrI   781373   .   g    c    .     .       rpsL       g-187c
ChrI   1472846  .   c    a    .     .       rrs        c1001a
ChrI   1918497  .   g    a    .     .       tlyA       E186E

python pandas Regex • 1.3k views

ADD COMMENT • link updated 2.4 years ago by iraun 6.2k • written 2.4 years ago by matt81rd ▴ 10

score 3 · Accepted Answer · 2022-11-22

2.4 years ago

iraun 6.2k

Something like this?

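    import re

    def start_and_end_with_same(sample_str):
        """Return True if the string starts and ends with the same letter."""
        # Note: [A-z] also matches a few punctuation characters that sit
        # between 'Z' and 'a' in ASCII; [A-Za-z] would be stricter.
        pattern = r'^[A-z]$|^([A-z]).*\1$'
        return bool(re.search(pattern, sample_str))

    # Flag rows whose reference and alternative amino acids are identical
    df['same_aminoacids'] = df['INFO:Variant'].apply(start_and_end_with_same)
    # Filter out equal rows (assign the result, otherwise df is unchanged)
    df = df[~df['same_aminoacids']]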
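
As a worked check, here is a minimal sketch that applies the same idea to the variants listed in the question. It assumes the column is named 'INFO:Variant' as in the table header. It also swaps in a stricter pattern that is not from the answer above: one uppercase letter, one or more digits, then the same letter again. The names `same_aa`, `mask` and `dropped_rows` are made up for illustration.

```python
import re
import pandas as pd

# The INFO:Variant values from the table in the question, in order
variants = ["I144T", "L18R", "V317V", "S347T", "V179G", "g-154a", "G34E",
            "L533L", "D216G", "A191T", "K537E", "D110Y", "L181L", "I199T",
            "g-187c", "c1001a", "E186E"]
df = pd.DataFrame({"INFO:Variant": variants})

# Assumed stricter pattern: ([A-Z]) captures the reference amino acid,
# \d+ is the position, and \1 must equal the captured letter.
same_aa = re.compile(r"([A-Z])\d+\1")

# True where the whole value is letter + digits + the same letter
mask = df["INFO:Variant"].map(lambda s: same_aa.fullmatch(s) is not None).astype(bool)

# 1-based row numbers that would be dropped
dropped_rows = [i + 1 for i, hit in enumerate(mask) if hit]
print(dropped_rows)  # [3, 8, 13, 17]

# Keep only the rows where reference and alternative differ
filtered = df[~mask]
```

Nucleotide-style entries such as 'g-154a' are left alone by this pattern because it only accepts an uppercase letter followed directly by digits.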

ADD COMMENT • link 2.4 years ago by iraun 6.2k

Thanks iraun, this worked! Could you explain how the pattern r'^[A-z]$|^([A-z]).*\1$' matches the rows I want to remove? I tried for ages to work out the pattern, and the closest I got was "[A-Z]\d[A-Z]".

Thanks!

ADD REPLY • link 2.4 years ago by matt81rd ▴ 10

Sorry, the regex is a bit redundant; r'^([A-z]).*\1$' will suffice. You can use this to understand it :)
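
One way to read the shorter pattern is to spell it out with `re.VERBOSE`, so each piece carries its own comment. This is only a sketch: it uses `[A-Za-z]` in place of `[A-z]` (an assumed substitution, since `[A-z]` also matches a few punctuation characters), and the test strings are taken from the table in the question. The key part is the backreference `\1`, which "[A-Z]\d[A-Z]" lacks: that attempt allows any second letter and only a single digit.

```python
import re

# Same idea as r'^([A-z]).*\1$', written with re.VERBOSE so each part
# can be annotated. [A-Za-z] is an assumed, stricter letter class.
pattern = re.compile(r"""
    ^            # start of the string
    ([A-Za-z])   # group 1: the first letter (reference amino acid)
    .*           # anything in between (the position)
    \1           # backreference: the same character group 1 captured
    $            # end of the string
""", re.VERBOSE)

# Check a few variants from the question
results = {v: bool(pattern.search(v)) for v in ["V317V", "V179G", "g-154a", "L533L"]}
for variant, is_same in results.items():
    print(variant, is_same)
```

Here 'V317V' and 'L533L' match because the last character repeats the captured first letter, while 'V179G' and 'g-154a' do not.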

ADD REPLY • link 2.4 years ago by iraun 6.2k