Hi All,
I am sure this is probably very simple, but I am getting confused about how to define the pattern I want to delete from my dataframe.
In my dataframe I have a column called 'INFO:Variant', and within this column are SNPs (reference amino acid, position and alternative amino acid), e.g. 'V179G'. However, some SNPs in this dataframe have the same reference and alternative amino acid (e.g. 'V317V'), and I need to remove all of these. I'm sure I need to do something like the code below, but I'm not sure what the pattern would be, as the ref and alt amino acids vary throughout. Is there some way to define the pattern as a pair of the same letter of the alphabet flanking numbers?
patternDel = r"[A-Z]\d[A-Z]"
mask = df1['INFO:Variant'].str.contains(patternDel)
df2 = df1[~mask]
I have just selected a handful of random rows from my table here, but I need to remove rows 3, 8, 13 and 17:
           POS  ID  REF  ALT  QUAL  FILTER  INFO:Gene  INFO:Variant
CHROM
ChrI   3987274   .    t    c     .       .        ddn         I144T
ChrI   2715280   .    a    c     .       .        eis          L18R
ChrI   4244183   .    g    c     .       .       embA         V317V
ChrI   4247553   .    g    c     .       .       embB         S347T
ChrI   4326938   .    a    c     .       .       ethA         V179G
ChrI   1674048   .    g    a     .       .      fabG1        g-154a
ChrI   4408102   .    c    t     .       .        gid          G34E
ChrI      8898   .    c    t     .       .       gyrA         L533L
ChrI      5886   .    a    g     .       .       gyrB         D216G
ChrI   1674772   .    g    a     .       .       inhA         A191T
ChrI   2154503   .    t    c     .       .       katG         K537E
ChrI   2288914   .    c    a     .       .       pncA         D110Y
ChrI    801351   .    g    a     .       .       rplC         L181L
ChrI    760402   .    t    c     .       .       rpoB         I199T
ChrI    781373   .    g    c     .       .       rpsL        g-187c
ChrI   1472846   .    c    a     .       .        rrs        c1001a
ChrI   1918497   .    g    a     .       .       tlyA         E186E
Thanks iraun, this worked! Could you explain how the pattern
r'^[A-z]$|^([A-z]).*\1$'
matches the rows I want to remove? I tried for ages to work out the pattern, and the closest I got was "[A-Z]\d[A-Z]".
Thanks!
Sorry, the regex is a bit redundant;
r'^([A-z]).*\1$'
will suffice. You can use this to understand it :)
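For a quick way to see what the pattern does without leaving Python, here is a small sketch using only the built-in re module on the values from the table in the question. The names variants, original and strict are just for illustration. The strict pattern is my own assumption of a tighter version: as far as I know, [A-z] also covers a few punctuation characters that sit between 'Z' and 'a' in ASCII (such as '_' and '^'), so [A-Za-z] with \d+ in the middle may be closer to "same letter flanking numbers".

```python
import re

# Values copied from the INFO:Variant column in the question.
variants = [
    "I144T", "L18R", "V317V", "S347T", "V179G", "g-154a", "G34E", "L533L",
    "D216G", "A191T", "K537E", "D110Y", "L181L", "I199T", "g-187c",
    "c1001a", "E186E",
]

# ^([A-z])  first character of the string, remembered as group 1
# .*        anything in between (here, the position)
# \1$       the exact text of group 1 again, as the last character
original = re.compile(r"^([A-z]).*\1$")

# Assumed stricter variant: letters only, and only digits in between.
strict = re.compile(r"^([A-Za-z])\d+\1$")

# Keep the values where the first and last letter are the same.
same_ref_alt = [v for v in variants if original.match(v)]
strict_hits = [v for v in variants if strict.match(v)]

print(same_ref_alt)
```

The key part is the backreference \1: it does not mean "any letter", it means "whatever group 1 matched", which is why "[A-Z]\d[A-Z]" could not express "the same letter twice". To filter the dataframe itself, the same pattern should work with something like df1['INFO:Variant'].str.match(pattern, na=False) followed by df1[~mask], though I have only checked the pattern on the values above, not on a real table.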