Question

Find consecutive duplicate strings in rows from df

0

2.4 years ago

pramirez ▴ 10

I have a list of annotated protein sequences with their corresponding IDs. I am trying to write a function that detects consecutive duplicate entries in the first column (protein ID) and returns True or False. I tried this:

df = pd.read_csv('taxonomy.tsv', sep='\t', decimal='.')
value = df.iloc[:, 1].diff().lt(0)
print (value)

I obtain the following error:

TypeError: unsupported operand type(s) for -: 'str' and 'str'

Do you know how I can fix it?

Thank you.

python metagenomics pandas • 1.3k views

ADD COMMENT • link updated 2.4 years ago by zorbax ▴ 650 • written 2.4 years ago by pramirez ▴ 10

score 1 · Answer 1 · 2022-07-18

1

2.4 years ago

raphael.B ▴ 520

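# Compare each value in the column with the value in the row above it.
values = list(df.iloc[:, 1])
is_repeat = [False]  # the first row has nothing above it to compare with
for k in range(1, len(values)):
    is_repeat.append(values[k] == values[k - 1])
print(is_repeat)

This should do the trick.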
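The same comparison can probably be written without an explicit loop. This is only a sketch under assumptions: the toy DataFrame and its column names are made up to stand in for taxonomy.tsv, and the ID column is taken to be at position 1, as in the code posted in the question. Comparing the column with a copy of itself shifted down by one row flags each value that repeats the one directly above it.

```python
import pandas as pd

# Hypothetical toy data standing in for taxonomy.tsv (column names are assumptions)
df = pd.DataFrame({
    "sequence": ["MKT", "MKV", "MLA", "MQR", "MST"],
    "protein_id": ["P1", "P1", "P2", "P1", "P1"],
})

col = df.iloc[:, 1]

# shift() moves every value down one row, so eq() compares each row with the previous one.
# The first row is compared with a missing value and therefore comes out False.
is_consecutive_dup = col.eq(col.shift())
print(is_consecutive_dup.tolist())

# A single True/False answer: does the column contain any consecutive duplicate?
has_consecutive_dup = bool(is_consecutive_dup.any())
print(has_consecutive_dup)
```

Because this uses equality rather than subtraction, it also avoids the TypeError that diff() raises on string columns.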

ADD COMMENT • link 2.4 years ago by raphael.B ▴ 520

0

Hi! Thanks! I tried your method and obtained the following error: TypeError: '(slice(None, None, None), 1)' is an invalid key

ADD REPLY • link 2.4 years ago by pramirez ▴ 10

0

Sorry, I forgot the iloc: it should be df.iloc[:, 1], not df[:, 1].

ADD REPLY • link 2.4 years ago by raphael.B ▴ 520
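For anyone hitting the same "invalid key" error, here is a small sketch of the difference, again on a made-up DataFrame. Plain square brackets on a DataFrame do not accept a (rows, columns) pair, whereas iloc selects by position. The exact exception class seems to depend on the pandas version, so the example catches a generic Exception.

```python
import pandas as pd

# Hypothetical toy data (column names are assumptions)
df = pd.DataFrame({"sequence": ["MKT", "MKV"], "protein_id": ["P1", "P1"]})

try:
    df[:, 1]  # plain [] does not accept a (rows, columns) pair
    plain_indexing_failed = False
except Exception as err:  # exception class differs across pandas versions
    plain_indexing_failed = True
    print(type(err).__name__, err)

# Positional selection: all rows, column at position 1
second_column = df.iloc[:, 1]
print(second_column.tolist())
```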

score 0 · Answer 2 · 2022-07-18

0

2.4 years ago

zorbax ▴ 650

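This returns the protein ID of every row whose ID occurs more than once anywhere in the column, whether or not the repeats are consecutive: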

df[df.duplicated(['protein ID'], keep=False)]['protein ID']
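To illustrate how this differs from a strictly consecutive check, here is a minimal sketch on invented data. The 'protein ID' column name follows the snippet above, and the values are hypothetical. duplicated(keep=False) marks every occurrence of a repeated ID, while comparing against shift() marks only a value that repeats the row directly above it.

```python
import pandas as pd

# Hypothetical IDs: P1 repeats non-adjacently, P3 repeats in adjacent rows
df = pd.DataFrame({"protein ID": ["P1", "P2", "P1", "P3", "P3"]})

ids = df["protein ID"]

# Every occurrence of an ID that appears more than once, anywhere in the column
anywhere = ids.duplicated(keep=False)
print(anywhere.tolist())   # [True, False, True, True, True]

# Only values equal to the one in the previous row
adjacent = ids.eq(ids.shift())
print(adjacent.tolist())   # [False, False, False, False, True]
```

Which of the two is appropriate depends on whether non-adjacent repeats should count as duplicates.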

ADD COMMENT • link 2.4 years ago by zorbax ▴ 650