Question

Group rows in table based on column

0

Entering edit mode

2.4 years ago

pramirez ▴ 10

Hello,

I need to group the rows in a table based on a set of conditions. Here is the table: enter image description here

And I need to group it based on the following conditions:

I want to identify the rows that have the same taxonomy.
If the rows have the same taxonomy but difference of the value for column11 is more than 10^5 from the other columns, I want to drop the row. For example, in the table above, I would drop row 5 (tomato) because the value for column11 is more than five orders of magnitude different than the value from column11 for the other tomatoes.
I want to group the remaining rows by taxonomy and keep the mean of column11 as the final column 11 value per row. Here is the desired result:

enter image description here

Here is the code I have so far:

import numpy as np
import pandas as pd

df = pd.read_csv('data.csv', sep='\t', decimal='.')


l= list(df.iloc[:,0])
q = len(l) - 1
r=[False]
tax = df.taxonomy
m = 0
for k in range(0,q):
    r.append(l[k]==l[k+1])
    print('k')
    print(l[k])
    print('k+1')
    print(l[k+1])
    print(r)
    dif = df.column11[k] - df.column11[k+1]
    if ((r == 'True') AND (dif > 100000)):
         df.drop(k)

The code finds the rows that have the same taxonomy, but does not execute the "if" loop. I'd be grateful if I could have some help.

Thanks!

df python dataframe pandas • 627 views

ADD COMMENT • link updated 2.4 years ago by BlastedBadger ▴ 160 • written 2.4 years ago by pramirez ▴ 10

score 1 · Answer 1 · 2022-07-22

First, your code has syntax errors. Can you change the "if" line with:

if (r[-1] and dif > 100000):

and check again?

Similarly, print('k') will print the string k, not the value of k. Use print(k) !

Then, it's seems to me that there are conceptual errors as well. You basically want to remove outlier values from groups. Maybe search methods that are designed to spot outliers, like removing values that are distant from the mean by more than 2 standard deviations? This method I suggest might not be appropriate if you have few individuals per group though.

Other conceptual error: the check l[k]==l[k+1] will be False the first time you see "Tomato", but you also want to check that first value against others, right?

Instead of doing it manually, I encourage you to do a python tutorial and a pandas tutorial. In pandas, you can group rows that way:

df.groupby('taxonomy')

which means you could compute mean and standard deviations of the whole group, broadcasted back on each row using:

groupmeans = df.groupby('taxonomy')['column11'].transform('mean').rename('mean11')
groupstds = df.groupby('taxonomy')['column11'].transform('std').rename('std11')

and finally make your checks

indivstats = pd.concat((df, groupmeans, groupstds), axis=1)

keepme = (indivstats.column11 - indivstats.mean11).abs() / indivstats.std11 > 2

To terminate, your questions on this Q&A site need to be formatted more appropriately if you want help: please do not post images of dataframe! Post a "reproducible" example, that is the code required to generate the dataframe. Example:

df = pd.DataFrame([
    ['mouse', 3.02e-90, 'Animal', 'Muridae'],
    ['mouse', 3.02e-89, 'Animal', 'Muridae'],
    ['tomato', 3.02e-90, 'Plant', 'Solanaceae'],
    # etc
    ],
    columns=['taxonomy', 'column11', 'Kingdom', 'Family'])