Question

Filter Protein Expression Data

2.6 years ago

mropri ▴ 160

Hi guys,

I have protein expression data in a data frame df, where the rows are proteins and the columns are sample IDs holding abundance values, for example:

          Sample 1         Sample 2           Sample 3
RPH3A
CA11
AIFM1

I want to keep only those proteins that I have data for in at least 50% of the samples. Any help would be appreciated.

Filtering Proteomics • 787 views

updated 2.6 years ago by Ram 44k • written 2.6 years ago by mropri ▴ 160

score 1 · Accepted Answer · 2022-05-31

You can calculate the fraction of NAs in each row and add it as a column to your data frame in R. Then you can use awk to drop the rows in which more than 50% of the values are NA.

In R:

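# Count the missing values in one row
count_na_func = function(x) {
  sum(is.na(x))
}

# Fraction of NAs per row. Dividing by ncol(df) - 1 assumes exactly one
# non-sample column (e.g. protein names); if the names are row names, use ncol(df).
df$na_percent = apply(df, 1, count_na_func) / (ncol(df) - 1)
write.table(df, file = "datana.tsv", row.names = FALSE, sep = "\t")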

You can use some practice data to make sure that the na_percent column is computed correctly in R.
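One way to cross-check the calculation outside R is to count the NAs by hand on a tiny practice file. The sketch below is only an illustration: practice.tsv, na_check.tsv and the abundance values are made up, and it assumes a tab-separated file whose first column holds the protein names and whose missing values are written as the literal string NA.

```shell
# Hypothetical practice file: one name column followed by four sample columns.
printf 'protein\tS1\tS2\tS3\tS4\n'     > practice.tsv
printf 'RPH3A\t1.2\tNA\t0.7\tNA\n'    >> practice.tsv
printf 'CA11\tNA\tNA\tNA\t2.0\n'      >> practice.tsv
printf 'AIFM1\t3.4\t2.2\t1.1\t0.9\n'  >> practice.tsv

# Skip the header, count the NA fields in columns 2..NF,
# and print the protein name with its NA fraction.
awk -F'\t' 'NR > 1 {
  n = 0
  for (i = 2; i <= NF; i++) if ($i == "NA") n++
  printf "%s\t%.2f\n", $1, n / (NF - 1)
}' practice.tsv > na_check.tsv
```

With these made-up values, RPH3A is missing in 2 of 4 samples and CA11 in 3 of 4, so the fractions in na_check.tsv should match what the na_percent column shows for the same rows in R.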

In bash:

awk -F'\t' 'NR==1 || $12<=0.5' datana.tsv > newdata.tsv

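Note that awk numbers fields starting at 1, just as R numbers columns ($0 in awk refers to the whole line), so as long as row names are not written to the file, a column's position in the data frame is also its field number in awk. na_percent was in column 12 of my data frame, but it will probably be in a different column in yours, so substitute that column number for 12 in the awk code.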
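If you would rather not hard-code the column number, awk can look the column up by its header name. This is a minimal sketch, not a drop-in replacement: toy.tsv and toy_filtered.tsv are made-up stand-ins for datana.tsv and newdata.tsv, and it assumes a tab-separated file with a header row that contains na_percent (quoted or unquoted).

```shell
# Hypothetical toy file standing in for datana.tsv.
printf 'protein\tS1\tS2\tna_percent\n'  > toy.tsv
printf 'RPH3A\t1.2\tNA\t0.5\n'         >> toy.tsv
printf 'CA11\tNA\tNA\t1\n'             >> toy.tsv
printf 'AIFM1\t3.4\t2.2\t0\n'          >> toy.tsv

# On the header line, find the field named na_percent (ignoring any quotes
# that write.table may have added), print the header, and move on.
# On every other line, keep the row if its NA fraction is at most 0.5.
awk -F'\t' '
  NR == 1 {
    for (i = 1; i <= NF; i++) { h = $i; gsub(/"/, "", h); if (h == "na_percent") col = i }
    print
    next
  }
  col && $col + 0 <= 0.5
' toy.tsv > toy_filtered.tsv
```

Here CA11 is dropped because all of its sample values are NA, while RPH3A (exactly half missing) and AIFM1 are kept along with the header.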