Filter Protein Expression Data
2.5 years ago
mropri ▴ 160

Hi guys,

I have protein expression data in a data frame df, where the rows are proteins and the columns are sample IDs holding abundance values. For example:

          Sample 1         Sample 2           Sample 3
RPH3A
CA11
AIFM1

I want to keep only the proteins that have data in at least 50% of the samples. Any help would be appreciated.

Filtering Proteomics • 748 views
2.5 years ago
Jeremy ▴ 930

You can calculate the fraction of NAs in each row and add it as a column to your data frame in R. Then you can use awk to remove the rows in which more than 50% of the values are NA.

In R:

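# Count the missing values in one row
count_na_func = function(x) {
  sum(is.na(x))
}

# Fraction of NAs per row. ncol(df) - 1 assumes one column (e.g. protein names)
# is not a sample; use ncol(df) if every column is a sample.
df$na_percent = apply(df, 1, count_na_func) / (ncol(df) - 1)
write.table(df, file = "datana.tsv", row.names = FALSE, sep = "\t")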

You can use some practice data to make sure that the na_percent column is computed correctly in R.

In bash:

# -F'\t' splits on tabs only; NR == 1 keeps the header row
awk -F'\t' 'NR == 1 || $12 <= 0.5' datana.tsv > newdata.tsv

Note that awk numbers fields from 1, just as R numbers columns ($0 in awk is the whole line), so the column position is the same in both. na_percent was in column 12 of my data frame, but it will probably be in a different column in yours, so substitute that column number for 12 in the awk code.
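
If you would rather not look up the column number by hand, here is a minimal sketch that finds the na_percent column from the header row instead. The practice file, its sample names (S1 to S4) and its values are made up for illustration; it only assumes a tab-separated file with one header line, like the one write.table produces above.

```shell
# Hypothetical practice data in the same layout as datana.tsv
printf 'protein\tS1\tS2\tS3\tS4\tna_percent\n' > datana.tsv
printf 'RPH3A\t1.2\tNA\t3.4\t2.2\t0.25\n' >> datana.tsv
printf 'CA11\tNA\tNA\tNA\t0.7\t0.75\n' >> datana.tsv
printf 'AIFM1\t5.1\tNA\tNA\t4.0\t0.5\n' >> datana.tsv

awk -F'\t' '
NR == 1 {
    # Locate the na_percent column by name (write.table may quote header names)
    for (i = 1; i <= NF; i++) {
        name = $i
        gsub(/"/, "", name)
        if (name == "na_percent") col = i
    }
    print          # keep the header row
    next
}
# Keep rows where at most half of the values are NA
col && $col <= 0.5
' datana.tsv > newdata.tsv
```

With this practice file, newdata.tsv should contain the header plus RPH3A and AIFM1, while CA11 (75% NA) is dropped. If the header has no na_percent column, nothing after the header is printed, which makes the problem easy to spot.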


Thank you. This works perfectly.
