How to remove low abundance and less prevalent data from my dataset?
4.0 years ago

Hi friends! I have a relative abundance table in .tsv format, with samples in columns and features (pathways) in rows, something like this reproducible example. (Sorry, I could not paste the table into the question, so I am linking to it instead.)

Now I want to keep only the features (i.e. pathways) that are present with abundance >0.0001 and present in at least 10% of samples. Could you please tell me how to do that? Specifically, could you suggest a bash command for this?

Thanks, dpc

metagenome bash • 1.5k views

Hi friend. This would be easier to do in R.

present with abundance >0.0001

This is not well-defined. Is this the mean abundance across all samples?


No, no Kevin. If you sum each column you get a total of 1.00, so each cell already holds the relative abundance of the corresponding row (i.e. pathway) in that sample. No further calculation is needed; rows just need to be selectively kept or removed. I want to keep only those rows where at least 10% of the cells contain a value >0.0001.

For example, say I have 2 rows and 20 columns. In the first row, 2 cells (i.e. 10% of all cells) have values >0.0001 and the other 18 cells have values <0.0001. We will keep this row in the output table. Suppose the second row has only one value >0.0001. This row will be removed, because at least 10% of its cells, i.e. at least 2 cells, must contain values >0.0001. Thanks


Just create a boolean matrix and then count the number of TRUE and FALSE per row. Something like:

apply(mat > 0.0001, 1, table)

Then go from there.

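Even better, this will return a single boolean vector of rows (pathways) to keep, i.e. those with values > 0.0001 in at least 10% of samples. Counting with rowSums() avoids the NA that table(x)['TRUE'] gives for rows with no TRUE at all, and >= keeps rows that hit exactly 10%:

rowSums(mat > 0.0001) >= ncol(mat) / 10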
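
Since the question asked for a bash command, the same row filter can be sketched with awk. This is only a rough sketch under some assumptions: the table is tab-separated, has exactly one header line, and the pathway name is in the first column. The function name filter_features and its two optional arguments are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name and arguments are illustrative, not from the thread).
# Keeps rows where at least `min_frac` of the sample columns exceed `min_abund`.
# Assumes: tab-separated input, one header line, feature name in column 1.
filter_features() {
    local infile=$1
    local min_abund=${2:-0.0001}   # abundance threshold (strictly greater than)
    local min_frac=${3:-0.1}       # minimum fraction of samples (at least)
    awk -F'\t' -v thr="$min_abund" -v frac="$min_frac" '
        NR == 1 { print; next }        # pass the header through unchanged
        NF < 2  { next }               # skip blank or malformed lines
        {
            n = 0
            for (i = 2; i <= NF; i++)  # column 1 is the pathway name
                if ($i + 0 > thr) n++
            # NF - 1 is the number of sample columns on this line
            if (n / (NF - 1) >= frac) print
        }
    ' "$infile"
}
```

Usage would be something like `filter_features table.tsv > filtered.tsv`, where table.tsv stands in for your own file. With 20 samples and the defaults, a row needs at least 2 values above 0.0001 to be kept, which matches the example given above.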