Question

Finding important genes in a big dataset (top 20 to top 30)

3.4 years ago

sadaf ▴ 20

I have a big dataset (7000) with categorical variables (gene names) and numerical features. I want to find the most important genes based on those features, namely age (numeric) and time of exposure to harmful substances (numeric): for example, the higher the age and the longer the exposure, the higher the gene should rank. In addition, I also have up- and down-regulation data for each gene.

Could anyone suggest some methods (other than RRA) that would be suitable for this purpose?

bigdata R randomforest python Rankgenes • 1.4k views

updated 3.4 years ago by LauferVA 4.7k • written 3.4 years ago by sadaf ▴ 20

Important how? Genes that separate samples? It's not clear how you're defining "important" here or what sort of data you actually have.

3.4 years ago by jared.andrews07 ★ 18k

Based on age (numeric) and time of exposure to harmful substances (numeric): for example, the higher the age and the longer the exposure, the higher the gene should rank. In addition, I also have up- and down-regulation data for each gene.

3.4 years ago by sadaf ▴ 20

I do not believe this question can be meaningfully answered given the description provided. You mention expression only in the final sentence; otherwise we wouldn't even know that this is expression data.

Be clear. For example: "I have (RNA microarray; bulk RNA-seq; scRNA-seq) data on 7000 (people, mice) for a (complex, Mendelian, somatic) disease. I have 4 (numerical) covariates and ...."

Otherwise, I can assure you that you are unlikely to get the best answer you can.

3.4 years ago by LauferVA 4.7k

score 2 · Answer 1 · 2021-12-17

Maybe you can just order the table by age and exposure and then keep either the UP or the DOWN genes. For example, in R you can use dplyr (available from CRAN via install.packages("dplyr")) to do that. Something like:

library(dplyr)

# Import the table (or use whatever method you prefer)
gene_table <- data.table::fread("path/to/your/table/file.csv")

new_table <-
  gene_table %>%
    # age and exposure are the names of your columns (without quotation marks);
    # this sorts in descending order by age, with exposure breaking ties
    arrange(desc(age), desc(exposure)) %>%
    # keep only up-regulated genes, where gene_expression is the name of
    # the corresponding column (without quotation marks)
    filter(gene_expression == "UP")

# Select only the top 30 gene IDs (nameColumnGeneID is your gene ID column)
top_30_genes <- head(new_table, 30)$nameColumnGeneID
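
Note that sorting by age and then by exposure is lexicographic: exposure only matters when two rows have exactly the same age. If both variables should contribute to the ranking, one simple alternative (not something proposed in the thread, just a sketch) is to rank the rows on each variable separately and sum the two ranks. The toy data, the column names gene_id, age, exposure and gene_expression, and the assumption of one row per gene are all hypothetical. It is written in base R so that it runs without extra packages:

```r
# Toy data: every value and column name here is a hypothetical placeholder
genes <- data.frame(
  gene_id         = c("G1", "G2", "G3", "G4", "G5", "G6", "G7"),
  age             = c(70, 45, 75, 30, 62, 55, 80),
  exposure        = c(20, 12, 25, 2, 18, 15, 3),
  gene_expression = c("UP", "UP", "DOWN", "UP", "UP", "UP", "UP"),
  stringsAsFactors = FALSE
)

top_n_genes <- 3  # use 20 or 30 on the real table

# Keep only up-regulated genes
up <- genes[genes$gene_expression == "UP", ]

# Rank 1 = highest age / longest exposure; tied values share the lowest rank
up$age_rank      <- rank(-up$age, ties.method = "min")
up$exposure_rank <- rank(-up$exposure, ties.method = "min")

# A small rank sum means the gene is high on both variables
up$rank_sum <- up$age_rank + up$exposure_rank

# Sort by rank sum (gene_id breaks ties) and keep the top rows
ranked <- head(up[order(up$rank_sum, up$gene_id), ], top_n_genes)

print(ranked[, c("gene_id", "age", "exposure", "rank_sum")])
```

In this toy table G7 has the highest age but a short exposure, so it comes first in the lexicographic sort and only third by rank sum. Whether equal weighting of age and exposure is appropriate depends on the data, which the question does not describe in enough detail to say.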