Finding important genes in a big dataset (top 20 to top 30)
3.0 years ago
sadaf ▴ 20

I have a big dataset (7000) with categorical variables (gene names) and numerical features. I want to find the important genes based on their features: age (numeric) and time of exposure to harmful substances (numeric). For example, the higher the age and the exposure time, the higher the rank of the gene. In addition, I also have up/down-regulation data for each gene.

Could anyone suggest some methods (other than RRA) that would be suitable for this purpose?

bigdata R randomforest python Rankgenes • 1.3k views

Important how? That separate samples? It's not clear how you're defining "important" here or what sort of data you actually have.


Based on age (numeric) and time of exposure to harmful substances (numeric); for example, the higher the age and the exposure time, the higher the rank of the gene. In addition, I also have up/down-regulation data for each gene.


I do not believe this question can be meaningfully answered given the description provided. You mention expression in the final sentence, otherwise we wouldn't even know that.

Be clear, for example: "I have (RNA microarrays; bulk RNA-seq; scRNA-seq) on 7000 (people, mice) for a (complex, Mendelian, somatic) disease. I have 4 (numerical) covariates and ...."

Otherwise, I can assure you that you are unlikely to get the best possible answer.

3.0 years ago

Maybe you can just order the table by age and exposure, then select either UP or DOWN. For example, you can use dplyr in R (available from CRAN: install.packages("dplyr")) to do that. Something like:

library(dplyr)

# Import the table (or use whatever import method you prefer)
gene_table <- data.table::fread("path/to/your/table/file.csv")

new_table <- gene_table %>%
    # age and exposure are the names of your columns (without quotation marks);
    # this sorts in descending order by these two variables
    arrange(desc(age), desc(exposure)) %>%
    # now filter by gene expression, where gene_expression is the name
    # of the corresponding column (without quotation marks)
    filter(gene_expression == "UP")

# Select only the top 30 gene IDs (head() avoids NA rows if fewer than 30 remain)
top_30_genes <- head(new_table, 30)$nameColumnGeneID
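
Note that sorting by age first and exposure second means exposure only breaks ties in age. If you would rather have both variables count equally, one option is to rank each variable separately and average the two ranks. This is just a minimal sketch, not an established method: the toy data, the column names (gene_id, age, exposure, gene_expression) and the equal weighting are all assumptions made for illustration.

```r
library(dplyr)

# Toy data standing in for the real table; all column names are assumptions
gene_table <- data.frame(
    gene_id         = c("G1", "G2", "G3", "G4", "G5"),
    age             = c(70, 45, 62, 66, 30),
    exposure        = c(20, 25, 18, 1, 5),
    gene_expression = c("UP", "UP", "DOWN", "UP", "UP"),
    stringsAsFactors = FALSE
)

ranked <- gene_table %>%
    # keep only up-regulated genes
    filter(gene_expression == "UP") %>%
    mutate(
        # min_rank(desc(x)) assigns rank 1 to the largest value of x
        age_rank      = min_rank(desc(age)),
        exposure_rank = min_rank(desc(exposure)),
        # equal weighting of the two ranks is an arbitrary choice
        combined_rank = (age_rank + exposure_rank) / 2
    ) %>%
    # smallest combined rank first
    arrange(combined_rank)

# Take up to the top 30 gene IDs
top_genes <- head(ranked$gene_id, 30)
```

On this toy table, the plain arrange(desc(age), desc(exposure)) approach would put G4 ahead of G2 because of its higher age, while the averaged rank puts G2 ahead because of its much higher exposure. Which behaviour you want depends on how you define "important".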