Question

Compare elements of a vector and choose elements from vector to be eliminated in a data set

3.3 years ago

Bioinfo ▴ 20

My aim is to eliminate duplication in a data frame.

I wrote a program that finds the variables that have the same value in row 17. The program then puts these variables in a separate data frame and calculates their correlation matrix. I set the correlation threshold to 95%, so the program creates a vector containing only the names of the variables that are correlated at more than 95%.

For example, the vector contains these variable names:

>Vector

"MT91" "MT92" "MT93"

I want to use this vector to calculate the sum of these variables over all the other rows.

For example, I have this data:

Name     MT91           MT93            MT92    MT95
QC_G1    70027.02132    95774.1359      100     24
QC_G2    69578.18634    81479.29575     200     45
QC_G3    69578.18634    87021.95427     10      42545
QC_G4    68231.14338    95558.76738     1000    425
QC_G5    64874.12936    96780.77245     7000    4545
QC_G6    63866.65780    91854.35304     19      455
Ctr1     66954.38799    128861.36163    199     2424
Ctr2     97352.55229    101353.25927    155     344
Ctr3     1252.42545     115683.73755    188     3434
Bti1     81873.96379    112164.14229    1222    444
Bti2     84981.21914    0.00000         100     3443
Bti3     36629.02462    124806.49101    188     3434
Bti4     0.00000        109927.26425    122     1000
rt       13.90181       13.90586        12      13

So I want to use the vector to calculate the sum of each variable over all rows except the 17th. After that, I want to keep only the variable that has the highest sum. As you can see, my vector contains the variables "MT91", "MT92" and "MT93", and MT93 has the highest sum over the 16 rows, so I want to eliminate MT91 and MT92.

The result will be :

Name     MT93            MT95
QC_G1    95774.1359      24
QC_G2    81479.29575     45
QC_G3    87021.95427     42545
QC_G4    95558.76738     425
QC_G5    96780.77245     4545
QC_G6    91854.35304     455
Ctr1     128861.36163    2424
Ctr2     101353.25927    344
Ctr3     115683.73755    3434
Bti1     112164.14229    444
Bti2     0.00000         3443
Bti3     124806.49101    3434
Bti4     109927.26425    1000
rt       13.90586        13

Note that the vector is generated by the program, which will generate many such vectors (I'm using for loops), so I know neither the length of the vectors nor the names of the variables in advance.

Please tell me if you need any clarification. Thank you.

dataframe vectors R statistics • 786 views

updated 3.3 years ago by bioinformatics2020 ▴ 830 • written 3.3 years ago by Bioinfo ▴ 20

score 0 · Answer 1 · 2021-10-25

You could use colSums() to calculate the sums of the different variables. In your case, you don't want to use the 17th row for calculation, so you would omit it. But first, you can subset your data for the columns in Vector.

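# keep only the columns named in Vector
data_subset <- subset(data, select = Vector)
# drop row 17; a negative index also works when data has exactly 17 rows
data_subset <- data_subset[-17, , drop = FALSE]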
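
As a small check of the row-dropping step, the sketch below compares two ways of excluding row 17 on a made-up data frame (`toy` is a hypothetical name, not from the question). It assumes the data has exactly 17 rows, which is the case where the difference matters: `18:nrow(toy)` then counts down (`18:17`), so the positive-index form returns row 17 again plus an all-NA row, whereas a negative index simply drops row 17.

```r
# Hypothetical toy data: exactly 17 rows, two columns
toy <- data.frame(a = 1:17, b = 101:117)

# Positive indices: 18:nrow(toy) is 18:17, i.e. c(18, 17)
pos <- toy[c(1:16, 18:nrow(toy)), ]
nrow(pos)   # 18: rows 1-16, an NA row (index 18), then row 17

# Negative index: drops row 17 whatever the number of rows
neg <- toy[-17, , drop = FALSE]
nrow(neg)   # 16
```

If the data always has more than 17 rows, both forms give the same result; the negative index is just the safer default.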

You then want to identify the columns/variables that are NOT in Vector.

all_columns <- colnames(data)
subset_columns <- setdiff(all_columns, Vector)

You can then use colSums() on the subsetted data to find the column with the largest sum and extract its name:

column_sums <- colSums(data_subset)
# which.max() returns only the first maximum, so a tie still keeps a single column
max_col <- names(which.max(column_sums))
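
To make the column-sum step concrete, here is a minimal sketch on made-up numbers (the `toy` data frame, `vars`, and the choice of row 3 as the excluded row are all assumptions for illustration). It uses `which.max()`, which returns only the first maximum; the comparison form `which(x == max(x))` would return every tied column instead, which matters if two duplicated variables have identical sums.

```r
# Hypothetical toy data; row 3 plays the role of the excluded row
toy <- data.frame(MT91 = c(1, 2, 50, 3),
                  MT92 = c(4, 5, 50, 6),
                  MT93 = c(7, 8, 50, 9),
                  MT95 = c(0, 1, 2, 3))
vars <- c("MT91", "MT92", "MT93")   # stands in for Vector

# Sum each candidate column over every row except row 3
sums <- colSums(toy[-3, vars, drop = FALSE])
sums   # MT91 = 6, MT92 = 15, MT93 = 24

# Name of the column with the largest sum
best <- names(which.max(sums))
best   # "MT93"
```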

The only caveat is that I'm not sure whether Vector could ever contain all of the column names. If that is possible, subset_columns (the column names of data that are not in Vector) would be empty, i.e. character(0). To handle that case explicitly, you can add an if/else check:

if (identical(subset_columns, character(0))) {

  subset_columns <- max_col

} else {

  subset_columns <- c(max_col, subset_columns)
}

You can then subset the original data with the max column from Vector and the remaining columns that were not included in Vector (if there were any.)

data <- subset(data, select = subset_columns)

Altogether:

data_subset <- subset(data, select = Vector)
data_subset <- data_subset[-17, , drop = FALSE]
all_columns <- colnames(data)
subset_columns <- setdiff(all_columns, Vector)

max_col <- names(which.max(colSums(data_subset)))

if (identical(subset_columns, character(0))) {

  subset_columns <- max_col

} else {

  subset_columns <- c(max_col, subset_columns)
}

data <- subset(data, select = subset_columns)
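
Since the question mentions that many such vectors are produced inside for loops, one possible way to reuse these steps is to wrap them in a function. The sketch below is only an illustration: `keep_highest_sum`, `skip_row`, `toy`, and `groups` are hypothetical names, and the toy data skips row 3 rather than row 17. Unlike the code above, it keeps the remaining columns in their original order.

```r
# Hypothetical helper: among the columns in `vars`, keep only the one with
# the highest sum, ignoring row `skip_row` when computing the sums.
keep_highest_sum <- function(data, vars, skip_row = 17) {
  vars <- intersect(vars, colnames(data))   # ignore columns already removed
  if (length(vars) < 2) return(data)        # nothing to deduplicate
  sums <- colSums(data[-skip_row, vars, drop = FALSE], na.rm = TRUE)
  drop_cols <- setdiff(vars, names(which.max(sums)))
  data[, setdiff(colnames(data), drop_cols), drop = FALSE]
}

# Made-up data with 3 rows; row 3 is the one to skip here
toy <- data.frame(A = c(1, 2, 9), B = c(5, 5, 9),
                  C = c(3, 3, 9), D = c(0, 1, 2))
groups <- list(c("A", "B"), c("B", "C"))    # made-up correlated groups

result <- toy
for (g in groups) {
  result <- keep_highest_sum(result, g, skip_row = 3)
}
colnames(result)   # "B" "D"
```

The `intersect()` call is there because a column dropped in an earlier iteration may still appear in a later vector.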