Question

how to compare 1st row and 1st column value with rest of the row values in R?

1

Entering edit mode

7.3 years ago

blacktomato27 ▴ 70

hi o all i have data like this

Ind M1  M2  M3  M4  M5
P1  A/A Unused  G/A T/T T/T
P2  T/T A/A A/A A/A G/G
1   T/A A/A G/A T/T G/G
2   T/T A/A G/A T/T T/G
3   T/T A/A G/A T/T T/G
4   T/T A/A G/A A/T G/G
5   T/A A/A G/A A/T T/G

First:i want to check whether P1 and P2 match each other for M1 or not, if they not matched(like P1=A/A,P2=T/T) that column should be selected and write in outfile, like this i want to do for remaining columns till M5 2nd step: Now in outputfile I want to match M1 value of 1st individual to P2, if it match then 1 otherwise it will be 0, M1 value is between them like (A/T ) then it will be H and so on for M4 and M5 3rd step: Then I will sum all the 1's and H' finally my output will look like this

Ind M1  M4  M5      
P1                  
P2              SUM of 1's  SUM OF H
1   H   0   1        1                     1
2   1   0   H        1                     1
3   0   0   H        0                     1
4   1   H   1        2                     1
5   H   H   H        0                     3

I am just beginner in R but till now i am doing this work in excel with sumif and countifs functions but now i can not do this work in excel because of column limitations. can any one help me to sort out this? any help in this regard will be highly appreciated. Here is my file for better understanding dropbox.com/s/ft943u7rf6v9j4b/row%20match.xlsx?dl=0 thanks in advance

R • 7.0k views

ADD COMMENT • link 7.2 years ago by blacktomato27 ▴ 70

0

Entering edit mode

df <- read.csv('biostars.csv', stringsAsFactors = FALSE)
colNames <- c('M1', 'M2', 'M3', 'M4', 'M5')
colOut <- c()
for (colName in colNames) {
  if is.na(df[[colName]][1]) | is.na(df[[colName]][2])) { ### check for non-available data
    print(paste0('This column is not exported but it contains NA values: ', colName))
  } else {
    allelesP1 <- strsplit(df[[colName]][1], '/')[[1]]
    allelesP2 <- strsplit(df[[colName]][2], '/')[[1]]
    if (allelesP1[1] == allelesP1[2] & allelesP2[1] == allelesP2[2]) { ### both parents are homozygous
      colOut <- c(colOut, colName)
      } else {
      print(paste0('This column is not exported but it contains heteroygous parents: ', colName))
    }
  }
}

write.csv(df[, colnames(df) %in% colOut], file = 'outputStep1.csv', row.names = FALSE) # you can choose to not export row names with the 'row.names' argument

Tell me if your first output file is what you like and then I can extend my answer.

Step 2:

  dfOutput <- read.csv('outputStep1.csv', stringsAsFactors = FALSE)
rownames(dfOutput) <- c('P1', 'P2', 1:5)
for (j in 1:ncol(dfOutput)) {
  for (i in 3:7) {
    alleles <- strsplit(dfOutput[i, j], '/')[[1]]
    if (dfOutput[i, j] == dfOutput[2, j]) {
      dfOutput[i, j] <- 1
    } else if (alleles[1] != alleles[2]) {
      dfOutput[i, j] <- 'H'
    } else {
      dfOutput[i, j] <- 0
    }
  }
}

> dfOutput
    M1  M4  M5
P1 A/A T/T T/T
P2 T/T A/A G/G
1    H   0   1
2    1   0   H
3    1   0   H
4    1   H   1
5    H   H   H

C.

ADD REPLY • link 7.3 years ago by biomagician ▴ 410

0

Entering edit mode

Dear Cristian Thanks lot for your help i got this error while executing first part

Error in Ops.factor(df$M1[1], df$M2[2]) : 
  level sets of factors are different

but i solved like this

i <- sapply(df, is.factor)
df[i] <- lapply(df[i], as.character)

Now i succefully got M1 i.e single column but i am getting this error while running 2nd part saying unexpected [ and { symbols. these are the error

> colNames <- c('M1', 'M2', 'M3', 'M4', 'M5')
> for (colName in colNames) {
+   if (df[[colName][1] != df[colName][2]) { # if P1 does not match P2
Error: unexpected '[' in:
"for (colName in colNames) {
  if (df[[colName]["
>   write.csv(df$M1, file = 'outputFile.csv')
>   }
Error: unexpected '}' in "  }"
> }
Error: unexpected '}' in "}"

once again thanks lot for your help Regards

ADD REPLY • link updated 7.3 years ago by GenoMax 148k • written 7.3 years ago by blacktomato27 ▴ 70

0

Entering edit mode

I edited my answer in my first comment. Check it out and tell me where you got. There were mistakes in my first code. For example, when you fetch a column:

df[['column name']]

the i-th value of that column:

df[['column name']][i]

There are brackets missing in your blue code.

ADD REPLY • link 7.3 years ago by biomagician ▴ 410

0

Entering edit mode

Why are columns M2 and M3 discarded? P1 and P2 are different for both.

ADD REPLY • link 7.3 years ago by qwrissie • 0

0

Entering edit mode

Sorry, can you explain the problem more clearly? Do you want to output columns M2 and M3 in the first step? Also, I still haven't sorted out the heterozygote bit. I am on it.

ADD REPLY • link 7.3 years ago by biomagician ▴ 410

0

Entering edit mode

How do you define 'between them' when the parents are not homozygous or the data not available, M3 and M2 respectively.

ADD REPLY • link 7.3 years ago by biomagician ▴ 410

0

Entering edit mode

What does this mean: ' i want to check whether P1 and P2 match for M1 or not' ? Do you only want to export columns where both parents are homozygous?

ADD REPLY • link 7.3 years ago by biomagician ▴ 410

0

Entering edit mode

Dear Cristian Thanks lot for your help in solving this problem if anyone of the parent either P1 or P2 are not homozygous (i.e. like A/T etc) or the data not available i do not want them in output and i will consider them as monomorphic that is the reason M2 and M3 are not in output and I will not consider in my further calculations.

is it possible to give H (heterozygous status) for segregating population for example individual 1 and 5 in M1 column? because i will count them and use this in next calculations.

i run your code and it is generating individual files to each marker like M1. csv, M2.csv etc is it possible to get them all in one file? and i am getting following error Warning messages: 1: In if (dfOutput[i, ] == dfOutput[2, ]) { : the condition has length > 1 and only the first element will be used like this i am getting 5 erors i think one per marker once again i would like to say big thanks for you help Regards

ADD REPLY • link 7.3 years ago by blacktomato27 ▴ 70

0

Entering edit mode

The code above is now exporting the 3 columns into a single file.

ADD REPLY • link 7.3 years ago by biomagician ▴ 410

0

Entering edit mode

It's also marking the heterozygotes. I think table 2 is as you wish.

ADD REPLY • link 7.3 years ago by biomagician ▴ 410

4

Entering edit mode

7.3 years ago

Friederike 9.0k

Just to show you that there are always numerous ways to do the same stuff, here's a solution that feels a little more like typical R code to me -- nothing wrong with Christian's answer though, consider my example for educational purposes.

You will find that R table formats will differ from what you find useful in Excel. Generally speaking, tables will be longer, rather than wide. So my first lines of codes mostly deal with formatting, you can easily do that in Excel if you prefer.

library(reshape2)
library(magrittr) # this is a useful package that introduces %>% for stringing commands together

# I saved your table above in a file called "test.txt"
test <- read.table("~/Downloads/test.txt", header = TRUE, stringsAsFactors = FALSE)

test
  Ind  M1     M2  M3  M4  M5
1  P1 A/A Unused G/A T/T T/T
2  P2 T/T    A/A A/A A/A G/G
3   1 T/A    A/A G/A T/T G/G
4   2 T/T    A/A G/A T/T T/G
5   3 T/T    A/A G/A T/T T/G
6   4 T/T    A/A G/A A/T G/G
7   5 T/A    A/A G/A A/T T/G

I'll make a separate table just for the parents.

parents <- as.data.frame(t(test[1:2,-1]))
names(parents) = c("P1","P2")

# replacing "unused" with NA because NA is a native identifier of missing data
parents$P1 <- gsub("Unused", NA, parents$P1)
parents$P2 <- gsub("Unused", NA, parents$P2)

# now, I'm adding separate columns for the individual alleles of P1 and P2
# gsub has the syntax: gsub(pattern to be replaced, replacement, string to operate on)
parents$P1.all1 <- gsub("\\/.","", parents$P1)  # this replaces the / and the letter after it with nothing
parents$P1.all2 <- gsub(".\\/","", parents$P1) # this replaces the / and the letter BEFORE it with nothing

parents$P2.all1 <- gsub("\\/.","", parents$P2) 
parents$P2.all2 <- gsub(".\\/","", parents$P2) 

parents
     P1  P2 P1.all1 P1.all2 P2.all1 P2.all2
M1  A/A T/T       A       A       T       T
M2 <NA> A/A    <NA>    <NA>       A       A
M3  G/A A/A       G       A       A       A
M4  T/T A/A       T       T       A       A
M5  T/T G/G       T       T       G       G
>

I also make a separate table for the offspring:

offspring <- test[-c(1:2),] # removing the lines corresponding to the parents
offspring <- reshape2::melt(offspring, id.vars = "Ind", variable.name = "type") # changing the format using function melt of the library reshape2

head(offspring)
   Ind type value
1    1   M1   T/A
2    2   M1   T/T
3    3   M1   T/T
4    4   M1   T/T
5    5   M1   T/A
6    1   M2   A/A

# like for the parents, I add separate columns for allele 1 and 2
offspring$all.1 <- gsub("\\/.","", offspring$value) 
offspring$all.2 <- gsub(".\\/","", offspring$value)

Now, let's get to work.

First, identify the individual types (M1 to M5) that are relevant based on your first criterion: P1 and P2 should be homozygous and P1 should not be the same as P2.

relevant_offspring <-  subset(parents, ((P1.all1 == P1.all2) & (P2.all1 == P2.all2)) & P1 != P2 ) %>% rownames

# use those names to filter the offspring table
offspring <- subset(offspring, type %in% relevant_offspring)

Now, we need to combine the parent and offspring info again.

off_par <- merge(offspring, parents, by.x = "type", by.y = "row.names")
head(off_par)
   type Ind value all.1 all.2  P1  P2 P1.all1 P1.all2 P2.all1 P2.all2
1    M1   1   T/A     T     A A/A T/T       A       A       T       T 
2    M1   2   T/T     T     T A/A T/T       A       A       T       T 
3    M1   3   T/T     T     T A/A T/T       A       A       T       T 
4    M1   4   T/T     T     T A/A T/T       A       A       T       T 
5    M1   5   T/A     T     A A/A T/T       A       A       T       T 
6    M4   1   T/T     T     T T/T A/A       T       T       A       A

Let's add a new column with your code for match, no match, heterozygous. I'm going to use the ifelse() function that has the following syntax: ifelse( condition, what should happen if condition is TRUE, what should happen if condition is FALSE).

off_par$code <- with(off_par, ifelse(value == P2, "1", # homozygous match
                                ifelse( all.1 == all.2, "0", # if it's not a match, but still homozygous, we put 0
                                    ifelse( (all.1 == P1.all1 | all.1 == P2.all1) & (all.2 == P1.all1 | all.2 == P2.all1) , "H", NA )
                                    )
                                )
                            )

The last condition seems a bit complicated, but I just want to make sure we're really only setting an "H" if we make sure that we have a proper heterozygous. There should be an NA if the individual has a completely inexplicable genotype.

The result looks similar to what Christian had once we change the format again:

off_par[c("type", "Ind","code")] %>% reshape2::dcast(., Ind~type, value.var = "code")
  Ind M1 M4 M5
1   1  H  0  1
2   2  1  0  H
3   3  1  0  H
4   4  1  H  1
5   5  H  H  H

ADD COMMENT • link 7.3 years ago by Friederike 9.0k

0

Entering edit mode

Dear Friederike Good Morning Hope you are doing good, Thanks lot for your help in solving my query. i have small query again to guess parents marker genotype based on marker data of their progeny because some times i am getting NA'a and Het (A/G,A/T,A/C,G/A,G/T,G/C, T/A,T/C,T/G,C/A,C/T,C/G) in second parent i.e. my P2 and it is making problem for me to match. i would like to know is it possible to guess my P2 for those markers showing NA's or Het calls?. Please check my examples here for better understanding. https://www.dropbox.com/scl/fi/rh6goxr8f9odkpsjuxlcv/imputation.xlsx?dl=0&rlkey=etmbtq9t8c9ba2lctk3zsiw1c Thanks in advance

ADD REPLY • link 4.4 years ago by blacktomato27 ▴ 70

0

Entering edit mode

7.3 years ago

blacktomato27 ▴ 70

Dear Both Cristian and Friederike Good Morning Thanks lot for your help and time spent to solve my problem. Both of your code working perfectly and I am getting results as per my wish Once again thanks Regards,

ADD COMMENT • link 7.3 years ago by blacktomato27 ▴ 70

0

Entering edit mode

7.2 years ago

blacktomato27 ▴ 70

Dear Both Cristian and Friederike Good Afternoon I have one more small query on your answer, will this code work if i want to repeat this analysis for next set like (different parents and individuals) like this Ind M1 M2 M3 M4 M5 1 P1 A/A Unused G/A T/T T/T 2 P2 T/T A/A A/A A/A G/G 3 1 T/A A/A G/A T/T G/G 4 2 T/T A/A G/A T/T T/G 5 3 T/T A/A G/A T/T T/G 6 4 T/T A/A G/A A/T G/G 7 5 T/A A/A G/A A/T T/G 8 P3 A/A T/T G/A T/T T/T 9 P4 T/T A/A A/A A/A G/G 10 1 T/A A/A G/A T/T G/G 11 2 T/T A/A G/A T/T T/G 12 3 T/T A/A G/A T/T T/G 13 4 T/T A/A G/A A/T G/G 14 5 T/A A/A G/A A/T T/G I want to repeat analysis with same steps like above but this time parents will be different (now is P3 and P4) and their progeny also different. Thanking you very much regards

ADD COMMENT • link 7.2 years ago by blacktomato27 ▴ 70

1

Entering edit mode

can you format the table? I don't see the problem yet. did you try the code and ran into an issue?

ADD REPLY • link 7.2 years ago by Friederike 9.0k

0

Entering edit mode

7.2 years ago

biomagician ▴ 410

Hi,

Can I recommend this course? http://cambiotraining.github.io/r-intro/

ADD COMMENT • link 7.2 years ago by biomagician ▴ 410

0

Entering edit mode

7.2 years ago

blacktomato27 ▴ 70

Dear Friederike Good Morning

Thanks lot for your time in helping to solve my query. I will follow your recommendation to make separate file for ever separate parent pair and it will easy to understand and no confusion. @cristian thanks for your recommendation to this useful course in R Once thanks lot for both of you Best regards

ADD COMMENT • link 7.2 years ago by blacktomato27 ▴ 70

score 1 · Accepted Answer · 2017-09-28

1

Entering edit mode

7.2 years ago

blacktomato27 ▴ 70

Dear Friederike

Thank you very much for your kind and prompt reply to my request. I just modified your code little bit, till last step it is working fine but finally I am unable to format off_par. Now I am sharing my data file (https://www.dropbox.com/s/tysf6p3xdin9lop/biostars.xlsx?dl=0) and code I tried here (https://www.dropbox.com/s/l4ez5ht7i3o6xmd/Friederike%20Code.txt?dl=0) here for better understanding. With kind regards,

ADD COMMENT • link 7.2 years ago by blacktomato27 ▴ 70

0

Entering edit mode

sorry, I'm still not fully on board. in your example data, you have two tables for data, one with P1 and P2 in the first two rows and another table with P3 and P4 in the first two rows. For your analysis, do you want to _combine all four parents_ or do you want to run the same analysis that we gave you above, but all that's changing is the actual _name_ of the parents, i.e., instead of asking that P1 and P2 be homozygous, you start by asking that P3 and P4 are homozygous and so on. If you were to combine them, the initial test would change. If you simply want to re-run the same analysis, I recommend you make a separate table for every pair of parents and put the code I gave before into a function:

fab_analysis <- function(file){
  # I saved your table above in a file called "test.txt"
  test <- read.table(file, header = TRUE, stringsAsFactors = FALSE)

  # making separate object just for parents
  parents <- as.data.frame(t(test[1:2,-1]))
  names(parents) = c("P1","P2")

  # replacing "unused" with NA because NA is a native identifier of missing data
  parents$P1 <- gsub("Unused", NA, parents$P1)
  parents$P2 <- gsub("Unused", NA, parents$P2)

  # now, I'm adding separate columns for the individual alleles of P1 and P2
  # gsub has the syntax: gsub(pattern to be replaced, replacement, string to operate on)
  parents$P1.all1 <- gsub("\\/.","", parents$P1)  # this replaces the / and the letter after it with nothing
  parents$P1.all2 <- gsub(".\\/","", parents$P1) # this replaces the / and the letter BEFORE it with nothing

  parents$P2.all1 <- gsub("\\/.","", parents$P2) 
  parents$P2.all2 <- gsub(".\\/","", parents$P2) 

  offspring <- test[-c(1:2),] # removing the lines corresponding to the parents
  offspring <- reshape2::melt(offspring, id.vars = "Ind", variable.name = "type") # changing the format using function melt of the library reshape2
  offspring$all.1 <- gsub("\\/.","", offspring$value) 
  offspring$all.2 <- gsub(".\\/","", offspring$value) 

  # identify offspring where both parents are homozygous, but not the same
  relevant_offspring <-  subset(parents, ((P1.all1 == P1.all2) & (P2.all1 == P2.all2)) & P1 != P2 ) %>% rownames
  offspring <- subset(offspring, type %in% relevant_offspring)

  off_par <- merge(offspring , parents, by.x = "type", by.y = "row.names")

  off_par$code <- with(off_par, ifelse(value == P2, "1", # homozygous match
                                       ifelse( all.1 == all.2, "0", # if it's not a match, but still homozygous, we put 0
                                               ifelse( (all.1 == P1.all1 | all.1 == P2.all1) & (all.2 == P1.all1 | all.2 == P2.all1) , "H", NA )
                                       )
  )
  )

  return(off_par[c("type", "Ind","code")] %>% reshape2::dcast(., Ind~type, value.var = "code"))
}

This function expects exactly the type of table you posted in your very first post. It does not care, however, what the names of the parents or the children are, i.e., whether there's P1 and P2 or P4 and P5 or NewParent1 and NewParent2, it won't matter as long as the _parents are always in the first two rows_. It will also always compare the genotypes of the children to whatever parent is given in the second row of the table.

For the example, I randomly renamed both children and parents and the results did not change.

> fab_analysis("~/Downloads/test.txt")
  Ind M10 M4 M5
1   1   H  0  1
2   2   1  0  H
3   3   1  0  H
4   4   1  H  1
5   5   H  H  H

ADD REPLY • link 7.2 years ago by Friederike 9.0k