Convert numeric alleles into letters
3
1
Entering edit mode
7.9 years ago

Given the files in HAP/LEGEND/SAMPLE format used by SHAPEI2, I intend to feed them into REHH2 R package. However, the alleles in HAP file is coded as 0, 1. The REHH hyplotype input file has alleles coded as A, C, T, and G. Is there any solution to convert numeric alleles into letters? Thank you in advance for ideas, solutions and questions.

GWAS SHAPEIT2 REHH • 5.1k views
ADD COMMENT
0
Entering edit mode

I followed your link to get a better idea about your data. So what you have is e.g.:

0 0 1 0 0 0 1 1
0 1 1 0 0 1 0 1
0 1 1 0 1 1 1 1

Which you want to convert back to:

               SNP1  SNP2  SNP3
CEU1   hap1     A     T     A
CEU1   hap2     A     C     T
CEU2   hap1     G     C     T
CEU2   hap2     A     T     A
GBR1   hap1     A     T     T
GBR1   hap2     A     C     T
YRI1   hap1     G     T     T
YRI1   hap2     G     C     T

Do you also know the reference and alternative allele per position?

ADD REPLY
0
Entering edit mode

Haven't tried the new REHH2 R package, but selscan https://github.com/szpiech/selscan was much faster than the previous version. Also, it can read the output files from SHAPEIT directly without any modifications.

ADD REPLY
1
Entering edit mode
7.9 years ago

Thanks a lot for the comments! I figured out the solution. In R the haplotype numerical file (hap) can be converted into literal given the LEGEND (map) as follow:

library(data.table)
haps <- haps[, lapply(.SD, function(x) {
  ifelse(x == 0, map[, a0], map[, a1])
})]
ADD COMMENT
0
Entering edit mode

Hi,

What is haps, .SD and map in your code?

ADD REPLY
0
Entering edit mode

map is data table containing data from *.legend file. haps is data table containing data from *.haplotypes file.

*.legend and *.haplotypes are produced by SHAPEIT2.0 as a result of phasing.

.SD is a symbol used specifically in data.table R package. It means "Subset of Data.table". In the code above it results in applying function(x) to each column of hap.

ADD REPLY
0
Entering edit mode

Thank you for your explanation. I have never used SHAPEIT before. I have a data set in the following way(it is just small part of my dataset) in R:

        134788398   134792091   134793655   134794177

NA06989_A   A   A   G   G
NA06989_B   A   A   G   G
NA10850_A   G   A   G   G
NA10850_B   G   G   A   G
NA06984_A   G   G   A   G
NA06984_B   A   A   G   G
NA11917_A   G   A   A   T
NA11917_B   A   A   G   G

I need to obtain data frames etc. which have same content with hap and legend file to use in R. I hope I can explain clearly. Do you have any idea how can I figure out it? Thank you very much.

ADD REPLY
0
Entering edit mode

Ok, great! The genotypes in HAP file are classified into first and second alleles (a0, a1). In order to convert the data set you provided into HAP format, we need to know the classifier. The other words, the rule by which the alleles are marked as "first" or "second" ones should be provided.

Sometimes, the classification is made on the base of allele frequency in the population. For example, the most two frequent alleles become "first" and "second" one, respectively.

Any ideas on which allele will be the first or second? If there is no clue, describe the final step of your research. I'll try to figure out. Feel free to send me a personal message.

ADD REPLY
0
Entering edit mode

In my case, I think it doesn't matter. I am trying to order haplotypes by the focal SNP. As a result, I will try to obtain a graph which identifies similarities between populations. For example, I can code in that way:

A and T ==> 0 G and C ==> 1

ADD REPLY
1
Entering edit mode
 ## Script converts input data set into HAP and LEGEND formats.
## The file formats are described here https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#haplegsample
## The alleles are classified according to the rule: A-T ==> 0, G-C ==> 1
## We assume that there is no missing genotypes

# Toy data set
data = "
134788398   134792091   134793655   134794177

NA06989_A   A   A   G   G
NA06989_B   A   A   G   G
NA10850_A   G   A   G   G
NA10850_B   G   G   A   G
NA06984_A   G   G   A   G
NA06984_B   A   A   G   G
NA11917_A   G   A   A   T
NA11917_B   A   A   G   G"

# Read genotypes
hap <- read.table(text = data)

# Get a pair of unique genotypes for each SNP
legend <- apply(hap, 2, function(x) {
  unique(x)
}) 

# Since HAP format assumes SNPs to be in rows and alleles in columns,
# transpose the data frame with genotypes
legend <- t(legend)

# Classify genotypes
legend <- apply(legend, 1, function(x) {
  if(x[1] %in% c("A", "T")) {
    a0 <- x[1]
    a1 <- x[2]}
  else {
    a0 <- x[2]
    a1 <- x[1]
  }
  c(a0, a1)
  })

# Transpose 
legend <- t(legend)

# Get SNP positions
pos <- t(read.table(text = data, nrows = 1))

# Finilize LEGEND 
legend <- data.frame(id = paste0("snp", 1:nrow(pos)),
                     position = pos,
                     a0 = legend[, 1],
                     a1 = legend[, 2])

# Save legend as space delimited with header    
write.table(legend, "data.legend", sep = " ", col.names = T, row.names = F, quote = F)

# Code genotypes 
hap <- apply(hap, 1, function(x) ifelse(x %in% c("A", "T"), 0, 1)) 

# Save hap as space delimited without column and row names    
write.table(hap, "data.hap", sep = " ", col.names = F, row.names = F)
ADD REPLY
0
Entering edit mode
 Thank you very much your great help. I will try your solution. After you ask me about clasification,
 I couldn't be sure about my classification rule. I found a legend file of my prof. 
You can see a small part of this file. Do you have any idea about his classification rule according to the following list?

rs              position        X0  X1
rs11888260  136272200   A   G
rs313518    136272822   C   T
rs313519    136273249   A   C
rs4954281   136273537   G   T
rs313526    136279258   C   T
rs313528    136279601   A   G
rs7570283   136281439   C   T
rs16831946  136281874   A   G
ADD REPLY
0
Entering edit mode

Your are welcome! It is difficult to figure out the allele classification looking at the data. Obviously, it is either task specific or not. I would check next step of pipeline. Which tool, package will further consume the data? I would read the manual of this tool regarding the requirements for the input data. I would also check the algorithm, math behind this tool.

ADD REPLY
0
Entering edit mode

How do "Classify genotypes" and "Code genotypes" parts should look like, if I want to code randomly?

ADD REPLY
0
Entering edit mode

Hi Gennady,

I am trying to run following code but it gives an error.

result<- sorted[, lapply(.SD, function(x) {
  ifelse(x == 0, legend$a0, legend$a1)
})]

Error in .subset(x, j) : invalid subscript type 'list'

How can I fix it?

Thanks

ADD REPLY
0
Entering edit mode

Could you provide a reproducible code? It should reflect how sorted and legend variable were defined. Paste also examples of data sets loaded into sorted and legend.

ADD REPLY
0
Entering edit mode

I am not sure how I can provide a reproducible code but I hope the structures of the variables help you.

> str(sorted)
'data.frame':   34 obs. of  683 variables:
 $ 134788398: int  0 0 0 1 0 0 0 1 1 0 ...
 $ 134792091: int  0 0 0 1 0 0 0 1 0 0 ...
 $ 134793655: int  0 0 0 1 0 0 0 1 1 0 ...
 $ 134794177: int  0 0 0 0 0 0 0 0 1 0 ...
 $ 134796050: int  0 0 0 0 0 0 0 0 0 0 ...
 $ 134796375: int  0 0 0 1 0 0 0 1 0 0 ...
 $ 134796448: int  0 0 0 1 1 0 0 1 1 0 ...
 $ 134796609: int  0 0 0 0 0 0 0 0 1 0 ...

> str(legend)
'data.frame':   683 obs. of  4 variables:
 $ rsID        : Factor w/ 683 levels "rs1016269","rs10164484",..: 627 329 662 367 144 143 671 670 94 50 ...
 $ position_b36: int  134788398 134792091 134793655 134794177 134796050 134796375 134796448 134796609 134797415 134798914 ...
 $ a0          : Factor w/ 4 levels "A","C","G","T": 1 1 3 3 2 1 2 3 2 1 ...
 $ a1          : Factor w/ 4 levels "A","C","G","T": 3 3 1 4 1 3 1 1 4 3 ...
ADD REPLY
0
Entering edit mode

Sorted:

    134788398   134792091   134793655   134794177   134796050
NA06984_B   0   0   0   0   0
NA07055_B   0   0   0   0   0
NA11917_B   0   0   0   0   0
NA10850_B   1   1   1   0   0
NA07022_B   0   0   0   0   0
NA11992_B   0   0   0   0   0
NA12282_A   0   0   0   0   0
NA12282_B   1   1   1   0   0
NA11917_A   1   0   1   1   0

Legend:

    rsID    position_b36    a0  a1
1   rs7575818   134788398   A   G
2   rs3791281   134792091   A   G
3   rs876887     134793655  G   A
4   rs4953912   134794177   G   T
5   rs1371128   134796050   C   A
ADD REPLY
1
Entering edit mode
7.2 years ago

sorted has class data.frame. The above code I used to convert numerical haplotypes into lateral ones works if this variable has class data.table. Not all data.table syntax can be applied for data.frame. Possible quick fix:

library(data.table)

# Convert data.frame into data.table 
sorted <- data.table(sorted)
legend <- data.table(legend)

# Convert haplotypes 
result <-  sorted[, lapply(.SD, function(x) {  
    ifelse(x == 0, legend[, a0], legend[, a1]) 
})]
ADD COMMENT
0
Entering edit mode

I see thank you very much for your interest.

It works but it assigns 1,2,3,4 instead of A,T,G,C according to legend. Why?

ADD REPLY
1
Entering edit mode
7.2 years ago

Cause legend$a0 and legend$a1 are factors. Convert them into character class.

require(data.table)

# Initilize
data1 <- "
rsID    position_b36    a0  a1
rs7575818   134788398   A   G
rs3791281   134792091   A   G
rs876887     134793655  G   A
rs4953912   134794177   G   T
rs1371128   134796050   C   A"

data2 <- "
134788398   134792091   134793655   134794177   134796050
NA06984_B   0   0   0   0   0
NA07055_B   0   0   0   0   0
NA11917_B   0   0   0   0   0
NA10850_B   1   1   1   0   0
NA07022_B   0   0   0   0   0
NA11992_B   0   0   0   0   0
NA12282_A   0   0   0   0   0
NA12282_B   1   1   1   0   0
NA11917_A   1   0   1   1   0
"

# Load data sets
legend <- read.table(text = data1, header = T)
sorted <- read.table(text = data2, header = T)

# Convert data.frame into data.table 
sorted <- data.table(sorted)
legend <- data.table(legend)

# Convert alleles into character
legend$a0 <- as.character(legend$a0)
legend$a1 <- as.character(legend$a1)

# Convert haplotypes 
result <-  sorted[, lapply(.SD, function(x) {  
  ifelse(x == 0, legend[, a0], legend[, a1]) 
})]
ADD COMMENT
0
Entering edit mode

I see ok thank you very much. I really appreciate.

ADD REPLY
0
Entering edit mode

I want to use IVAS for sQTL analysis and it accepts only allelic encoding of genotypes, so that they should be two letters of A,C,G,T

The format of my vcf file is like this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  108 139 159 265

1   73  .   C   A   40  PASS    .   GT:DP:GQ    0|0:5:40    0|0:9:40    0|0:6:38    ./.:.:.

1   83  .   T   C,A 40  PASS    .   GT:DP:GQ    1|1:5:40    1|1:9:40    0|0:8:38    ./.:.:.

I want to convert the genotype format from numeric to letters

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  108 139 159 265

1   73  .   C   A   40  PASS    .   GT:DP:GQ    CC  CC  CA  NA

1   83  .   T   C,A 40  PASS    .   GT:DP:GQ    TC  TC  TT  NA
ADD REPLY

Login before adding your answer.

Traffic: 3271 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6