Given the files in HAP/LEGEND/SAMPLE format used by SHAPEI2, I intend to feed them into REHH2 R package. However, the alleles in HAP file is coded as 0, 1. The REHH hyplotype input file has alleles coded as A, C, T, and G. Is there any solution to convert numeric alleles into letters? Thank you in advance for ideas, solutions and questions.
Haven't tried the new REHH2 R package, but selscan https://github.com/szpiech/selscan was much faster than the previous version. Also, it can read the output files from SHAPEIT directly without any modifications.
Thanks a lot for the comments! I figured out the solution. In R the haplotype numerical file (hap) can be converted into literal given the LEGEND (map) as follow:
map is data table containing data from *.legend file.
haps is data table containing data from *.haplotypes file.
*.legend and *.haplotypes are produced by SHAPEIT2.0 as a result of phasing.
.SD is a symbol used specifically in data.table R package. It means "Subset of Data.table". In the code above it results in applying function(x) to each column of hap.
Thank you for your explanation. I have never used SHAPEIT before. I have a data set in the following way(it is just small part of my dataset) in R:
134788398 134792091 134793655 134794177
NA06989_A A A G G
NA06989_B A A G G
NA10850_A G A G G
NA10850_B G G A G
NA06984_A G G A G
NA06984_B A A G G
NA11917_A G A A T
NA11917_B A A G G
I need to obtain data frames etc. which have same content with hap and legend file to use in R. I hope I can explain clearly. Do you have any idea how can I figure out it? Thank you very much.
Ok, great! The genotypes in HAP file are classified into first and second alleles (a0, a1). In order to convert the data set you provided into HAP format, we need to know the classifier. The other words, the rule by which the alleles are marked as "first" or "second" ones should be provided.
Sometimes, the classification is made on the base of allele frequency in the population. For example, the most two frequent alleles become "first" and "second" one, respectively.
Any ideas on which allele will be the first or second? If there is no clue, describe the final step of your research. I'll try to figure out. Feel free to send me a personal message.
In my case, I think it doesn't matter. I am trying to order haplotypes by the focal SNP. As a result, I will try to obtain a graph which identifies similarities between populations. For example, I can code in that way:
## Script converts input data set into HAP and LEGEND formats.
## The file formats are described here https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#haplegsample
## The alleles are classified according to the rule: A-T ==> 0, G-C ==> 1
## We assume that there is no missing genotypes
# Toy data set
data = "
134788398 134792091 134793655 134794177
NA06989_A A A G G
NA06989_B A A G G
NA10850_A G A G G
NA10850_B G G A G
NA06984_A G G A G
NA06984_B A A G G
NA11917_A G A A T
NA11917_B A A G G"
# Read genotypes
hap <- read.table(text = data)
# Get a pair of unique genotypes for each SNP
legend <- apply(hap, 2, function(x) {
unique(x)
})
# Since HAP format assumes SNPs to be in rows and alleles in columns,
# transpose the data frame with genotypes
legend <- t(legend)
# Classify genotypes
legend <- apply(legend, 1, function(x) {
if(x[1] %in% c("A", "T")) {
a0 <- x[1]
a1 <- x[2]}
else {
a0 <- x[2]
a1 <- x[1]
}
c(a0, a1)
})
# Transpose
legend <- t(legend)
# Get SNP positions
pos <- t(read.table(text = data, nrows = 1))
# Finilize LEGEND
legend <- data.frame(id = paste0("snp", 1:nrow(pos)),
position = pos,
a0 = legend[, 1],
a1 = legend[, 2])
# Save legend as space delimited with header
write.table(legend, "data.legend", sep = " ", col.names = T, row.names = F, quote = F)
# Code genotypes
hap <- apply(hap, 1, function(x) ifelse(x %in% c("A", "T"), 0, 1))
# Save hap as space delimited without column and row names
write.table(hap, "data.hap", sep = " ", col.names = F, row.names = F)
Thank you very much your great help. I will try your solution. After you ask me about clasification,
I couldn't be sure about my classification rule. I found a legend file of my prof.
You can see a small part of this file. Do you have any idea about his classification rule according to the following list?
rs position X0 X1
rs11888260 136272200 A G
rs313518 136272822 C T
rs313519 136273249 A C
rs4954281 136273537 G T
rs313526 136279258 C T
rs313528 136279601 A G
rs7570283 136281439 C T
rs16831946 136281874 A G
Your are welcome! It is difficult to figure out the allele classification looking at the data. Obviously, it is either task specific or not. I would check next step of pipeline. Which tool, package will further consume the data? I would read the manual of this tool regarding the requirements for the input data. I would also check the algorithm, math behind this tool.
Could you provide a reproducible code? It should reflect how sorted and legend variable were defined. Paste also examples of data sets loaded into sorted and legend.
rsID position_b36 a0 a1
1 rs7575818 134788398 A G
2 rs3791281 134792091 A G
3 rs876887 134793655 G A
4 rs4953912 134794177 G T
5 rs1371128 134796050 C A
sorted has class data.frame. The above code I used to convert numerical haplotypes into lateral ones works if this variable has class data.table. Not all data.table syntax can be applied for data.frame. Possible quick fix:
I want to use IVAS for sQTL analysis and it accepts only allelic encoding of genotypes, so that they should be two letters of A,C,G,T
The format of my vcf file is like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 108 139 159 265
1 73 . C A 40 PASS . GT:DP:GQ 0|0:5:40 0|0:9:40 0|0:6:38 ./.:.:.
1 83 . T C,A 40 PASS . GT:DP:GQ 1|1:5:40 1|1:9:40 0|0:8:38 ./.:.:.
I want to convert the genotype format from numeric to letters
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 108 139 159 265
1 73 . C A 40 PASS . GT:DP:GQ CC CC CA NA
1 83 . T C,A 40 PASS . GT:DP:GQ TC TC TT NA
I followed your link to get a better idea about your data. So what you have is e.g.:
Which you want to convert back to:
Do you also know the reference and alternative allele per position?
Haven't tried the new REHH2 R package, but selscan https://github.com/szpiech/selscan was much faster than the previous version. Also, it can read the output files from SHAPEIT directly without any modifications.