Coding genotypes from nucleotides to 0/1/2 in R.
3
0
Entering edit mode
9.5 years ago
devenvyas ▴ 760

I have a massive data table with dbSNP rs ids as rows and samples as columns in this kind of format

dbSNP    Sample    Sample    Sample    Sample    Sample    Sample
rs10000011    CC    CC    CC    CC    TC    TC
rs1000002    TC    TT    CC    TT    TT    TC
rs10000023    TG    TG    TT    TG    TG    TG
rs1000003    AA    AG    AG    AA    AA    AG
rs10000041    TT    TG    TT    TT    TG    GG
rs10000046    GG    GG    AG    GG    GG    GG
rs10000057    AA    AG    GG    AA    AA    AA
rs10000073    TC    TT    TT    TT    TT    TT
rs10000092    TC    TC    CC    TC    TT    TT

There are over a 1000 samples and >547,000 loci in this table from a HGDP dataset (ftp://ftp.cephb.fr/hgdp_supp10/), and I would like to do a massive Principle Component Analysis (with samples colored based on population).

In order to do that, I need to code my genotypes first. I was wondering, how would I do this (preferably in R, as the file is probably too big for JMP Genomics)?

Also, I have some spots lacking data, which are indicated by --- or 00. I am going to standardize those to NA using a find and replace script, but how do I code it so R will still be able to run the PCA. Thanks!

SNP R • 9.8k views
ADD COMMENT
1
Entering edit mode

I am not sure if R can easily handle this big dataset either. I would suggest you to use PLINK (that directly evaluates the PCA), but I am afraid you will need to create extra files to describe your data. See: https://www.cog-genomics.org/plink2/input and here https://www.cog-genomics.org/plink2/strat#pca.

ADD REPLY
0
Entering edit mode

I can run R on UF's HPC cluster though. It will be able to handle it there.

ADD REPLY
0
Entering edit mode

Anyone have any suggestions? I tried stackoverflow, but they sent me back here.

ADD REPLY
2
Entering edit mode
9.5 years ago
alesssia ▴ 580

I still believe that using PLINK is a better idea, but if you want to use R, you can write a script that, for each line of your matrix (SNP), selects the major/minor allele and translates each genotype data (CC, TC, CT, TT, NN, ...) into 0,1,2, or missing accordingly. Use apply to cycle over the matrix to have better performance.

ADD COMMENT
0
Entering edit mode

But how do I get my data into Plink in the first place?

ADD REPLY
0
Entering edit mode

You can modify your file in order to have the correct PLINK format. See https://www.cog-genomics.org/plink2/input and https://www.cog-genomics.org/plink2/formats. You may be specifically interested in the .tped file (https://www.cog-genomics.org/plink2/formats#tped). Note that you need to add extra columns (mastering shell scripting will be useful) and that you should 'separate' the alleles in your genomic data (that is "CC" should be "C C").

To recode see: https://www.cog-genomics.org/plink2/data#recode.

ADD REPLY
0
Entering edit mode

I'm not even sure plink can handle something this large. If it uses SVD then it'll need to use its own linear algebra functions, since BLAS/LAPACK use 32 bit matrix indices (at least in the fortran code, not sure about the C versions).

ADD REPLY
0
Entering edit mode

PLINK's --pca only passes a [# of samples] x [# of samples] matrix to BLAS/LAPACK, so this dataset is not too large. (I think the current limit is around 46k samples, though I haven't tested this.)

EIGENSOFT 6 has a fast PCA approximation which may work in some cases PLINK does not.

ADD REPLY
0
Entering edit mode

I am looking into Eigensoft. For my eventual, primary data application (calculating admixture) I need Eigenstrat file format for Admixtools.

For now, JMP Genomics was able to code the genotypes for just my actual samples (92 samples, 587297 SNPs) and got R to do a PCA on just those (at least I think it is), which ate through 24GB of RAM in 3 hours. I am re-running those job with 48 GB to see if that is enough; I cannot imagine how much RAM the merged dataset would take.

I got a script from support at UF HPC that was able to code the fully data set (but it excludes the sites where the two sets are reading opposite strands).

ADD REPLY
2
Entering edit mode
9.5 years ago
Jimbou ▴ 960

You can use the SNPassoc package. But if the data is big plink would be the better choice as mentioned from alesssia already.

library(SNPassoc)
data("SNPs")
d <- SNPs[,-c(1:5)]

Create a SNP Matrix

myDat<-setupSNP(d,1:ncol(d),sep="")

Use the additive function and apply to create a numerical variable.

mydatnum <- apply(myDat,2,additive)

Prior the transformation you should exchange 00 or --- to NA with something like this:

d[ d == 00 ] <- NA
ADD COMMENT
0
Entering edit mode

Many thanks to you Jimbou ( Vielen Dank) It took two weeks to find an answer how to code my genotype into numerical variables in R

Still ill test your way hoping that it works. Otherwise i want to contact with you for extra help if thats ok and possible.

Best, Halah

ADD REPLY
0
Entering edit mode
9.5 years ago
george.ry ★ 1.2k
head -1 yourbigfile \
> yournewfile \
&& paste \
    <(tail -n+2 yourbigfile | cut -f1) \
    <(tail -n+2 yourbigfile | cut -f2- | sed $(python2 -c "from itertools import product; print 's/---/NA/g;s/00/NA/g;' + ';'.join(['s/{}/{}/g'.format(''.join(p), i) for i, p in enumerate(product('ACGT', 'ACGT'))])")) \
>> yournewfile
ADD COMMENT

Login before adding your answer.

Traffic: 1951 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6