R programming question: genotyping data table manipulation
9.6 years ago
MAPK ★ 2.1k

Hi Guys,

I have two data frames, df1 and df2: df1 holds the alleles and df2 holds the genotypes. There are more than 50 samples (1:50), and df2 interleaves each sample's genotype and depth-of-coverage columns: Geno1.GT, Geno1.AD, Geno2.GT, Geno2.AD, ... Geno50.GT, Geno50.AD. How do I group the columns by matching only on "Geno" plus the sample number, ignoring the .GT and .AD suffixes (e.g., Geno1, Geno1.GT and Geno1.AD together), to get the result table below? Thank you.

df1

Geno1     Geno2     Geno3
A         A         A
C         G         C
C         A         G

df2

Geno1.GT     Geno1.AD     Geno2.GT     Geno2.AD     Geno3.GT     Geno3.AD
0/0          22,3         0/0          33,2         0/0          33,3
0/0          2,0          0/1          22,3         1/1          43,33
0/1          55,45        0/0          32,2         1/1          22,3

Result

Geno1     Geno1.GT     Geno1.AD     Geno2     Geno2.GT     Geno2.AD     Geno3     Geno3.GT     Geno3.AD
A         0/0          22,3         A         0/0          33,2         A         0/0          33,3
C         0/0          2,0          G         0/1          22,3         C         1/1          43,33
C         0/1          55,45        A         0/0          32,2         G         1/1          22,3

Are the rows of df1 and df2 matching and in the same order?


Thank you, Sean, for your reply. df1 has more columns (Geno1:Geno100 or more). Every sample in df2 is also present in df1, but not the other way around: df1 has more samples than df2, and the column order is also different.


Sean asked about the rows. If the rows are in the same order, merging is going to be much easier.


Sorry, yes: the rows are in the same order and equal in number.

9.6 years ago
seidel 11k

Assuming things are in order and you don't have to do any actual pattern matching, the answer could be as simple as:

# bind them together to form one dataframe
result.df <- cbind(df1,df2)
# rearrange the columns using a pattern of numbers
result.df <- result.df[,as.vector(t(cbind(1:50,seq(51,150,2),seq(52,150,2))))]

Because the columns of the two data frames follow a regular numeric pattern, you can write out the three index sequences and interleave them to arrange the columns as you wish. (This also assumes your first df has 50 columns and the second has twice as many; the code can easily be modified to use the actual numbers.)
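To illustrate the "use the actual number" remark, here is a minimal sketch that derives the index pattern from ncol(df1) instead of hard-coding 50 and 150. It assumes, as above, that df2 has exactly two columns per df1 column, already in matching order (GT first, then AD). The toy data are copied from the question; the names n, combined and idx are made up for this example.

```r
# toy data taken from the question
df1 <- data.frame(Geno1 = c("A", "C", "C"),
                  Geno2 = c("A", "G", "A"),
                  Geno3 = c("A", "C", "G"))
df2 <- data.frame(Geno1.GT = c("0/0", "0/0", "0/1"),
                  Geno1.AD = c("22,3", "2,0", "55,45"),
                  Geno2.GT = c("0/0", "0/1", "0/0"),
                  Geno2.AD = c("33,2", "22,3", "32,2"),
                  Geno3.GT = c("0/0", "1/1", "1/1"),
                  Geno3.AD = c("33,3", "43,33", "22,3"))

n <- ncol(df1)
# assumption: two df2 columns (GT, AD) per df1 column, in the same sample order
stopifnot(ncol(df2) == 2 * n)

combined <- cbind(df1, df2)

# rbind() stacks the three index vectors as rows; as.vector() then reads the
# matrix column by column, giving 1, n+1, n+2, 2, n+3, n+4, ...
idx <- as.vector(rbind(seq_len(n),
                       n + seq(1, 2 * n, by = 2),
                       n + seq(2, 2 * n, by = 2)))
result.df <- combined[, idx]
colnames(result.df)
```

This only rearranges by position, so it gives the wrong grouping if the columns of df2 are not already in sample order.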

edit: if you have to perform pattern matching because the columns of the data frames are not in any coordinated order, you can use a similar strategy, but it's a little more complicated. See below - first a toy example with vectors matching your names, then the code applied to data frames.

### Proof of principle using simple vectors
# create allele names
df1 <- paste("Geno", 1:10, sep="")

# create genotype names
df2 <- paste("Geno", rep(1:10, each=2), c(".GT",".AD"), sep="")

# randomly scramble df2 (do this before matching, so the indices refer to the scrambled order)
df2 <- df2[sample(length(df2))]

# find the matching genotype and depth for each allele
# (fixed=TRUE so the "." is taken literally rather than as a regex wildcard)
gt.iv <- match(df1, sub(".GT", "", df2, fixed=TRUE))
ad.iv <- match(df1, sub(".AD", "", df2, fixed=TRUE))

# combine all data
alldata <- c(df1,df2)

# order the combined data
alldata[as.vector(t(cbind(1:length(df1), gt.iv+length(df1), ad.iv+length(df1))))]

### apply the strategy to columns of dataframes called df1 & df2
# create two data frames
# alleles w/fake data
df1 <- data.frame(matrix(sample(c("0/0","1/0","0/1","1/1"),20,replace=TRUE), nrow=2, ncol=10))
colnames(df1) <- paste("Geno", 1:ncol(df1), sep="")
# genotypes w/fake data (40 values to fill the 2 x 20 matrix)
df2 <- data.frame(matrix(sample(1:100,40), nrow=2, ncol=20))
colnames(df2) <- paste("Geno", rep(1:(ncol(df2)/2), each=2), c(".GT",".AD"), sep="")

# add chaos to df2
df2 <- df2[,sample(1:ncol(df2),ncol(df2))]

# find the matching genotype and depth for each allele
# (fixed=TRUE so the "." is taken literally rather than as a regex wildcard)
gt.iv <- match(colnames(df1), sub(".GT", "", colnames(df2), fixed=TRUE))
ad.iv <- match(colnames(df1), sub(".AD", "", colnames(df2), fixed=TRUE))

# combine all data
alldata <- cbind(df1,df2)

# order the combined data
alldata <- alldata[,as.vector(t(cbind(1:ncol(df1), gt.iv+ncol(df1),ad.iv+ncol(df1))))]

Sorry, this doesn't match on the column names in df1 when reordering the columns, so the resulting order is not correct. Thank you.


I edited my answer to include pattern matching, so the column names can be in arbitrary order between the matrices.


Thank you so much Seidel, but the last line generates errors with the dataframe:

> alldata <- alldata[,as.vector(t(cbind(1:ncol(cpseq), gt.iv+ncol(cpseq),ad.iv+ncol(cpseq))))]

Error in `[.data.frame`(alldata, , as.vector(t(cbind(1:ncol(cpseq),  : 
  undefined columns selected

I'm not sure what to tell you except that you'll have to trouble-shoot and see what's going on. I edited the example to contain two fake dataframes which resemble the ones above (except my fake allele depth data - I didn't want to think that hard how to generate columns of paired, comma delimited numbers), and the code executes on copy and paste without errors. Check your intermediate steps with your actual data.
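One plausible cause of the "undefined columns selected" error, given the earlier comment that df1 contains more samples than df2: match() returns NA for any df1 column with no partner in df2, and an NA in the column index makes `[.data.frame` fail with exactly that message. The sketch below is a guess at a workaround, not a diagnosis of the actual data. It keeps only the samples that have both a .GT and a .AD column. The toy data and the names keep and idx are made up for this example.

```r
# df1 has four samples, df2 only covers Geno1 and Geno3, in scrambled order
df1 <- data.frame(Geno1 = c("A", "C"), Geno2 = c("A", "G"),
                  Geno3 = c("A", "C"), Geno4 = c("T", "T"))
df2 <- data.frame(Geno3.AD = c("33,3", "43,33"),
                  Geno1.GT = c("0/0", "0/0"),
                  Geno1.AD = c("22,3", "2,0"),
                  Geno3.GT = c("0/0", "1/1"))

# look up the full expected names; NA means "no such column in df2"
gt.iv <- match(paste0(colnames(df1), ".GT"), colnames(df2))
ad.iv <- match(paste0(colnames(df1), ".AD"), colnames(df2))

# keep only the samples that have both a GT and an AD column
keep <- !is.na(gt.iv) & !is.na(ad.iv)

alldata <- cbind(df1, df2)
idx <- as.vector(rbind(which(keep),
                       gt.iv[keep] + ncol(df1),
                       ad.iv[keep] + ncol(df1)))
result <- alldata[, idx]
colnames(result)
```

Checking anyNA(gt.iv) and anyNA(ad.iv) on the real data should show quickly whether this is what is happening.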

9.6 years ago

One-off solution:

tmpResult = cbind(df1,df2)
result = tmpResult[,order(colnames(tmpResult))]

This assumes the colnames are as given in the post.
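A caveat on sorting by name: order() on the column names compares them as text, so Geno1.AD sorts before Geno1.GT, and with ten or more samples Geno10 would typically land before Geno2. Here is a minimal sketch of one way to sort by sample number and then by a chosen suffix order, assuming every column is named GenoN, GenoN.GT or GenoN.AD. The toy data and the names cn, sample.num and suffix.rank are made up for this example.

```r
tmpResult <- data.frame(Geno1 = "A", Geno2 = "C", Geno10 = "G",
                        Geno1.GT = "0/0", Geno1.AD = "22,3",
                        Geno2.GT = "0/1", Geno2.AD = "33,2",
                        Geno10.GT = "1/1", Geno10.AD = "43,33")

cn <- colnames(tmpResult)

# numeric sample index: the digits that follow "Geno"
sample.num <- as.integer(sub("^Geno([0-9]+).*$", "\\1", cn))

# rank of the suffix: bare allele column first, then GT, then AD
suffix.rank <- match(sub("^Geno[0-9]+\\.?", "", cn), c("", "GT", "AD"))

result <- tmpResult[, order(sample.num, suffix.rank)]
colnames(result)
```

Like the one-liner above, this matches purely on names, so it works regardless of the original column order in either data frame.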
