Question

Sorting the haplotypes by similarities of SNPs in R


7.0 years ago

genogeno • 0

I have a data set that I want to sort in R in the following way. I hope I can explain it clearly.

1- Sort by the elements seen in the main column (focal SNP). This gives two chunks, one with all As and one with all Gs.
2- Then, for the first chunk, move to the -1 column and sort by the elements seen there (there are two, C and T). This breaks the first chunk into two smaller chunks: one with A in the main column and C in the -1 column, and one with A in the main column and T in the -1 column.
3- For the second chunk, move to the -1 column and do the same. I end up with two smaller chunks: one with G in the main column and C in the -1 column, and one with G in the main column and T in the -1 column.
4- Move to the +1 column and do the same. At each step, every existing chunk is partitioned into two new chunks.

In my actual data, the column names are positions (bp) and the rows are haplotypes.

I do not want to break up any row. I want to reorder the rows only, leaving the columns where they are. How can I do that?

An idea: when I did this sorting by hand, the result had the shape of a normal distribution. That is why I gave each column a weight obtained from the normal distribution function. I then computed a weighted covariance matrix (number of rows x number of rows) from the weights and the dissimilarity coefficient between rows, and ranked the data using the eigenvectors of the correlation matrix, which includes a penalty for missing data. However, I could not reproduce the result I obtained by hand. My data set is very large, so I am sharing only a small part of it.

-7  -6  -5  -4  -3  -2  -1  Main    1   2   3   4
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   T   C   G   C   T   C   G   G   G   T   G
A   C   C   A   C   C   T   A   G   A   T   G
G   C   T   G   C   T   T   G   G   G   T   G
A   C   C   A   C   C   T   G   G   A   T   G
G   C   T   G   C   T   T   G   G   G   T   G
A   C   C   A   C   C   T   G   G   A   T   G
A   C   C   A   C   C   T   G   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   G   G   G   T   G
A   C   C   A   T   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   G   C   T   T   G   A   G   C   T
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
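
For reference, the weighting idea described above the table could be sketched roughly as follows. This is only an illustration under stated assumptions, not the exact procedure used: the toy matrix, the choice sd = 1 for the weights, the plain weighted mismatch count as the dissimilarity, and classical MDS (cmdscale) as the eigenvector step are all assumptions, and missing data are not handled.

```r
# Toy haplotypes (rows) over five SNP columns; offsets are relative to the focal SNP.
hap <- rbind(
  c("C", "T", "A", "G", "A"),
  c("C", "T", "A", "G", "A"),
  c("T", "T", "G", "G", "G"),
  c("C", "C", "G", "G", "G")
)
offset <- c(-2, -1, 0, 1, 2)

# Gaussian weights centred on the focal SNP (sd = 1 is an arbitrary assumption).
w <- dnorm(offset, mean = 0, sd = 1)
w <- w / sum(w)

# Weighted mismatch dissimilarity between every pair of rows.
n <- nrow(hap)
d <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- sum(w * (hap[i, ] != hap[j, ]))
  }
}

# Classical MDS is eigenvector-based; order the rows along its first coordinate.
coord <- cmdscale(as.dist(d), k = 1)
ord <- order(coord[, 1])
print(hap[ord, ])
```

An ordering along a single eigenvector places similar rows near each other, but it is not guaranteed to reproduce the nested chunk-by-chunk order described in steps 1 to 4.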

SNP clustering haplotype • 1.9k views

ADD COMMENT • link updated 6.8 years ago by Biostar 20 • written 7.0 years ago by genogeno • 0


This will order the data frame by "Main" and then by "-1" (renamed to minus1 here). You should probably avoid using numbers as headers.

dat[order(dat$Main, dat$minus1), ]

where dat is your full data frame
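
If the numeric headers have to stay as they are, one option is to read the table with check.names = FALSE and index the columns by string. A minimal sketch, assuming tab-separated input; the inline txt string is made-up toy data standing in for the real file:

```r
# Toy tab-separated input standing in for the real file (assumption).
txt <- "-1\tMain\t1\nT\tA\tG\nC\tG\tG\nT\tG\tA\nC\tA\tG\n"

# check.names = FALSE keeps "-1" and "1" instead of mangling them to "X.1" and "X1".
# colClasses = "character" stops a column made up only of T or F from being
# parsed as logical TRUE/FALSE.
dat <- read.delim(text = txt, check.names = FALSE, colClasses = "character")

# Non-syntactic names cannot be used bare with `$`, so index by string.
sorted <- dat[order(dat[["Main"]], dat[["-1"]]), ]
print(sorted)
```

For a real file, text = txt would be replaced by the file path.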

ADD REPLY • link 7.0 years ago by christopher medway ▴ 460


Thank you! Unfortunately, it doesn't give what I want. I guess it is more complicated than that.

ADD REPLY • link 6.9 years ago by genogeno • 0


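See if this works. Here test.txt contains the table from the original post as tab-separated values:

df <- read.csv("test.txt", stringsAsFactors = FALSE, header = TRUE, sep = "\t")
# Sort by the focal SNP, then the -1 column (read in as X.1), then the +1 column (read in as X1)
df[order(df[, "Main"], df[, "X.1"], df[, "X1"]), ]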

or

df <- read.csv("test.txt", stringsAsFactors = FALSE, header = TRUE, sep = "\t")
library(dplyr)
# Same sort with dplyr
dplyr::arrange(df, Main, X.1, X1)

output:

   > df [order(df[,"Main"], df[,"X.1"],df[,"X1"]),]
       X.7 X.6 X.5 X.4 X.3 X.2 X.1 Main X1 X2 X3 X4
    1    A   C   C   A   C   C   T    A  G  A  T  G
    2    A   C   C   A   C   C   T    A  G  A  T  G
    3    A   C   C   A   C   C   T    A  G  A  T  G
    5    A   C   C   A   C   C   T    A  G  A  T  G
    11   A   C   C   A   C   C   T    A  G  A  T  G
    12   A   C   C   A   C   C   T    A  G  A  T  G
    13   A   C   C   A   C   C   T    A  G  A  T  G
    14   A   C   C   A   C   C   T    A  G  A  T  G
    15   A   C   C   A   C   C   T    A  G  A  T  G
    16   A   C   C   A   C   C   T    A  G  A  T  G
    17   A   C   C   A   C   C   T    A  G  A  T  G
    18   A   C   C   A   C   C   T    A  G  A  T  G
    19   A   C   C   A   C   C   T    A  G  A  T  G
    20   A   C   C   A   C   C   T    A  G  A  T  G
    21   A   C   C   A   C   C   T    A  G  A  T  G
    23   A   C   C   A   T   C   T    A  G  A  T  G
    24   A   C   C   A   C   C   T    A  G  A  T  G
    25   A   C   C   A   C   C   T    A  G  A  T  G
    27   A   C   C   A   C   C   T    A  G  A  T  G
    28   A   C   C   A   C   C   T    A  G  A  T  G
    29   A   C   C   A   C   C   T    A  G  A  T  G
    30   A   C   C   A   C   C   T    A  G  A  T  G
    31   A   C   C   A   C   C   T    A  G  A  T  G
    32   A   C   C   A   C   C   T    A  G  A  T  G
    33   A   C   C   A   C   C   T    A  G  A  T  G
    34   A   C   C   A   C   C   T    A  G  A  T  G
    4    A   T   C   G   C   T   C    G  G  G  T  G
    26   A   C   C   G   C   T   T    G  A  G  C  T
    6    G   C   T   G   C   T   T    G  G  G  T  G
    7    A   C   C   A   C   C   T    G  G  A  T  G
    8    G   C   T   G   C   T   T    G  G  G  T  G
    9    A   C   C   A   C   C   T    G  G  A  T  G
    10   A   C   C   A   C   C   T    G  G  A  T  G
    22   A   C   C   A   C   C   T    G  G  G  T  G
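
The same approach extends to every flanking column. Splitting each chunk again at every step is equivalent to one lexicographic sort whose keys are taken in the order Main, -1, +1, -2, +2, and so on. A minimal sketch, assuming the columns are already in genomic order and the focal column is named Main; sort_by_focal is a hypothetical helper name, and the toy matrix stands in for the real data:

```r
# Sort haplotype rows by the focal SNP, then by flanking SNPs moving outward.
# sort_by_focal is a hypothetical helper, not something from this thread.
sort_by_focal <- function(hap, focal = "Main") {
  cols <- colnames(hap)
  stopifnot(focal %in% cols)
  # Signed offset of each column from the focal column (by column position).
  offset <- seq_along(cols) - match(focal, cols)
  # Key priority: nearest columns first; at equal distance, left (-k) before right (+k).
  keys <- order(abs(offset), offset)
  # One lexicographic sort on those keys; tied rows keep their original order.
  ord <- do.call(order, unname(as.list(as.data.frame(hap)[keys])))
  hap[ord, , drop = FALSE]
}

# Toy data (assumption): three columns around the focal SNP.
hap <- rbind(
  c("T", "A", "G"),
  c("C", "G", "G"),
  c("T", "G", "A"),
  c("T", "A", "A"),
  c("C", "A", "G")
)
colnames(hap) <- c("-1", "Main", "1")

sorted <- sort_by_focal(hap)
print(sorted)
```

Because the helper works on column positions rather than names, it should also accept the df read above, where the headers became X.1, X1 and so on: sort_by_focal(df).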

ADD REPLY • link 6.8 years ago by cpad0112 21k