Question

012 genotype matrix using vcf tools (converting rows to columns)

0

Entering edit mode

8.2 years ago

Ana ▴ 200

Hello everyone,

I have created a 012 genotype matrix by using --012 command in vcf tools which give individuals in row and SNPs in columns.

I want to run smartpca and and need to convert rows to columns. Means get SNPs in rows and individuals in columns.

I have tried some shel commands like this but seems it does not work.

awk ' { 
    for (i=1; i<=NF; i++)  {
        a[NR,i] = $i
    } } NF>p { p = NF } END {    
    for(j=1; j<=p; j++) {
        str=a[1,j]
        for(i=2; i<=NR; i++){
            str=str" "a[i,j];
        }
        print str
    } }'

I wonder if anyone has any idea how to do that?

Is there any way to do this manipulation in R? but R cannot read my output genotype matrix!

Thanks

genotype-matrix vcftools • 6.3k views

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 8.2 years ago by Ana ▴ 200

0

Entering edit mode

Please provide a few lines of your matrix (with headers). Try datamash (GNU tool available in most of the linux repos) to transpose data easy. input:

  $ cat test.txt 
    Sample  id-123  id-99   id-42   id-13
    Count   1002    990 2030    599

output:

$ datamash transpose < test.txt  
Sample  Count
id-123  1002
id-99   990
id-42   2030
id-13   599

ADD REPLY • link 8.2 years ago by cpad0112 21k

score 3 · Accepted Answer · 2017-09-07

3

Entering edit mode

8.2 years ago

Kevin Blighe 90k

You just literally want to transpose the matrix?

It works for me in R Programming Language, just use the t() function:

df <- t(read.table("out.012", header=FALSE))

This will give you the 012 matrix without rownames or colnames. The use of the [-1,] sub-setting just gets rid of the uninformative header that VCFtools outputs with -012

To add the colnames, just use:

colnames(df) <- read.table("out.012.indv")[,1]

I will let you do the rownames for yourself.

ADD COMMENT • link 8.2 years ago by Kevin Blighe 90k

0

Entering edit mode

Hi @Kevin Blighe, thank you for your response. Well are you sure that you could transpose 012 genotype matrix created by vcf-tools in R? This is my out pot from vcf file (012-out)

> 0 -1  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   -1  0   0   0   0   0   1   0   0   0   0   0   0   -1  0   -1  -1  -1  0   0   -1  -1  -1

and R cannot read this file.I use this just to see

> df <- read.table(file = "012-out.txt", header=FALSE)
> head(df)

and the output doesnot make sense

> V636126 V636127 V636128 V636129 V636130 V636131 V636132 V636133
> V636134 V636135 V636136 V636137 V636138 V636139 V636140 V636141
> V636142 V636143 V636144 V636145 V636146 V636147 V636148 V636149
> V636150 V636151 V636152 V636153 V636154 V636155 V636156 V636157
> V636158 V636159 V636160 V636161 V636162 V636163 V636164 V636165
> V636166 V636167 V636168 V636169

I do not know what is wrong here and how can I solve it! I would sincerely appreciate if you could share your experience with me to get this done.

ADD REPLY • link 8.2 years ago by Ana ▴ 200

0

Entering edit mode

Yes, it genuinely works for me. I use VCFtools (0.1.15) and Microsoft R Open 3.2.5. You haven't done any conversions on the file after it is produced by VCFtools?

Your output above with the 'V' prefix is just the colnames that R automatically generates. I assume that there are 636,169 variants in your VCF?

Also, just to confirm, how many individuals are in your VCF? - 1 or more? The number of rows in the output file from VCFtools should match the number of samples in the VCF.

If you have >1 individuals, the line to read-in and transpose is:

df <- t(read.table("012-out.txt", header=FALSE))

If you just have 1 individual, then this works:

df <- t(read.table("012-out.txt", header=FALSE))

ADD REPLY • link 8.2 years ago by Kevin Blighe 90k

0

Entering edit mode

Hi @Kevin Blighe, Thanks. The the example above is just with subset of the data. yes! you are right, did you also get output with "V" prefix? By the way I found the solution, It works ... thanks so much

ADD REPLY • link 8.2 years ago by Ana ▴ 200

1

Entering edit mode

No, if I do the command exactly as I have it above (df <- t(read.table("out.012", header=FALSE))), and then run head(df), I get:

   [,1] [,2] [,3] [,4]
V1    0    1    2    3
V2    2    0    1    1
V3    2    2    2    2
V4   -1    2   -1   -1
V5    2   -1   -1   -1
V6   -1   -1   -1    2

This was a VCF file of 4 individuals and ~300,000 variants.

With a single individual, I get:

I also tried it on a VCF with >600,000 variants and it still functioned fine.

Which was the solution that worked?

ADD REPLY • link 8.2 years ago by Kevin Blighe 90k

0

Entering edit mode

So I created this little Rscript and got my desired output, thanks Kevin for sharing your experience with me:

>     #!/usr/bin/Rscript
>     
>     snps<-read.delim('wild_annuus.snps.fb.lenient.noTE_filt_DP15_GOOD_IND_ld.pruned.0.1._replaced_9.geno.table.012',header=F,row.names=1)
>     pos<-read.delim('wild_annuus.snps.fb.lenient.noTE_filt_DP15_GOOD_IND_ld.pruned.0.1.geno.table.012.pos',header=F)
>     indv<-read.delim('wild_annuus.snps.fb.lenient.noTE_filt_DP15_GOOD_IND_ld.pruned.0.1.geno.table.012.indv',header=F)
>     colnames(snps)<-paste(pos[,1],pos[,2],sep=':')
>     rownames(snps)<-indv[,1]
>     snps<-as.matrix(snps)
>     snps.convert<-t(snps)
>     
>     write.table(snps.convert, file= "snps.convert_all_pruned", col.names = FALSE, row.names = TRUE)