Question

SnpMatrix from VCF file

1

Entering edit mode

8.2 years ago

Dhana ▴ 110

Hi,

I am planning on doing eQTL analysis using the MatrixEQTL R (MatrixEQTL link) package and I need to extract genotype information and create a SnpMatrix file for that like the one given here SNP.txt file link. I have tried vcftools to create a similar looking file but to no avail, I have tried

vcf-to-tab < genotypes.vcf > requiredFormat.txt
- Sample output
vcftools --vcf genotypes.vcf --extract-FORMAT-info GT --out requiredFormat.txt
- Sample output

Is there any other way to extract genotype information from the VCF file and convert it to that format?

Thanks

Description of the SnpMatrix file - In VCF files, 0 represents the reference allele and integers greater than 0 represent the alternate alleles (i.e., 2, 3, 4 would indicate the 2nd, 3rd or 4th allele in the ALT field for a particular variant). This function only supports variants with a single alternate allele and therefore the alternate values will always be 1. Genotypes are stored in the SnpMatrix as 0, 1, 2 or 3 where 0 = missing, 1 = "0/0", 2 = "0/1" or "1/0" and 3 = "1/1". In SnpMatrix terminology, "A" is the reference allele and "B" is the risk allele. Equivalent statements to those made with 0 and 1 allele values would be 0 = missing, 1 = "A/A", 2 = "A/B" or "B/A" and 3 = "B/B".

VCF Eqtl MatrixEQTL SnpMatrix vcftools • 16k views

ADD COMMENT • link updated 14 months ago by pmiller ▴ 10 • written 8.2 years ago by Dhana ▴ 110

0

Entering edit mode

Hi, I have the same problem - I am attempting to use a vcf file to create a snp matrix formatted for analysis with MatrixEQTL.

I was not able to follow the solution below and I am not familiar with javascript. Could further details be provided for this solution?

Are there complimentary approaches using plink, vcftools, python, R or some combination thereof? Thanks

ADD REPLY • link 7.7 years ago by wa3j • 0

0

Entering edit mode

I was not able to follow the solution below

what's wrong ? https://meta.stackexchange.com/questions/147616/

ADD REPLY • link 7.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

I am not familiar with using javascript and am unclear as to how to implement the script posted. Here is a solution in R that I believe may be simpler:

library(dplyr)
library(VariantAnnotation)
library(snpStats)
row_na_cut = 0.2
fname = "sub_maf_filtered.vcf.gz"
tab <- TabixFile(fname, yieldSize=500000)
param <- ScanVcfParam(fixed="ALT", geno=c("GT", "GL"))
open(tab)
geno_list = c()
while (nrow(vcf_yield <- readVcf(tab, "hg19", param=param)))
  mat <- genotypeToSnpMatrix(vcf_yield)
  nums0 = t(as(mat$genotype, "numeric")) %>% as.data.frame
  row_rem_NA = nums0[rowSumsis.na(nums0)) < row_na_cut*ncol(nums0),]
  if(nrow(row_rem_NA)*ncol(row_rem_NA) > 0){geno_list = rbind( geno_list, row_rem_NA )
}
close(tab)

ADD REPLY • link 7.7 years ago by wa3j • 0

0

Entering edit mode

the interpreter is a java program named bioalcidae. And it is invoked using

java -jar dist/bioalcidae.jar -f your_script.js input.vcf

ADD REPLY • link 7.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

What will you do in multiallelic cases? I have some GT values like 3/3, 1/2, 4/4 etc. You think the coding should be like:

0/0 --> 1 Any other heterozygous combination [1/2, 0/1 etc] --> 2 All other homozygous combination [1/1, 3/3 etc] --> 3 But those positions are ignored when I run BioAlcidae. Can you please share your ideas about this issue? Thanks.

ADD REPLY • link 5.2 years ago by obidobi ▴ 30

3

Entering edit mode

7.2 years ago

Leandro Lima ▴ 970

This should also work (I'm assuming that everything that is not 0/0, 0/1 or 1/1 should be considered as 0):

vcf=genotypes.vcf

vcftools --vcf $vcf --extract-FORMAT-info GT --out requiredFormat

# Creating SNP.txt
awk '{$1=$1":"$2; $2=""; print $0}' requiredFormat.GT.FORMAT | perl -pe 's!0/0!1!g; s!0/1!2!g; s!1/1!2!g; s!./.!0!g; s/CHROM:POS/id/g' > SNP.txt

# Creating snpsloc.txt
echo -e "snp\tchr\tpos" > snpsloc.txt
awk 'NR > 1 {id=$1":"$2; print id"\tchr"$1"\t"$2}' requiredFormat.GT.FORMAT  >> snpsloc.txt

ADD COMMENT • link 7.2 years ago by Leandro Lima ▴ 970

0

Entering edit mode

What will you do in multiallelic cases? I have some GT values like 3/3, 1/2, 4/4 etc. You think the coding should be like:

0/0 --> 1 Any other heterozygous combination [1/2, 0/1 etc] --> 2 All other homozygous combination [1/1, 3/3 etc] --> 3 But those positions are ignored when I run your script. Can you please share your ideas about this issue? Thanks.

ADD REPLY • link 5.2 years ago by obidobi ▴ 30

0

Entering edit mode

hi, isn't there something wrong with the substitution of genotype into dosage? why is 0/0 > 1 0/1 > 2 1/1 > 2

ADD REPLY • link 5.1 years ago by afadda ▴ 20

0

Entering edit mode

I liked the idea by Lima but why substitute 1/1 by 2 again? I would substitute 1/1 by 3.

ADD REPLY • link 4.1 years ago by binodregmi30 ▴ 10

0

Entering edit mode

I liked the idea of LIma but I did not understand why 0/1 and 1/1 both substitute by 2? my preference is 1/1 substitute by 3.

ADD REPLY • link 4.1 years ago by binodregmi30 ▴ 10

0

Entering edit mode

5.2 years ago

obidobi ▴ 30

Hello!

What will you do in multiallelic cases? I have some GT values like 3/3, 1/2, 4/4 etc. You think the coding should be like:

0/0 --> 1
Any other heterozygous combination [1/2, 0/1 etc] --> 2
All other homozygous combination [1/1, 3/3 etc] --> 3

But those positions are ignored when I run BioAlcidae. Can you please share your ideas about this issue? Thanks.

Cheers, Nilay

ADD COMMENT • link 5.2 years ago by obidobi ▴ 30

0

Entering edit mode

I use bcftools norm -m - to convert multiallelics into separate rows of biallelic sites. https://www.htslib.org/doc/bcftools.html#norm

ADD REPLY • link 4.5 years ago by wkretzsch • 0

score 6 · Accepted Answer · 2017-03-13

6

Entering edit mode

8.2 years ago

Pierre Lindenbaum 166k

using bioalcidae and the following script: http://lindenb.github.io/jvarkit/BioAlcidae.html

var snp_id=0;
var samples = header.getSampleNamesInOrder();
out.print("id");
for(var i=0;i< samples.size();++i)
    {
    out.print("\t"+samples.get(i));
    }
out.println();

while(iter.hasNext()) {
 var ctx= iter.next();
 ++snp_id;
 out.print(new java.lang.Integer(snp_id));
 for(var i=0;i< samples.size();++i)
    {
    out.print("\t");
    var g= ctx.getGenotype(samples.get(i));
    if(g.isHomRef())
        {
        out.print("1");
        }
    else if(g.isHet())
        {
        out.print("2");
        }
    else if(g.isHomVar())
        {
        out.print("3");
        }
    else 
        {
        out.print("0");
        }
    }

 out.println();
 }

ADD COMMENT • link 7.7 years ago by Pierre Lindenbaum 166k

1

Entering edit mode

I know you posted this a long time ago, but I just wanted to say THANK YOU! I got this to work after a lot of time spent trying to get the right output using bcftools and VCF-Simplify (both of which I've used before, but they failed me this time). I was afraid to try and use a java tool since I haven't before, but this was not hard and it works beautifully to get a snpmatrix in the form I needed for snpReady. I'm going to keep using the jvarkit tools because there are some really amazing options there. Just wish I'd found it earlier during my several years of attempting to become a capable person in bioinformatics (despite really being a wet lab molecular biologist with evolutionary theory as my passion). ;-)

ADD REPLY • link 14 months ago by pmiller ▴ 10

0

Entering edit mode

Thank you very much Pierre. This was the solution I was looking for.

ADD REPLY • link 8.2 years ago by Dhana ▴ 110

0

Entering edit mode

Hi Pierre, I used the Bioalcidae tool and the provided script for generating the matrix, I wanted to ask is it possible to replace the first field (ID's) of the resulting matrix to the respective variant ID's from the vcf file? or is using the tools such as cut or awk to replace the ids is the only option.

ADD REPLY • link 8.2 years ago by Dhana ▴ 110

0

Entering edit mode

 out.print( ctx.hasID()?ctx.getID():".");

ADD REPLY • link 8.2 years ago by Pierre Lindenbaum 166k

1

Entering edit mode

var snp_id=0;
var samples = header.getSampleNamesInOrder();
out.print("id");
for(var i=0;i< samples.size();++i)
    {
    out.print("\t"+samples.get(i));
    }
out.println();

while(iter.hasNext()) {
var ctx= iter.next();
++snp_id;
out.print( ctx.hasID()?ctx.getID():".");
for(var i=0;i< samples.size();++i)
    {
    out.print("\t");
    var g= ctx.getGenotype(samples.get(i));
    if(g.isHomRef())
        {
        out.print("1");
        }
    else if(g.isHet())
        {
        out.print("2");
        }
    else if(g.isHomVar())
        {
        out.print("3");
        }
    else 
        {
        out.print("0");
        }
    }

  out.println();
 }

For others who might need it. Thanks Pierre.

ADD REPLY • link 8.2 years ago by Dhana ▴ 110

0

Entering edit mode

Hi@Pierre Lindenbaum. I am also facing with the same problem as @Dhana. I have a filtered vcf file and I want to create something exactly similar to SNP.txt file that @Dhana showed above. Does this script has been created for this purpose? I could not get very well what is the input format of @Dhana's file!

ADD REPLY • link 7.7 years ago by Ana ▴ 200