SNPs of a specific mouse strain
3
0
Entering edit mode
13 months ago
darklings ▴ 580

Hi, I wonder how can I get SNPs for a particular mouse strain like C57BL6. I have downloaded a mouse reference vcf from https://ftp.ebi.ac.uk/pub/databases/mousegenomes/REL-2112-v8-SNPs_Indels/mgp_REL2021_snps.rsID.vcf.gz

Its header is

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  129P2_OlaHsd    129S1_SvImJ 129S5SvEvBrd    A_J AKR_J   B10.RIII    BALB_cByJ   BALB_cJ BTBR_T+_Itpr3tf_J   BUB_BnC3H_HeH   C3H_HeJ C57BL_10J   C57BL_10SnJ C57BL_6NJ   C57BR_cdJ   C57L_J  C58_J   CAST_EiJ    CBA_J   CE_J    CZECHII_EiJ DBA_1J  DBA_2J  FVB_NJ  I_LnJ   JF1_MsJ KK_HiJ  LEWES_EiJ   LG_J    LP_J    MAMy_J  MOLF_EiJ    NOD_ShiLtJ  NON_LtJ NZB_B1NJ    NZO_HlLtJ   NZW_LacJ    PL_J    PWK_PhJ QSi3    QSi5    RF_J    RIIIS_J SEA_GnJ SJL_J   SM_J    SPRET_EiJ   ST_bJ   SWR_J   WSB_EiJ ZALENDE_EiJ

There is a C57BL_6NJ on the 24th field. I then checked

zcat mgp_REL2021_snps.rsID.vcf.gz | grep -v ^#|cut -f1-9,24|head

1   3050050 .   C   G   186.728 PASS    DP=3533;AD=3488,42;DP4=3041,447,44,1;MQ=60;CSQ=G|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,A|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||;AC=2;AN=104  GT:PL:DP:AD:GQ:FI   0/0:0,223,255:74:74,0:127:1
1   3050069 .   C   T   5422.31 PASS    DP=3674;AD=2868,796;DP4=2447,421,659,147;MQ=60;CSQ=T|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,A|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,G|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||;AC=45;AN=104 GT:PL:DP:AD:GQ:FI   0/0:0,223,255:74:74,0:127:1
1   3050076 .   C   T   33.1447 PASS    DP=3757;AD=3625,125;DP4=3030,595,96,36;MQ=60;CSQ=T|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,A|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,G|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||;AC=1;AN=104    GT:PL:DP:AD:GQ:FI   0/0:0,223,255:74:74,0:127:1
1   3050115 .   G   A   189.718 PASS    DP=4289;AD=4244,39;DP4=3231,1013,42,3;MQ=59;CSQ=A|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,C|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,T|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||;AC=2;AN=104 GT:PL:DP:AD:GQ:FI   0/0:0,255,255:104:104,0:127:1
1   3050118 .   G   A   4896.58 PASS    DP=4316;AD=3622,689;DP4=2750,872,524,170;MQ=59;CSQ=A|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,T|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,C|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||;AC=28;AN=104 GT:PL:DP:AD:GQ:FI   0/1:234,0,255:107:68,39:127:0

There are 0/0, 0/1 or 1/0, 1/1 in the GT genotype values, I think to get snps I have to fetch those with 0/1 or 1/0 in the 24th column as they mean heterozygous variants for C57BL6? Am I correct? Thanks in advance.

snp • 1.4k views
ADD COMMENT
0
Entering edit mode

Technically, the mouse reference genome is C57BL6 N, so there are no SNPs relative to the reference for that strain. In reality, however, things are of course more complicated, since there are various substrains.

ADD REPLY
1
Entering edit mode
13 months ago

Everything you listed is a SNP (Single Nucleotide Polymorphism). I don't know why they all pass with a high score even though they are mostly marked as 0/0 with 0 AD, but the sample appears to only have the last one. 1/0, 0/1, 1|0, and 0|1 indicate heterozygous variations.

ADD COMMENT
0
Entering edit mode

Ah that's because I only printed out a couple of mouse strains of my interest, there are more strains' genotypes they collected (see the latter half of header)

ADD REPLY
1
Entering edit mode
9 weeks ago
ayayron117 ▴ 10

I use bcftools to do this:

bcftools view -s C57BL_6NJ mgp_REL2021_snps.rsID.vcf.gz -Oz -o C57BL_6NJ_mgp_REL2021_snps.rsID.vcf.gz
  • view is the subcommand used for subsetting, filtering, and converting vcf/bcf files
  • -s is used for specifying the names of the samples (in this case, strains) you want to extract
  • -Oz sets the output format to a compressed VCF file (.vcf.gz)
  • -o sets the name of the output file

https://samtools.github.io/bcftools/bcftools.html#view

ADD COMMENT
0
Entering edit mode
13 months ago
darklings ▴ 580

Posting my quick workaround

#!/usr/bin/env python

import pysam

strain = 'C57BL_6NJ'

with pysam.VariantFile('mgp_REL2021_snps.rsID.vcf.gz') as vcf:
    with pysam.VariantFile(f'{strain}.mgp_REL2021_snps.rsID.vcf.gz', "w", header=vcf.header) as out:
        for v in vcf:
            strain_alleles = v.samples[strain].alleles
            if len(set(strain_alleles)) > 1:
                out.write(v)
ADD COMMENT

Login before adding your answer.

Traffic: 2062 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6