1000 genomes hg38 with dbSNP rsid
20 months ago
Vince ▴ 150

Hi,

Anyone know where I can download the latest version of 1000 Genomes, on build hg38, in VCF format (or PLINK format), that ALSO contains the dbSNP rsID in the VCF ID field?

I looked at the IGSR website, dbSNP, UCSC, etc. So far no luck. All have either '.' in the ID field or "chrom:pos:a1:a2" in the ID field.

Thanks,
Vince

1000genomes dbsnp • 2.3k views

If you can take a slightly different route, I'd recommend the gnomAD VCF, which has both 1000g and dbSNP annotations. If not, it might be easier to download the 1000g VCF and the latest dbSNP VCF, then use bcftools annotate to copy the IDs from the latter onto the former.
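
A minimal sketch of the bcftools annotate route, assuming both files are bgzip-compressed, indexed, and use the same contig names. The function name annotate_rsids and the file names in the example call are placeholders, not real download names.

```shell
# Copy rsIDs from a dbSNP VCF into the ID column of another VCF.
# Assumption: inputs are bgzip-compressed and indexed, with matching contig names.
annotate_rsids() {
    target_vcf=$1    # VCF lacking rsIDs (e.g. a 1000 Genomes file)
    dbsnp_vcf=$2     # dbSNP VCF carrying rsIDs in its ID column
    out_vcf=$3       # annotated output, written as bgzip-compressed VCF

    # -a: annotation source; -c ID: transfer only the ID column; -Oz: compressed VCF output
    bcftools annotate -a "$dbsnp_vcf" -c ID -Oz -o "$out_vcf" "$target_vcf" &&
    # build a tabix index for the new file
    bcftools index -t "$out_vcf"
}

# Example call with hypothetical file names:
# annotate_rsids 1000g.chr22.vcf.gz dbsnp.vcf.gz 1000g.chr22.rsid.vcf.gz
```

Running it per chromosome keeps each job small, which helps with files the size of the 1000g call sets.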

20 months ago
barslmn ★ 2.3k

The one on the Ensembl FTP site has rsIDs.

❯ wget -O- -q https://ftp.ensembl.org/pub/current_variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz | bcftools view -H | sed 5q
[W::bcf_hrec_check] Invalid tag name: "HGMD-PUBLIC_20204"
1       10178   rs367896724     C       CC      .       .       dbSNP_154;TSA=indel;E_Freq;E_1000G;E_TOPMed;AFR=0.4909;AMR=0.3602;EAS=0.3363;EUR=0.4056;SAS=0.4949
1       10236   rs540431307     A       AA      .       .       dbSNP_154;TSA=indel;E_Freq;E_1000G;AFR=0;AMR=0.0014;EAS=0;EUR=0;SAS=0.0051
1       10353   rs555500075     A       AA      .       .       dbSNP_154;TSA=indel;E_Freq;E_1000G;E_TOPMed;AFR=0.4788;AMR=0.4107;EAS=0.4306;EUR=0.4264;SAS=0.4192
1       10505   rs548419688     A       T       .       .       dbSNP_154;TSA=SNV;E_Freq;E_1000G;MA=T;MAF=0.0002;MAC=1;AFR=0.0008;AMR=0;EAS=0;EUR=0;SAS=0
1       10506   rs568405545     C       G,T     .       .       dbSNP_154;TSA=SNV;E_Freq;E_1000G;E_gnomAD;MA=G;MAF=0.0002;MAC=1;AFR=0.0008,0;AMR=0,0;EAS=0,0;EUR=0,0;SAS=0,0
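
To check a whole file rather than the first five records, something like the sketch below can tally what the ID column holds. It assumes a gzip- or bgzip-compressed VCF on disk; summarise_ids is a made-up helper name.

```shell
# Tally the kinds of values found in the VCF ID column.
# bgzip output is gzip-compatible, so gzip -dc can read it.
summarise_ids() {
    gzip -dc "$1" | awk -F'\t' '
        /^#/        { next }                # skip header lines
        $3 ~ /^rs/  { rs++; next }          # dbSNP-style identifier
        $3 == "."   { missing++; next }     # no identifier
                    { other++ }             # anything else, e.g. chrom:pos:a1:a2
        END { printf "rsid=%d missing=%d other=%d\n", rs, missing, other }'
}

# Example call with a hypothetical file name:
# summarise_ids 1000GENOMES-phase_3.vcf.gz
```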

Thanks! For posterity, since the genome build may be unclear to others: it is indeed hg38. The header printed by wget -O- -q https://ftp.ensembl.org/pub/current_variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz | bcftools view | head -n 5 shows release-109, which is on hg38 (see https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/).
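
For a file that is already downloaded, the same check can be sketched without bcftools by printing only the header lines that usually hint at the build. This assumes the header carries ##source, ##reference or ##assembly lines, which not every VCF does; show_build_hints is a hypothetical helper name.

```shell
# Print header lines that typically reveal the genome build or release.
show_build_hints() {
    gzip -dc "$1" | awk '
        !/^##/ { exit }                                # stop once the meta-information lines end
        /^##(source|reference|assembly)=/ { print }'   # keep only build-related lines
}

# Example call with a hypothetical file name:
# show_build_hints 1000GENOMES-phase_3.vcf.gz
```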


Actually, I need the VCF file with the individual level data, not the sites file.


In that case, you'll need to get the VCF from 1000g - that's probably going to be the only place where individual level data is available, and then annotate that VCF.
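
One thing worth checking before annotating: the two files need identical contig names (the Ensembl file above uses 1, other sources may use chr1 or something else). A rough sketch, assuming both headers declare ##contig lines; the helper names are made up for illustration.

```shell
# List contig names declared in a VCF header (##contig=<ID=...> lines).
header_contigs() {
    gzip -dc "$1" | awk '
        !/^##/ { exit }                       # header finished
        /^##contig=<ID=/ {
            sub(/^##contig=<ID=/, "")         # drop the prefix
            sub(/[,>].*$/, "")                # drop everything after the name
            print
        }' | sort
}

# Print contigs declared in the first VCF but absent from the second.
contigs_only_in_first() {
    a=$(mktemp); b=$(mktemp)
    header_contigs "$1" > "$a"
    header_contigs "$2" > "$b"
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Example call with hypothetical file names:
# contigs_only_in_first 1000g.chr22.vcf.gz dbsnp.vcf.gz
```

If this prints anything, the names differ and need harmonising first; bcftools annotate has a --rename-chrs option that takes a two-column mapping file for that purpose.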


Yeah, I had some hope that I wouldn't need to mess with doing this ...


It should be pretty straightforward. Just to save you some pain, run these on the 1000g VCF once you download it:

  1. vt decompose -s to split multi-allelics
  2. vt norm to left align and normalize indels

Although a dbSNP rsID identifies a locus rather than one specific REF/ALT pair (several alleles can share one rsID), having unique entries at the CHROM-POS-REF-ALT level will help with any downstream annotations that depend on exact ALT matches.
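
The two steps could be wrapped roughly like this. It is a sketch under assumptions: vt is on the PATH, the subcommand is spelled normalize, the reference FASTA matches the build of the VCF, and the function and file names are placeholders.

```shell
# Split multi-allelic records, then left-align and normalise indels with vt.
split_and_normalise() {
    in_vcf=$1      # input VCF (e.g. a 1000 Genomes file)
    ref_fa=$2      # reference FASTA for the same genome build
    out_vcf=$3     # normalised output
    tmp_vcf="$out_vcf.decomposed.tmp.vcf"   # intermediate file, removed afterwards

    # step 1: -s ("smart" decomposition) splits multi-allelic sites into one ALT per record
    vt decompose -s "$in_vcf" -o "$tmp_vcf" &&
    # step 2: left-align and trim alleles against the reference
    vt normalize "$tmp_vcf" -r "$ref_fa" -o "$out_vcf"
    status=$?

    rm -f "$tmp_vcf"
    return $status
}

# Example call with hypothetical file names:
# split_and_normalise 1000g.chr22.vcf.gz GRCh38.fa 1000g.chr22.norm.vcf.gz
```

Writing the intermediate to a file avoids relying on stdin handling; piping the two commands together is the more common pattern if disk space is tight.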
