Annotate genomic positions with dbSNP rsIds
2
1
Entering edit mode
5.6 years ago
Jimbou ▴ 960

Although I already found some ways to annotate genomic positions with rsIDs using e.g. UCSC table browser, I'm not happy with that since I want a one-in-all linux script taking also strand issues (flipped alleles A-T vs- T-A or switched reference alleles) into account.

What I have:

chr position ref alt
10  169560   G   T
10  171117   G   A
10  171126   G   A
10  172995   A   C
10  178499   C   T

What I want:

chr position ref alt rsID
10  169560   G   T   rsXXX
10  171117   G   A   rsXXX, rsXXX
10  171126   G   A   rsXXX
10  172995   A   C   rsXXX
10  178499   C   T   rsXXX

Thanks

dbSNP bed • 6.4k views
ADD COMMENT
8
Entering edit mode
5.6 years ago
Jimbou ▴ 960

I will write down my solution as an answer for documentation purposes. I started as Pirerre recommended, but then I used bcftools instead of GATK.

First, I created a header .txt file for the custom vcf file

##fileformat=VCFv4.0
##fileDate=09052019
##source=allchr_allvsall_sex_adjusted
##reference==GRCh37.p13
##phasing=partial
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO

Then I used awk to generate the data for vcf according the specifications (8 columns). Setting ID="." == missing, Quality to 100 and PASS for the filter for all positions. Of note my_chr_pos_alt_ref.out.gz data consists only of autosomal SNVs!

zcat my_chr_pos_alt_ref.out.gz | awk '{print $1, ".", $2, $3, $4, 100, "PASS", "AA="$3}' OFS='\t' > tmp.vcf

add the header

cat header.txt tmp.vcf > mydata.vcf
rm tmp*

zipped and indexed

bgzip mydata.vcf
tabix -p vcf mydata.vcf.gz

Finally annotated rsIDs using:

bcftools annotate \
-a 00-common_all.vcf.gz \
-c ID mydata.vcf.gz \
--output-type z \
-o mydata_dbSNP151.vcf.gz

dbSNP files from ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/

ADD COMMENT
1
Entering edit mode

Really helpful, thank you! Small addition, cause it took me embarrassingly long to notice why I was getting an error later on (otherwise worked like a charm) - but given the example header the zcat command should place the "." in the third position, i.e.,:

 zcat my_chr_pos_alt_ref.out.gz | awk '{print $1, $2, ".", $3, $4, 100, "PASS", "AA="$3}' OFS='\t' > tmp.vcf
ADD REPLY
0
Entering edit mode

Thank you so much. it worked for me!

ADD REPLY
0
Entering edit mode
ADD COMMENT
0
Entering edit mode

Thanks a lot. Started as you recommended, but switched to bcftools in the end.

ADD REPLY

Login before adding your answer.

Traffic: 2857 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6