Plotting SNP density along a chromosome from VCF files
2
6
Entering edit mode
8.1 years ago
rc16955 ▴ 90

Hi all,

I have generated VCF files from a range of samples of my study organism using a published reference genome. I now want to use these to assess where diversity is along a given chromosome; in other words I want to know the density of SNPs along chromosome 2.

One way I had thought of doing this was to split the chromosome into 10Kb chunks, count the number of SNPs in each one, and plot this in a chart. So what I want is basically:

Chr2:1-10000, SNPs = X Chr2:10000-20000, SNPs = Y etc.

But I can't figure out an efficient way of actually extracting this information from my VCF files.I am able to use the command

tabix -h myfile.vcf.gz chr2:[10Kregion] > myoutput.vcf

to extract a given region of any VCF file and then use bcftools stats to count the number of SNPs therein, but the chromosome is 16Mb long; there must be a more efficient way to do this than for me to manually extract 1,600 VCF files and examine them with bcftools stats.

I wonder if there's some software that could help with this. I have Integrated Genome Viewer, MEGA and SeaView, but if any of them have this kind of functionality, I haven't found it yet.

Otherwise, any help or advice would be much appreciated. It must be doable because I see plenty of published papers with SNP density plotted against chromosome position, but I can't figure out how to do it myself.

Thanks in advance!

snp genome • 17k views
ADD COMMENT
19
Entering edit mode
8.1 years ago
sacha ★ 2.4k

I did something similar with human genom here : https://github.com/dridk/snp_location
Briefly I create my bins ( 10000 pb) using bedtools makewindows :

bedtools makewindows -g hg19.sizes -w 10000 > windows.bed

Then I count SNP per bin using bedtools coverage

bedtools coverage -a windows.bed -b yourdata.vcf -counts > coverage.txt

Finally I plot my data using R with IdeoViz package. see https://github.com/dridk/snp_location/blob/master/plot_chrom.r .
I get the following plot showing SNP density per bin :

enter image description here

ADD COMMENT
0
Entering edit mode

I have tried using bedtools coverage and get errors at several lines of the file:

WARNING: line number 10210 of file chr21.vcf.gz has an imprecise variant, but no SVLEN specified. Defaulting to length 1.

The resulting output gives a count of 0 at every window. Can anyone help with this?

ADD REPLY
0
Entering edit mode

Hi Sacha, I also want to do something similar but I for each sample from a multisample vcf file. I tried extracting a single sample from the multi-vcf file and then followed your command but the number of variants counts for different samples remain the same (probably these are the sites where a variant has been being called across the samples). Could you guide me to extract unique variants for each individuate and then plot the variants in a bin size.

ADD REPLY
0
Entering edit mode

Greetings Sacha, just out of curiosity, I want to do something similar to your solution but for a non-model organism (in my field usually is used Circos but visually I find more pleasant your solution) is it possible to do so? And eventually how can I build the ideo file my self? Thank you in advance

ADD REPLY
5
Entering edit mode
8.1 years ago

using sqlite

 curl -s "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/ALL.wgs.integrated_sv_map_v2.20130502.svs.genotypes.vcf.gz" |\
gunzip -c | grep -v "^#" |\
awk -F '\t' 'BEGIN{printf("create table T(c text,p int); BEGIN TRANSACTION;\n");} {printf("insert into T(c,p) values(\"%s\",%d);\n",$1,$2);} END {printf("COMMIT; SELECT c,CAST(p/1E7  AS INT)*1E7, count(*) from T group by c,CAST(p/1E7  AS INT)*1E7 ; drop table T;");}' |\
sqlite3 -separator ' ' tmp.sqlite  && rm tmp.sqlite

1 0.0 245
1 10000000.0 208
1 20000000.0 191
1 30000000.0 197
1 40000000.0 211
1 50000000.0 198
1 60000000.0 159
1 70000000.0 240
1 80000000.0 194
1 90000000.0 165
1 100000000.0 220
1 110000000.0 206
1 120000000.0 15
1 140000000.0 101
1 150000000.0 202
1 160000000.0 179
1 170000000.0 181
1 180000000.0 231
1 190000000.0 260
(...)
ADD COMMENT
0
Entering edit mode

Dear Pierre Lindenbaum! I hope this message finds you well. How would I have to modify this code to have it print out the result into a file?

ADD REPLY
0
Entering edit mode

Thank you for your response. I managed using:

grep -v "^#" chr25newids.vcf | awk -F '\t' 'BEGIN{printf("create table T(c text,p int); BEGIN TRANSACTION;\n");} {printf("insert into T(c,p) values(\"%s\",%d);\n",$1,$2);} END {printf("COMMIT; SELECT c,CAST(p/1E4 AS INT)1E4, count() from T group by c,CAST(p/1E4 AS INT)*1E4;");}' | sqlite3 -separator ' ' tmp.sqlite

sqlite3 -separator ' ' tmp.sqlite "SELECT c,CAST(p/1E4 AS INT)1E4, count() from T group by c,CAST(p/1E4 AS INT)*1E4;" > somefile.txt

ADD REPLY

Login before adding your answer.

Traffic: 1679 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6