SNP density plot
1
1
Entering edit mode
7.2 years ago
krp0001 ▴ 40

Hello,

all,

I am working on SNP studies on fungus. I have over 1million snps and i have denovo assembled reference genome of 45Mb. I would like to plot density of SNPs over the genome in sliding window or /best suitable option.

*the vcf file has information on scaffolds and same for reference genome (over 1700).

Could anyone suggest few programmes that can handle larger files?

Thank you krp

SNP R sequencing Assembly • 4.3k views
ADD COMMENT
0
Entering edit mode

How large is the vcf file for 1 million SNPs?

ADD REPLY
0
Entering edit mode

Hi my vcf file size is 911M

ADD REPLY
1
Entering edit mode
7.2 years ago

BEDOPS scales to arbitrarily-sized inputs. You could use Kent utilities fetchChromSizes to make a BED file of reference genome chromosome sizes (or your own approach, if you're bringing your own genome), use vcf2bed to convert the VCF file to BED, use bedops --chop to generate windows, and use bedmap to count SNPs over sliding windows.

Generating chromosome bounds, e.g. for hg38:

$ fetchChromSizes hg38 | awk '{ print $1, "0", $2 }' > hg38.bed

Or if you have a de novo genome, you would use your technique of choice to generate a similar bounds file.

Making sliding windows, e.g. 10k windows:

$ bedops --chop 10000 hg38.bed > hg38.windows.bed

Counting SNPs over windows that are padded out 5k on both sides:

$ bedmap --echo --count --range 5000 hg38.windows.bed <(vcf2bed < snps.vcf) > hg38.windows.counts.bed

Using --range this way does counting where consecutive windows overlap by 5kb.

Once you see how the above steps are done, you can concatenate them into a one-liner:

$ fetchChromSizes hg38 | awk '{ print $1, "0", $2 }' | bedops --chop 10000 - | bedmap --echo --count --range 5000 - <(vcf2bed < snps.vcf) > hg38.windows.counts.bed

Using Unix streams in this way eliminates the time cost in making intermediate files, so this scales up for whole-genome inputs.

Once you have a file like hg38.windows.counts.bed (for whatever reference genome and window size and sliding parameters you choose) you could remake it into a Bedgraph file and then into bigWig to plot density in a UCSC genome browser instance. Or plot the signal directly with Circos or R or other tools to view density per-chromosome.

ADD COMMENT
0
Entering edit mode

I am stuck with a similar question, I've got a list of SNPs (3 columns - chr1 / start / stop). I've managed to produce the bedops windows.bed file (for my organism / bacteria - used 1000 for chop). However, I get some odd results using the bedmap count:

$ bedmap --echo --count --range 500 mybacteria.windows.bed < listofsnps.csv > mybacteria.windows.counts.bed

The results look like this:

chr1 0 1000|2
chr1 1000 2000|3
chr1 2000 3000|3

a similar... whereas the first lines in my list of snps.txt look like this:

chr1 15397 15397
chr1 20973 20973
chr 26518 26518

Any help would be appreciated, perhaps just a simple bin command in R would suffice.

ADD REPLY
1
Entering edit mode
$ bedmap --echo --count --range 500 mybacteria.windows.bed SNPs.bed > answer.bed

If listofsnps.csv is a comma-separated file, then you will need to turn that into a sorted BED file (e.g., called SNPs.bed) to make it work with the BED-mapping tool bedmap. How you do this would be up to you, since you know its structure and content.

I don't know what the structure of listofsnps.csv looks like, but you do. I'd suggest converting the file to a sorted BED file and giving it another go. Hope this helps!

ADD REPLY
1
Entering edit mode

Thank you so much! I did a stupid mistake where the csv file actually was tab/BED format - so I just had to rename to BED and remove header and it solved it.

ADD REPLY

Login before adding your answer.

Traffic: 1498 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6