Question

SNP Count from telomeres in sliding window

0

Entering edit mode

5.8 years ago

hellbio ▴ 520

Hi,

I have a list of SNP genomic coordinates:

a.txt
chr1    12345    12345
chr1    32424    32424    
chr4    23234    23234
....
....
chr22    332324    332324

I would like to find the count of SNPs i.e. from the start of the chromosomes in the genome with increasing length of the chromosomes to plot the proportion of SNPs Y-axis and Distance from the telomeres on X-axis. Could someone hint if it is possible by existing tools for eg: bedtools etc..

genome SNP • 2.1k views

ADD COMMENT • link updated 5.8 years ago by Alex Reynolds 36k • written 5.8 years ago by hellbio ▴ 520

0

Entering edit mode

Be aware that the edges of chromosomes are often low-complexity regions known for accumulating all kinds of false-positives alignments and false variant calls. Read this paper.

ADD REPLY • link 5.8 years ago by ATpoint 85k

score 0 · Answer 1 · 2019-01-30

0

Entering edit mode

5.8 years ago

Alex Reynolds 36k

One way is via the BEDOPS and UCSC Kent utilities toolkits.

First, generate chromosome bounds:

$ fetchChromSizes hg38 \
    | awk -vOFS="\t" '{ print $1, "0", $2; }' \
    | grep -vE '_' \
    | sort-bed - \
    > hg38.nuc.bed

Second, get SNPs (or use your own — replace common_20180418.starch with a sort-bed-sorted copy of a.txt, if you wanted, for instance):

$ VCF=ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/common_all_20180418.vcf.gz
$ wget -qO- ${VCF}
    | gunzip -c -
    | convert2bed --input=vcf --sort-tmpdir=${PWD} -
    | awk '{ print "chr"$0; }' -
    | starch --omit-signature -
    > common_20180418.starch

Finally, map SNPs to windows over the bounds. In the following example, we chop up the chromosome into sliding windows that are 100kb wide and staggered every 20kb. These windows are piped to bedmap, which counts the number of SNPs that overlap each window:

$ bedops --chop 100000 --stagger 20000 hg38.nuc.bed \
    | bedmap --echo --count --delim '\t' - common_20180418.starch \
    > windows_with_counts.bed

The count of SNPs over the window is in the last column of windows_with_counts.bed.

ADD COMMENT • link 5.8 years ago by Alex Reynolds 36k

0

Entering edit mode

Thank you. Does the count table reports the count of SNPs from both ends of the telomeres in each chromosome? Ideally, it is only required to get the count of SNPs for example upto 20Mb from either ends of the chromosomes. The middle regions i.e. SNPs in the centromeres of the chromosomes are not required to be reported.

ADD REPLY • link 5.8 years ago by hellbio ▴ 520

0

Entering edit mode

I admit I don't know much about telomeres. Is 20Mb correct? The UCSC Genome Browser gap positions table labels a 10kb region from the tips as telomeres:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/gap.txt.gz | gunzip -c | awk -vFS="\t" -vOFS="\t" '($8 == "telomere"){ print $2, $3, $4 }' | sort-bed - > hg38.telomeres.bed

If you believe this data, then you could window this file, albeit with smaller parameters for the sliding window:

$ bedops --chop 1000 --stagger 500 hg38.telomeres.bed | bedmap ... > windows_with_counts.bed

ADD REPLY • link 5.8 years ago by Alex Reynolds 36k

0

Entering edit mode

Here the issue is that we are working on model organism for which the telomeres are not defined and available in UCSC. Therefore, we work with an assumption of 20Mb or try 10Mb later to define the telomeres. I wonder how to do this with this assumption.

ADD REPLY • link 5.8 years ago by hellbio ▴ 520

1

Entering edit mode

You could modify the way in which you manipulate bounds:

$ fetchChromSizes hg38 \
    | awk -vOFS="\t" -vbounds=20000000 '{ printf("%s\t0\t%d\n%s\t%d\t%d\n", $1, bounds, $1, ($2 - bounds - 1), $2; }' \
    | grep -vE '_' \
    | sort-bed - \
    > hg38.custom_telomeres.nuc.bed

Replace hg38 with your build name, for the model organism you are working with.

Then use hg38.custom_telomeres.nuc.bed (or whatever it is called) with bedops --chop --stagger to make sliding windows.

ADD REPLY • link 5.8 years ago by Alex Reynolds 36k

0

Entering edit mode

It gives a syntax error at the semicolon after $2 :

awk: { printf("%s\t0\t%d\n%s\t%d\t%d\n", $1, bounds, $1, ($2 - bounds - 1), $2; } ^ syntax error

ADD REPLY • link 5.8 years ago by hellbio ▴ 520

0

Entering edit mode

I maybe forgot a trailing parenthesis:

$ fetchChromSizes hg38 \
    | awk -vFS="\t" -vOFS="\t" -vbounds=20000000 '{ printf("%s\t0\t%d\n%s\t%d\t%d\n", $1, bounds, $1, ($2 - bounds - 1), $2); }' \
    | grep -vE '_' \
    | sort-bed - \
    > hg38.custom_telomeres.nuc.bed

ADD REPLY • link 5.8 years ago by Alex Reynolds 36k

0

Entering edit mode

It still fails with the below error:

INFO: trying MySQL /usr/bin/mysql for database hg38 Non-numeric start coordinate. See line 50 in -. (remember that chromosome names should not contain spaces.)

However, if i have the chromosomes sizes in the below format is there a way without using fetchChromSizes, as my genome build is not available in mysql.

chr1    122678785
chr2    85426708
chr3    91889043
chr4    88276631

ADD REPLY • link 5.8 years ago by hellbio ▴ 520

0

Entering edit mode

Just run awk on that file directly, instead of piping in coordinates from fetchChromSizes.

ADD REPLY • link 5.8 years ago by Alex Reynolds 36k