If you're worried about disk usage, BEDOPS does set operations on compressed files in a format called Starch, which offers a higher compression ratio than bzip2 and, as mentioned, allows direct set operations, without any extraction to an intermediate file.
With BEDOPS installed, you could create a Starch file of SNPs like so:
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/snp147.txt.gz \
| gunzip -c \
| awk -v OFS="\t" '{ print $$2,$$3,($$3+1),$$5 }' \
| sort-bed - \
| starch - \
> hg19.snp147.starch
(Adjust as needed for the SNPs you need, for the reference genome you're working with.)
An uncompressed raw BED file of SNPs is roughly 5.5 GB. Compressed as a Starch file, as shown above, it should be about 5-10% of that, around 250-500 MB.
Given a sorted, tab-delimited genomic regions file regions.bed
like this:
chr1 158669597 158669598
chr11 72946311 72946312
Then you could do a bedmap
operation with the compressed SNP annotations like this:
$ bedmap --echo --echo-map-id --delim '\t' regions.bed hg19.snp147.starch
chr1 158669597 158669598
chr11 72946311 72946312 rs765534001
In this example chr1:158669597-158669598
does not associate with a SNP ID, while chr11:72946311-72946312
associates with SNP rs765534001
.
So there are data hygiene issues with your input you should fix before using this toolkit:
- Your regions should be half-open, zero-based. This means that the start and stop columns should not be equal. You can use
awk
to increment the stop field by 1, where start and stop are equal.
- Your regions should be sorted with BEDOPS
sort-bed
for set operations to run quickly and correctly.
- Your regions should start with
chr
and not Chr
. This can be fixed with awk
, as well.
you want
bedtools interesct
i don't think it would work if the SNP length is not '1'
be careful, this doesn't look like a BED interval because start==end. while the UCSC data are 0-based and use open-ended intervals.
Thank you for reply. Yes, I know that it isn't proper BED file and it complicating the problem. Is it possible to convert it into .vcf and then run some variant annotation tools such as SnpSift and SnpEff?
what about using dbsnp as VCF ? https://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/
Yes I have but the main problem is to compare it into my 'pseudo' BED file
fix the bed and use
bedtools intersect
?