Hello everybody,
I want the SNPs from the 1000 Genomes Project for exons only. The 1000 Genomes Project database includes SNPs for the whole genome. Is there an easy way to filter and download data only for exons? I don't want to spend time writing scripts to filter the data. I would be grateful if somebody could help.
Best, Ehsan
The answer is probably bedtools
What is your next step, and why do you want a subset of the 1000 Genomes data? At the moment, "I want this, and don't want to code" seems to me an unclear request.
Refer to dbSNP. In the dbSNP VCF, the kgvalidated and kgprod tags denote variants that come from the 1000 Genomes Project. Then filter by the syn, nsf, nsm, nsn, u3 and u5 tags. These tags mark coding and UTR variants with a calculated variant effect. For filtering you can use bcftools.
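To make the tag-based filter concrete, here is a minimal pure-Python sketch of the logic. It only illustrates the idea and is not the bcftools command suggested above. The tag names are copied from the post and compared case-insensitively; their exact spelling should be checked against the `##INFO` lines in the header of your dbSNP VCF. The function names and the three example records are made up for illustration.

```python
# Tag names are taken from the post above (assumption: verify spelling and
# case against the ##INFO header lines of the dbSNP VCF you actually use).
SOURCE_TAGS = {"kgvalidated", "kgprod"}
EFFECT_TAGS = {"syn", "nsf", "nsm", "nsn", "u3", "u5"}


def info_keys(info_field):
    """Return the lower-cased INFO keys of one VCF record."""
    if info_field == ".":
        return set()
    return {
        entry.split("=", 1)[0].lower()
        for entry in info_field.split(";")
        if entry
    }


def filter_by_tags(lines, source_tags=SOURCE_TAGS, effect_tags=EFFECT_TAGS):
    """Yield header lines, plus records with >=1 source tag and >=1 effect tag."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        keys = info_keys(fields[7])  # column 8 of a VCF record is INFO
        if keys & source_tags and keys & effect_tags:
            yield line


# Hypothetical records, invented only to exercise the filter.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\trs1\tA\tG\t.\t.\tKGPROD;NSM;VC=SNV\n",
    "1\t200\trs2\tC\tT\t.\t.\tKGPROD;INT;VC=SNV\n",
    "1\t300\trs3\tG\tA\t.\t.\tSYN;VC=SNV\n",
]
kept_ids = [
    line.split("\t")[2]
    for line in filter_by_tags(example)
    if not line.startswith("#")
]
print(kept_ids)
```

For a real file you would pass an open file handle (for example from `gzip.open(path, "rt")`) instead of the `example` list.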
Another way is to intersect the dbSNP VCF with exon coordinates.
Javad: Don't forget to follow up on this thread.
If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one answer as long as it works.
Yeah, but I didn't want to write scripts. I thought maybe this data was already stored somewhere. Thank you anyway.
Filtering by tags is a one-line command if one knows how to use bcftools.
Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.
It's just a single command but okay.
I don't think it would be just a single command, because the coordinates of exons are not included in the VCF file. Am I missing something? Could you please give me some hints on how to go about it? Thanks.
You would need a BED file of the targets of interest, essentially the exome. You can get those from UCSC.
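As a hint for how the BED file is used, here is a minimal pure-Python sketch of the intersection. It assumes a standard BED file (0-based, half-open intervals) and a VCF whose `POS` column is 1-based, and it checks only the `POS` of each record, which is enough for SNPs. The function names and the toy data are hypothetical. Chromosome names must match between the two files: UCSC BED files typically use `chr1`, while 1000 Genomes VCFs on GRCh37 use `1`.

```python
import bisect
from collections import defaultdict


def load_bed(lines):
    """Return {chrom: (starts, ends)} with sorted, merged BED intervals."""
    raw = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw[chrom].append((int(start), int(end)))
    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:  # overlaps or touches the previous one
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = ([s for s, _ in out], [e for _, e in out])
    return merged


def in_exon(exons, chrom, pos):
    """True if 1-based VCF position `pos` falls inside a BED interval."""
    if chrom not in exons:
        return False
    starts, ends = exons[chrom]
    zero_based = pos - 1  # BED is 0-based, half-open: start <= x < end
    i = bisect.bisect_right(starts, zero_based) - 1
    return i >= 0 and zero_based < ends[i]


def filter_vcf(vcf_lines, exons):
    """Yield header lines, plus records whose POS lies in an exon."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if in_exon(exons, chrom, int(pos)):
            yield line


# Toy data, invented for illustration only.
bed = ["1\t99\t200\n", "1\t150\t250\n", "2\t10\t20\n"]
vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t99\trsA\tA\tG\t.\t.\t.\n",
    "1\t100\trsB\tA\tG\t.\t.\t.\n",
    "1\t250\trsC\tC\tT\t.\t.\t.\n",
    "1\t251\trsD\tC\tT\t.\t.\t.\n",
    "3\t15\trsE\tG\tA\t.\t.\t.\n",
]
exons = load_bed(bed)
exonic_ids = [
    line.split("\t")[2]
    for line in filter_vcf(vcf, exons)
    if not line.startswith("#")
]
print(exonic_ids)
```

Merging overlapping exons first keeps each lookup to a single binary search. On whole-genome files, dedicated tools such as the bedtools mentioned earlier in the thread will be much faster than a script like this.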
In general, no VCF file will contain exon coordinates. VCF files have coordinates for variants only. When you filter for variants in coding and UTR regions, this automatically covers the exonic regions, for the most part.