Hi, I am attempting to identify the exon that each of my synonymous or missense SNPs in the 1000 Genomes data belongs to. I am using the GENCODE GTF files found here: https://www.gencodegenes.org/human/ and extracting all exons.
I then use bedtools to identify which exon each of my SNPs falls in. It appears that the co-ordinates of many of my SNPs are not within any exon. What I would like to know is whether, and how, synonymous or missense SNPs can fall in intronic regions?
Why are you comparing to the GTF? There are tools designed to do exactly what you need.
I need to obtain the exon that each SNP lies in, as well as its start and end co-ordinates (because my ultimate goal is to identify the length of the specific exon that each SNP lies in). The available GENCODE annotation of 1000 Genomes variants provides the exon number within the gene, but not the exon ID or the start and end co-ordinates.
Simply get the GENCODE annotation for hg19, extract exons, and use bedtools intersect, where -a is the SNPs and -b is the exon.gtf. Use option -wb to return the entire interval of the matching exon. From there you can cut or awk out what you need.
Can you confirm that the reference genomes are the same, so hg19 vs hg19 or hg38 vs hg38?
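For anyone who wants to sanity-check the intersect output on a handful of SNPs, here is a minimal sketch of the same lookup in plain Python. It is not a replacement for bedtools, only an illustration of the logic. It assumes the standard tab-separated GTF layout (feature type in column 3, 1-based inclusive start and end in columns 4 and 5, an exon_id attribute in column 9) and 1-based SNP positions. The function names parse_gtf_exons and exons_for_snp and the toy records are hypothetical, made up for this example.

```python
import re


def parse_gtf_exons(lines):
    """Collect (chrom, start, end, exon_id) for every exon feature.

    GTF coordinates are 1-based and inclusive on both ends.
    """
    exons = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # keep exon features only
        match = re.search(r'exon_id "([^"]+)"', fields[8])
        exon_id = match.group(1) if match else None
        exons.append((fields[0], int(fields[3]), int(fields[4]), exon_id))
    return exons


def exons_for_snp(chrom, pos, exons):
    """Return every exon that contains a 1-based SNP position.

    Chromosome names are compared as plain strings, so "chr1" and "1"
    do not match each other.
    """
    return [
        {"exon_id": eid, "start": start, "end": end, "length": end - start + 1}
        for c, start, end, eid in exons
        if c == chrom and start <= pos <= end
    ]


# Toy records, invented for illustration only (not real GENCODE entries).
toy_gtf = [
    'chr1\tHAVANA\tgene\t100\t500\t.\t+\t.\tgene_id "G1";',
    'chr1\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id "G1"; exon_id "E1";',
    'chr1\tHAVANA\texon\t400\t500\t.\t+\t.\tgene_id "G1"; exon_id "E2";',
]

exons = parse_gtf_exons(toy_gtf)
hits = exons_for_snp("chr1", 150, exons)    # position inside exon E1
misses = exons_for_snp("chr1", 300, exons)  # position between the two exons
print(hits)
print(misses)
```

The linear scan is fine for spot checks but far too slow for a whole call set, which is where bedtools intersect comes in. Note that the lookup only finds a hit when the chromosome names and the assembly of the two files agree.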
Hi, thanks for the response. Yes, I can confirm that the reference genomes are the same: hg38.
Where are you getting your 1K genome SNPs from?
I am getting the data with GENCODE annotations here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/functional_annotation/
Could these be artifacts from a liftOver operation, perhaps?