I'm working with some human breakpoint data:
Chr.L Pos.L Strand.L Chr.H Pos.H Strand.H
18 19092052 + 18 30289323 +
I would like to know where breakpoints generally occur, e.g. whether they join together two exons, introns, UTRs, etc.
I have tried querying Ensembl via BioMart in R using:
attributes = c("transcript_biotype"), filters = c("chromosomal_region")
When I use the first position 18:19092052:19092052:1
it returns some transcripts which are out of range (e.g. 18822203-19035091) but seems to return the correct transcript with transcript start and end values overlapping the input, so I can work with that.
However for the second position 18:30289323:30289323:1
it does not return anything. Does this mean it is noncoding DNA? Is this happening because I am querying Ensembl Genes? I can live with that too but I'd just like it confirmed.
Otherwise, is there a better way I could do this? Perhaps using an SNP tool, like ANNOVAR?
Hi Emily - many thanks for your help! :) I've misled you a bit with my question though, sorry! I have a lot of breakpoint data to analyse; that example was just the first line in one of the files. Also, the forward strand is correct for both breakpoints (it is a large deletion). VEP looks interesting. I've previously been using tabix/vcftools to query 1000 Genomes, but VEP looks like it could be much easier. The SNPs idea is great too.
Wasn't sure if you had lots or just one or two. You can analyse the whole lot at once with the VEP. If you've got lots, I recommend the Perl script and downloading a cache:
http://www.ensembl.org/info/docs/variation/vep/vep_script.html
A breakpoint surely doesn't have a strand, as both strands are broken. If you specify a strand to BioMart it will only look for features on one strand, but if you don't specify a strand, it will look for features on both strands, which is what you want.
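As a rough illustration of the point about strand, here is a small Python sketch that turns a row of the breakpoint file into region strings with the strand field left off, and checks whether a returned transcript range actually contains the breakpoint. The function names (parse_breakpoint_row, region_without_strand, overlaps) are made up for this example, and I'm assuming the chr:start:end form of the chromosomal_region value; the only real coordinates are the ones quoted in the question.

```python
from typing import NamedTuple, Tuple


class Breakpoint(NamedTuple):
    chrom: str
    pos: int


def parse_breakpoint_row(row: str) -> Tuple[Breakpoint, Breakpoint]:
    """Parse one 'Chr.L Pos.L Strand.L Chr.H Pos.H Strand.H' row.

    The strand columns are read but deliberately discarded, so that
    features on both strands can be looked up later.
    """
    chr_l, pos_l, _strand_l, chr_h, pos_h, _strand_h = row.split()
    return Breakpoint(chr_l, int(pos_l)), Breakpoint(chr_h, int(pos_h))


def region_without_strand(bp: Breakpoint) -> str:
    """Build a 'chr:start:end' region value with no trailing strand field
    (assumed format for the chromosomal_region filter)."""
    return f"{bp.chrom}:{bp.pos}:{bp.pos}"


def overlaps(pos: int, start: int, end: int) -> bool:
    """True if pos lies within the closed interval [start, end]."""
    return start <= pos <= end


# The example row from the question.
low, high = parse_breakpoint_row("18 19092052 + 18 30289323 +")
regions = [region_without_strand(low), region_without_strand(high)]
print(regions)

# The out-of-range transcript mentioned in the question (18822203-19035091)
# does not contain the first breakpoint, so it can be filtered out.
print(overlaps(low.pos, 18822203, 19035091))
```

The same overlap check can be applied to the transcript start and end values of every returned row to drop hits that do not span the breakpoint.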