I have a long list of genomic regions (chromosome, start and end) in hg19 coordinates, extracted from not very specific alignments. I would like to find which genes fall in each of the regions (there are more than 20,000 of them). For instance:
chr1:33183334-33190838
chr10:93066718-93371217
I would like to see which genes fall within each region. If I had just a handful, I could go to the genome browser, look them up and jot down the genes. However, I need this automated, as I have more than 20,000 regions and not all of them coincide exactly with the start and end of a gene. So if a region lies in the middle of a gene, I still need to account for that. I think that if I had a file with the following columns: gene ID (NOT transcript ID), gene name, chromosome, start and end (i.e. a BED file), with respect to hg19, then I could use it and write some Python code. But I couldn't find such a file with the gene ID as the identifier. If anyone has any suggestions, please let me know.
No need to write code. This will do what you need: http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html
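Note that bedtools intersect expects interval files such as BED, while the regions in the question are written as chr:start-end. Here is a minimal Python sketch of that conversion. The function name `region_to_bed` and the `one_based` flag are made up for illustration, and it assumes the region strings are 1-based and inclusive (as in the UCSC browser position box), whereas BED is 0-based and half-open. If your alignment coordinates are already 0-based, pass `one_based=False`.

```python
import re

# Matches strings such as "chr1:33183334-33190838".
REGION_RE = re.compile(r"^(chr[\w.]+):(\d+)-(\d+)$")

def region_to_bed(region, one_based=True):
    """Convert 'chrom:start-end' into a tab-separated BED line.

    Assumption: the input is 1-based and inclusive unless one_based=False.
    BED is 0-based and half-open, so only the start is shifted by one.
    """
    match = REGION_RE.match(region.strip().replace(",", ""))
    if match is None:
        raise ValueError(f"Unrecognized region string: {region!r}")
    chrom = match.group(1)
    start = int(match.group(2))
    end = int(match.group(3))
    if one_based:
        start -= 1
    return f"{chrom}\t{start}\t{end}"

regions = ["chr1:33183334-33190838", "chr10:93066718-93371217"]
bed_lines = [region_to_bed(r) for r in regions]
print("\n".join(bed_lines))
```

Writing `bed_lines` to a file gives something that can be used as one of the two inputs to bedtools intersect.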
Oh nice, but what would my reference file to intersect against be? That's the problem: I can't find a file with "GeneID", "GeneName" and chromosome location.
You create one. See my answer.
Here is what I did:
I downloaded a reference file from UCSC for hg19 and rearranged the columns to have this structure:
Then, I got my input file (which has the query locations) as such:
Then I used the command:
My understanding is that each row in the querrylist file will be intersected with ensGene.txt. However, the output file had the same number of lines as ensGene (which is 3 x my query list) and did not give me the output of interest. I tried switching -a and -b, but that did not help. Any advice?
I double-checked. My bad, the result just reflects multiple overlaps per query. I thought it matched one to one. Thanks.
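For anyone else confused by the line count, here is a small Python sketch of why an overlap join is one-to-many. It is only an illustration with made-up genes and coordinates (GENE_A, GENE_B and GENE_C are hypothetical), not a reimplementation of bedtools, and it assumes 0-based half-open intervals as in BED.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end

# Hypothetical gene table: (gene_id, chrom, start, end).
genes = [
    ("GENE_A", "chr1", 100, 500),
    ("GENE_B", "chr1", 400, 900),
    ("GENE_C", "chr1", 2000, 3000),
]

# Hypothetical query regions: (chrom, start, end).
queries = [
    ("chr1", 450, 480),   # lies inside both GENE_A and GENE_B
    ("chr1", 950, 1000),  # overlaps no gene
]

def intersect(queries, genes):
    """Return one row per (query, gene) pair that overlaps."""
    hits = []
    for q_chrom, q_start, q_end in queries:
        for gene_id, g_chrom, g_start, g_end in genes:
            if q_chrom == g_chrom and overlaps(q_start, q_end, g_start, g_end):
                hits.append((q_chrom, q_start, q_end, gene_id))
    return hits

hits = intersect(queries, genes)
for hit in hits:
    print(*hit, sep="\t")
```

The first query produces two rows and the second produces none, so the number of output lines does not have to match the number of query lines. The nested loop is fine for a toy example but slow for more than 20,000 regions against a whole annotation, which is why bedtools is the better tool for the real data.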