I have a large list of genomic positions for the latest assembly of the rat genome (rn4). I would like to map those positions onto the genome and get a list of protein coding genes that lie in the same regions.
More specifically, my questions are:
I understand that the Rat Genome Sequencing Consortium (http://www.hgsc.bcm.tmc.edu/project-species-m-Rat.hgsc?pageLocation=Rat) provides the reference assembly for the rat genome. How does the annotated data available at UCSC Genome Browser, NCBI Genome and ENSEMBL compare and what kind of annotation do they offer?
Which database would you use for the task described above (return genes for genomic positions)?
What program would you use to do that (e.g. R/BioConductor ...)?
I haven't worked much with sequence data before and I am a little confused by the diversity of annotation databases. Any help would be great!
I believe the rn4 assembly is the same as the Baylor 3.4 assembly, so you can easily retrieve the genes (plus annotation) in your regions of interest using Ensembl BioMart:
Choose the ‘Rattus norvegicus genes (RGSC3.4)’ dataset.
Click on ‘Filters’ in the left panel.
Expand the ‘REGION’ section by clicking on the + box.
Enter your list of genomic regions in the 'Multiple chromosomal regions' text box.
Click on ‘Attributes’ in the left panel.
Select any attributes you want to output.
Click the [Results] button on the toolbar.
Check 'Unique results only'.
Select ‘View All rows as HTML’ or export all results to a file (note that you can export to an Excel spreadsheet by choosing 'XLS' as your file format).
You can also find a video on how to use BioMart on YouTube.
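If your positions live in a table, the region strings for the 'Multiple chromosomal regions' box can be generated rather than typed. This is only a minimal sketch: it assumes the box accepts one Chr:Start:End entry per line (check the hint shown next to the box in BioMart), that Ensembl wants chromosome names without the UCSC-style "chr" prefix, and the helper name `to_biomart_regions` and the example coordinates are made up for illustration.

```python
def to_biomart_regions(positions, flank=0):
    """Format (chrom, start, end) tuples as Chr:Start:End strings.

    `flank` optionally widens each region on both sides, which is handy
    when the input is a list of single-base positions.
    """
    regions = []
    for chrom, start, end in positions:
        # Assumption: Ensembl uses bare names ("1", "X"), UCSC uses "chr1", "chrX".
        name = chrom[3:] if chrom.startswith("chr") else chrom
        lo = max(1, start - flank)  # coordinates are 1-based, so never go below 1
        hi = end + flank
        regions.append(f"{name}:{lo}:{hi}")
    return regions


# Hypothetical example positions, not real data.
positions = [("chr1", 1000, 2000), ("chrX", 500, 500)]
print("\n".join(to_biomart_regions(positions, flank=100)))
```

The printed lines can then be pasted into the text box. Keeping the list to a few hundred regions per query is probably wise.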
As far as annotation is concerned, I would use as many sources as is practical. Here, annotation means the data, such as function, associated with each gene. Once you have a list of genes, you'd like to know their attributes, or have those handy in a data table. In this regard, it could be quite informative to grab similar data for the mouse genes corresponding to these regions in rat. Two examples: mouse knockouts will give you important functional data, and mouse QTLs will help link genes to disease.
For rat, I don't do much genome-wide work, but look at single genes. I prefer the Rat Genome Database at the Medical College of Wisconsin for that info.
Thanks for your reply, but my question is about finding genes within the rat genome (rather than associating more information with these genes). Once I have the genes, I will look into that. I think mapping to mouse/human could be interesting.
Have a look at BEDTools and http://biostar.stackexchange.com/search?q=intersectBed
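To make concrete what an interval intersection like intersectBed reports, here is a rough Python sketch of the overlap test. It is not how BEDTools is implemented (this is a plain linear scan per chromosome), it assumes BED-style 0-based, half-open coordinates, and the function name `genes_overlapping` and the gene records are invented for illustration.

```python
from collections import defaultdict


def genes_overlapping(regions, genes):
    """Return {region: [gene names]} for genes overlapping each region.

    regions: iterable of (chrom, start, end)
    genes:   iterable of (chrom, start, end, name)
    Coordinates are assumed to be 0-based, half-open, as in BED files.
    """
    # Group genes by chromosome so each region is only compared to its own chromosome.
    by_chrom = defaultdict(list)
    for chrom, start, end, name in genes:
        by_chrom[chrom].append((start, end, name))

    hits = {}
    for chrom, start, end in regions:
        # Two half-open intervals overlap when each starts before the other ends.
        hits[(chrom, start, end)] = [
            name
            for g_start, g_end, name in by_chrom.get(chrom, [])
            if g_start < end and start < g_end
        ]
    return hits


# Hypothetical gene annotations and query regions.
genes = [
    ("chr1", 100, 500, "GeneA"),
    ("chr1", 800, 1200, "GeneB"),
    ("chr2", 50, 300, "GeneC"),
]
regions = [("chr1", 450, 900), ("chr2", 400, 600)]

for region, names in genes_overlapping(regions, genes).items():
    print(region, names)
```

For a large list of positions against a whole-genome gene table, BEDTools itself will be much faster than this scan; the sketch is only meant to show the logic.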