I'm interested in a newer assembly of an organism that is available from NCBI but not from Ensembl. However, I still want to use the Ensembl gene definitions. The annotation file on Ensembl has coordinates based on the previous assembly. In NCBI Sequence Viewer 3.49.0, I see an Ensembl gene annotations track that matches the newer assembly I'm interested in. I assume that the coordinates of the gene annotation have been corrected to match the newer assembly. However, NCBI Sequence Viewer 3.49.0 only allows downloading a range of a single chromosome. Is there a way to download the entire Ensembl annotation file with coordinates matching the newer assembly, which is not available on the Ensembl website?
Can you post a screenshot (and provide some information about what organism this is)? Point out the track you are referring to. If it is precomputed then it may be available.
Thank you for your response!
The organism is bovine (Bos taurus). Contamination was discovered in the two latest assemblies, ARS-UCD1.2 and ARS-UCD1.3, both referred to as bosTau9.
Now, there is a newer assembly, ARS-UCD2.0 (bosTau9), whose screenshot I have attached. The second track named Genes, Ensembl release 112, is the one I would like to retrieve.
The annotation files on the Ensembl website for bovine are based on the assembly ARS-UCD1.3. I would like to download an Ensembl gene annotation file compatible with ARS-UCD2.0.
Hello, may I ask which annotation file you finally chose? I am using ARS-UCD2.0 from RefSeq, but the RefSeq annotation file is too difficult for me to deal with, as many transcript_id values are missing. Can I use the Ensembl annotation file instead if I change the chromosome names?
It is generally safer to use the sequence and annotations from the same provider. So if you want to use the Ensembl annotation, use the corresponding genome: http://ftp.ensembl.org/pub/rapid-release/species/Bos_taurus/GCA_002263795.3/ensembl/genome/Bos_taurus-GCA_002263795.3-unmasked.fa.gz
Thanks. Actually, I have used the genome from RefSeq, so I have to use the RefSeq annotation file. I need to quantify gene expression, but the RefSeq annotation file contains some entries without a transcript_id. It also contains some pseudogenes and tRNAs, which I don't know how to deal with. I am trying to find a solution. Could you give me some suggestions?
Are you using "transcript_id" as the key for counting with featureCounts? Then you should only get counts for those rows that have that key. Summarize at the gene level unless you have a specific need to do transcript-level counts.
I'm using the salmon pipeline, which outputs transcript-level quantification values, but I need gene-level values, so I have to use tximport, which needs a file mapping transcript_id to gene_id. The GFF file contains some entries without a transcript_id, and these are dropped when I generate the transcript_id-to-gene_id mapping file.
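One possible way to keep those entries is to build the transcript-to-gene table directly from the Parent links in the GFF3, and fall back to the feature's ID when transcript_id is absent. The following is only a minimal sketch under assumptions: the function names are made up for illustration, the example records are invented and not taken from the real ARS-UCD2.0 annotation, and it assumes that gene rows have the type "gene" or "pseudogene" and that transcript rows point to them via Parent. Whether the fallback ID matches the transcript names in your salmon index depends on the FASTA you indexed, so check that before using the table.

```python
def parse_attributes(field):
    """Parse a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def tx2gene_from_gff3(lines):
    """Return (transcript, gene) pairs for features whose Parent is a gene.

    Assumption: rows of type exon/CDS are sub-features, not transcripts.
    """
    gene_ids = set()
    candidates = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        feature_type = cols[2]
        attrs = parse_attributes(cols[8])
        if feature_type in ("gene", "pseudogene"):
            gene_ids.add(attrs.get("ID"))
        elif feature_type not in ("exon", "CDS") and "Parent" in attrs:
            candidates.append(attrs)

    pairs = []
    for attrs in candidates:
        # GFF3 allows several comma-separated parents.
        for parent in attrs["Parent"].split(","):
            if parent in gene_ids:
                # Fall back to the feature ID when transcript_id is missing.
                tx = attrs.get("transcript_id") or attrs.get("ID")
                pairs.append((tx, parent))
    return pairs


# Invented example records (not real annotation lines).
EXAMPLE = (
    "##gff-version 3\n"
    "chr1\tsrc\tgene\t1\t1000\t.\t+\t.\tID=gene-A;Name=A\n"
    "chr1\tsrc\tmRNA\t1\t1000\t.\t+\t.\tID=rna-XM_1;Parent=gene-A;transcript_id=XM_1\n"
    "chr1\tsrc\texon\t1\t1000\t.\t+\t.\tID=exon-1;Parent=rna-XM_1;transcript_id=XM_1\n"
    "chr1\tsrc\tgene\t2000\t2072\t.\t+\t.\tID=gene-T\n"
    "chr1\tsrc\ttRNA\t2000\t2072\t.\t+\t.\tID=rna-T;Parent=gene-T\n"
)

if __name__ == "__main__":
    # Print a two-column, tab-separated table (transcript, gene).
    for tx, gene in tx2gene_from_gff3(EXAMPLE.splitlines()):
        print(f"{tx}\t{gene}")
```

The resulting two-column table (transcript first, gene second) is the shape tximport expects for its tx2gene argument. If you do not want tRNAs or pseudogenes in the gene-level summary, you could filter on the feature type before writing the table.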