Puzzled by the different chicken genome assemblies
9.0 years ago
asta.laiho • 0

I have a chicken (Gallus gallus) RNA-seq data set to analyze and I'm trying to figure out which source to use for the genome reference and gene annotations. iGenomes (https://support.illumina.com/sequencing/sequencing_software/igenome.html) and the TopHat site (https://ccb.jhu.edu/software/tophat/igenomes.shtml) offer different options, but neither provides statistics on the number of contigs/chromosomes and gene models in the different assemblies. The iGenomes UCSC galGal4 bundle, for example, contains only ~6000 gene models, although some other assemblies list more than 15k gene models. I would be very glad if someone who knows more about the different chicken genome assemblies and sources could help me decide which genome version to use.
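Since the download pages do not list contig/chromosome counts, one way to get them is to count the records in the FASTA file yourself. The following is a minimal sketch, not taken from any of the sites above; the function names and the file name in the usage comment are hypothetical, and it assumes a standard FASTA file (optionally gzip-compressed) where each record starts with a ">" header line.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def fasta_sequence_lengths(path):
    """Return {sequence_name: length} for every record in a FASTA file."""
    lengths = {}
    name = None
    with open_text(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first whitespace-delimited token as the name.
                parts = line[1:].split()
                name = parts[0] if parts else ""
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


# Hypothetical usage (the file name is an assumption):
# lengths = fasta_sequence_lengths("genome.fa")
# print(len(lengths), "sequences,", sum(lengths.values()), "bases in total")
```

Comparing the number of sequences and the total length between two downloads is a quick way to see whether they are the same build with or without unplaced contigs.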

RNA-Seq genome • 2.5k views

Which annotations list more than 15k genes? I do not know for Gallus gallus, but for either Homo sapiens or Bos taurus (or maybe both, I do not remember), NCBI annotations include SNP variants as genes, so the number of gene models is inflated compared to other annotations.


Using the sequence/annotation/index bundle from iGenomes or the TopHat site will ensure that you don't run into problems such as sequence differences or multiple gene models with identical names. Once you have done your alignments, you could, in theory, use any annotation file (as long as it is for the same genome build) to get your gene counts. UCSC likely covers all important genes. The additional models you see in other sources likely include splice variants, non-coding genes, pseudogenes, etc.
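When mixing an annotation from one source with a genome from another, a cheap sanity check is to confirm that every sequence name used in the annotation also exists in the FASTA file (for example "chr1" versus "1"). The sketch below is only an illustration; the function names and file names are hypothetical, and matching names do not by themselves prove that the two files come from the same build.

```python
def fasta_names(lines):
    """Collect sequence names from the '>' header lines of a FASTA file."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                names.add(parts[0])
    return names


def annotation_seqnames(lines):
    """Collect the first column (sequence name) of a GTF/GFF file."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def missing_from_genome(fasta_lines, annotation_lines):
    """Return annotation sequence names that are absent from the FASTA."""
    return sorted(annotation_seqnames(annotation_lines) - fasta_names(fasta_lines))


# Hypothetical usage (file names are assumptions):
# with open("genome.fa") as fa, open("genes.gtf") as gtf:
#     print(missing_from_genome(fa, gtf))
```

An empty result means the names are at least compatible; a long list usually points to a naming-convention mismatch or a different assembly.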

9.0 years ago
cyril-cros ▴ 950

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000002315.3_Gallus_gallus-4.0

I would say go for RefSeq. As a general way of doing that, go to the NCBI Assembly website, search for your species, and choose the latest assembly if there are several. At the top right of the page there is a link to the RefSeq FTP site. You can also browse ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/ directly, but there you don't see the Assembly version info.

RefSeq has been curated and is in a nice standardized format, which helps if you are dealing with several species. Some newly sequenced organisms take a bit of time to be included, but that is not the case for Gallus gallus. The README describes the various files, which use a weird GCF_xxxxxxxx prefix. Isoforms are also included, so you can't just count the number of entries to get the total number of genes.
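To separate genes from isoforms when counting, you can tally the feature types in column 3 of the annotation file instead of counting lines. This is a minimal sketch with a hypothetical function name; it assumes a tab-delimited GFF3 or GTF file with nine columns, and that genes and transcripts are labeled with types such as "gene" and "mRNA", which varies between sources.

```python
from collections import Counter


def feature_type_counts(lines):
    """Count how often each feature type (column 3) occurs in a GFF/GTF file."""
    counts = Counter()
    for line in lines:
        # Skip comments, directives, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9:
            counts[fields[2]] += 1
    return counts


# Hypothetical usage (the file name is an assumption):
# with open("annotation.gff") as handle:
#     counts = feature_type_counts(handle)
# print(counts["gene"], "genes,", counts["mRNA"], "mRNA records")
```

Note that some GTF files (exon-only ones, for instance) carry no "gene" rows at all, in which case you would have to count distinct gene identifiers in the attributes column instead.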

As a side note, some assemblies might be better than others. I took a look at the coelacanth genome, on which there were two papers. One has an official RefSeq annotation; the other has gene models in GenBank. The second one was done with much better coverage and more tissues. Because coelacanths are a protected species under the CITES convention, the first group only had 20-year-old frozen muscle tissue. The second group took fresh samples, but tried a novel annotation workflow.


The problem with RefSeq is that if you are going to use TopHat for alignment (which sounds like the case, based on the original question), it may complain about the GTF file format.


GTF should be supported, with the -G/--GTF <GTF/GFF3 file> option. I tried in January, I think, and it worked.

Anyway, I switched to STAR, which works at a ridiculous speed if you have at least 32 GB of RAM at your disposal. It needs a couple of options to be compatible with Cufflinks.

9.0 years ago
igor 13k

Check Ensembl: http://useast.ensembl.org/Gallus_gallus/Info/Annotation

The data source description is very clear: there are over 15,000 coding genes, and the reference files are usually compatible with most software tools.
