Question

Ensembl reference genomes and annotations by chromosome only.

0

4.4 years ago

devarts ▴ 40

Hi All. Ensembl has the reference genome and annotations that I need separated into files by chromosome. I'd like to combine these into one file, with an additional column specifying which chromosome each entry is found on, so that I can align reads and generate feature counts for the organism's whole genome. From those counts I can then do differential expression and gene network analysis.

Does anyone know of a way I can do this? Software, scripts?

ensembl alignment RNA-Seq feature counts • 1.1k views

ADD COMMENT • link updated 4.4 years ago by h.mon 35k • written 4.4 years ago by devarts ▴ 40

1

What have you tried? Shell loops in conjunction with awk can do this if used well.
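
As a rough illustration of the merge idea, here is a minimal Python sketch (not the shell/awk approach itself) that concatenates per-chromosome GFF3 files into one file. It assumes the inputs are standard GFF3, where the first column of every feature line already holds the sequence (chromosome) name, so no extra chromosome column needs to be added. The function names `open_text` and `merge_gff3` are hypothetical helpers made up for this example.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def merge_gff3(chrom_paths, out_path):
    """Concatenate per-chromosome GFF3 files into one uncompressed file.

    Only the first '##gff-version' line is kept; every other line is
    copied unchanged, in the order the input paths are given.
    """
    wrote_version = False
    with open(out_path, "w") as out:
        for path in chrom_paths:
            with open_text(path) as handle:
                for line in handle:
                    if line.startswith("##gff-version"):
                        if wrote_version:
                            continue  # skip repeated version headers
                        wrote_version = True
                    out.write(line)
```

Calling `merge_gff3(list_of_paths, "merged.gff3")` with the paths sorted in the desired chromosome order would write a single annotation file; the output name here is only a placeholder.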

ADD REPLY • link 4.4 years ago by Ram 44k

0

I didn't have any initial ideas, other than running alignment and feature counting for each chromosome individually and then using dplyr to combine the data tables in R, as I was planning to do the differential expression with edgeR. But I'll try using awk and shell loops to do this prior to alignment. Thank you. Any additional hints would be greatly appreciated.

ADD REPLY • link 4.4 years ago by devarts ▴ 40

score 2 · Accepted Answer · 2020-08-25

2

4.4 years ago

h.mon 35k

I have downloaded several genomes and annotations from Ensembl, each available as a single genome (FASTA) file and a single annotation (GFF or GTF) file. It may be a bit confusing at first, with all the .dna.chromosome.1.fa.gz and .chromosome.X.gff3.gz files, but you just want the files with no chromosome in their names.

For example, for pig, you will want Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz for the reference genome and Sus_scrofa.Sscrofa11.1.101.gff3.gz for the GFF annotation (the gtf directory doesn't split the annotation by chromosome).
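
The "no chromosome in the name" rule can be sketched as a simple filename filter. This is only an illustration: the function name `whole_genome_files` is hypothetical, and the per-chromosome names in the example list are made up by following the naming patterns mentioned above, not copied from an actual Ensembl directory listing.

```python
def whole_genome_files(filenames):
    """Return the names that are not per-chromosome splits.

    A name counts as a per-chromosome split if it contains the
    substring '.chromosome.'.
    """
    return [name for name in filenames if ".chromosome." not in name]


# Illustrative listing; the per-chromosome entries are assumed names.
listing = [
    "Sus_scrofa.Sscrofa11.1.dna.chromosome.1.fa.gz",
    "Sus_scrofa.Sscrofa11.1.dna.chromosome.X.fa.gz",
    "Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz",
    "Sus_scrofa.Sscrofa11.1.101.chromosome.1.gff3.gz",
    "Sus_scrofa.Sscrofa11.1.101.gff3.gz",
]

selected = whole_genome_files(listing)
```

A real Ensembl directory may contain other variants (for example, differently masked genome files) that this filter would not exclude, so the result still needs a manual check.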

ADD COMMENT • link 4.4 years ago by h.mon 35k

0

h.mon, I think that's the ticket. I was uncertain about using the toplevel files. I did see that those are available for my organism.

ADD REPLY • link 4.4 years ago by devarts ▴ 40