Question

Why is the human genome FASTA file on GENCODE much smaller than the one on ENSEMBL?

1

4.1 years ago

Xiaokang ▴ 80

I have some RNA sequencing reads to align to the human reference genome. I found genome FASTA files on both GENCODE and ENSEMBL: GRCh38.p13.genome.fa.gz and Homo_sapiens.GRCh38.dna.toplevel.fa.gz. After unzipping them, I found that they are 3.1G and 60G respectively. Why is that? And which one should I use, considering that the purpose of the project is to detect gene fusions from the sequencing reads?

reference genome GENCODE ENSEMBL • 3.8k views

ADD COMMENT • link updated 4.1 years ago by GenoMax 151k • written 4.1 years ago by Xiaokang ▴ 80

score 6 · Accepted Answer · 2021-05-06

4.1 years ago

GenoMax 151k

The toplevel file from Ensembl includes haplotypes that are padded out with N's to the full length of the chromosome. That is why it is huge compared to the GENCODE file. Use the Ensembl primary assembly file, which is equivalent to the GENCODE one.

From the README at Ensembl:

---------
TOPLEVEL
---------
These files contains all sequence regions flagged as toplevel in an Ensembl
schema. This includes chromsomes, regions not assembled into chromosomes and
N padded haplotype/patch regions.
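
To see how much of each sequence is padding, one option is to count N bases per FASTA record. The following is a minimal sketch, not part of the Ensembl tooling; the function name `n_stats` and the toy sequence names are made up for illustration.

```python
import io


def n_stats(handle):
    """Return {sequence name: (length, fraction of N bases)} for a FASTA text handle."""
    counts = {}  # name -> [total bases, N bases]
    name = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # Keep only the first word of the header as the sequence name.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            counts[name] = [0, 0]
        elif name is not None and line:
            counts[name][0] += len(line)
            counts[name][1] += line.count("N") + line.count("n")
    return {
        key: (total, n_bases / total if total else 0.0)
        for key, (total, n_bases) in counts.items()
    }


# Toy input with made-up names: one ordinary sequence and one that is mostly padding.
toy = io.StringIO(">toy_chromosome\nACGTACGT\n>toy_haplotype\nNNNNNNACGTNN\n")
toy_stats = n_stats(toy)
for seq_name, (length, frac) in toy_stats.items():
    print(f"{seq_name}\t{length}\t{frac:.2f}")
```

For a real download, the same function should work on a handle from `gzip.open(path, "rt")`. It reads line by line, so memory use stays small even for a 60G file.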

0

Haha, fun truth. But then I'm wondering: can I use the same GTF annotation file with both the toplevel and the primary FASTA file? Or rather, will the N's confuse the coordinates in the GTF?

ADD REPLY • link 4.1 years ago by Xiaokang ▴ 80

0

Do not use the toplevel file unless you have a specific reason to do so, i.e. you need the haplotypes.

ADD REPLY • link 4.1 years ago by GenoMax 151k

0

I tried comparing the three files in question and found that there are 639 sequences in both the GENCODE genome and the ENSEMBL toplevel file, but only 194 sequences in the ENSEMBL primary assembly.
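
A comparison like this can be reproduced by collecting the header lines of each file. The sketch below is one possible way to do it, not necessarily how the counts above were obtained; `open_text`, `sequence_names`, and the toy file contents are hypothetical. Note that naming differences between providers (for instance a `chr` prefix) would show up as mismatches in the set difference.

```python
import gzip
import tempfile
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def sequence_names(path):
    """Return the first word of every FASTA header line, in file order."""
    names = []
    with open_text(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                names.append(fields[0] if fields else "")
    return names


# Toy files with made-up sequence names stand in for the real downloads.
with tempfile.TemporaryDirectory() as tmp:
    file_a = Path(tmp) / "a.fa"
    file_b = Path(tmp) / "b.fa.gz"
    file_a.write_text(">seqA some description\nACGT\n>seqB\nACGT\n>seqC\nNNNN\n")
    with gzip.open(file_b, "wt") as out:
        out.write(">seqA\nACGT\n>seqB\nACGT\n")
    names_a = sequence_names(file_a)
    names_b = sequence_names(file_b)

only_in_a = sorted(set(names_a) - set(names_b))
print(len(names_a), len(names_b), only_in_a)
```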

ADD REPLY • link 4.1 years ago by Xiaokang ▴ 80

0

I have a further question now that we have decided on the primary FASTA: there are three FASTA files flagged as primary, namely Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz (unmasked genomic DNA sequences), Homo_sapiens.GRCh38.dna_rm.primary_assembly.fa.gz (masked genomic DNA), and Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz (soft-masked genomic DNA). As this article says, we should avoid using rm. But how do I choose between the other two when it comes to (short/long-read) RNA-Seq alignment?

ADD REPLY • link 4.1 years ago by Xiaokang ▴ 80

2

Use the unmasked primary assembly. See: Masking reference for RNA-seq alignments

- 'dna_rm' - masked genomic DNA. Interspersed repeats and low complexity regions are detected with the RepeatMasker tool and masked by replacing repeats with 'N's.
- 'dna_sm' - soft-masked genomic DNA. All repeats and low complexity regions have been replaced with lowercased versions of their nucleic base.
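
The difference between the two masking styles can be illustrated on a toy sequence. This is only a sketch: the repeat interval is invented (0-based, half-open), and real masking is done by RepeatMasker, not by these hypothetical helper functions.

```python
def mask_flags(length, intervals):
    """Return a list of booleans marking positions covered by any interval."""
    flags = [False] * length
    for start, end in intervals:
        for i in range(max(0, start), min(end, length)):
            flags[i] = True
    return flags


def hard_mask(seq, intervals):
    """Replace masked bases with N, as in the 'dna_rm' description."""
    flags = mask_flags(len(seq), intervals)
    return "".join("N" if masked else base for base, masked in zip(seq, flags))


def soft_mask(seq, intervals):
    """Lowercase masked bases, as in the 'dna_sm' description."""
    flags = mask_flags(len(seq), intervals)
    return "".join(base.lower() if masked else base for base, masked in zip(seq, flags))


toy_seq = "ACGTACGTAC"
toy_repeats = [(2, 6)]  # invented repeat region covering positions 2..5

hard = hard_mask(toy_seq, toy_repeats)
soft = soft_mask(toy_seq, toy_repeats)
print(hard)  # ACNNNNGTAC
print(soft)  # ACgtacGTAC
```

Both forms keep the sequence length, so coordinates are unchanged. Soft masking keeps the bases themselves (uppercasing `soft` gives back `toy_seq`), whereas hard masking discards them.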

ADD REPLY • link 4.1 years ago by GenoMax 151k

score 3 · Accepted Answer · 2021-05-06

4.1 years ago

Juke34 9.2k

WTF with the ensembl human genome?