I have some RNA sequencing reads to align to the human reference genome. I found the genome FASTA files on both GENCODE and ENSEMBL: GRCh38.p13.genome.fa.gz and Homo_sapiens.GRCh38.dna.toplevel.fa.gz.
But after unzipping them, I found that they are 3.1G and 60G respectively. Why is that? And which one should I use, considering that the purpose of the project is to detect gene fusions from the sequencing reads?

Haha, fun truth. But then I'm wondering, can I use the same GTF annotation file on the toplevel and primary FASTA file? Or rather, will the N confuse the coordinates in GTF?

Do not use toplevel file unless you have a specific reason to do so, i.e. you need to use the haplotypes.

Trying to compare the 3 files in question, and found that there are 639 sequences in both GENCODE genome and ENSEMBL toplevel, but only 194 sequences in ENSEMBL primary.

Yet, I want to add a further question after we decide on primary FASTA: there are 3 FASTA files flagged with primary, which are Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz (unmasked genomic DNA sequences), Homo_sapiens.GRCh38.dna_rm.primary_assembly.fa.gz (masked genomic DNA), Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz (soft-masked genomic DNA). As this article says, we should avoid using rm. But how to choose between the other two, when it comes to (short/long-read) RNA-Sequencing alignment?

Use primary unmasked genome. See: Masking reference for RNA-seq alignments
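
For anyone who wants to check these numbers on their own downloads, here is a minimal Python sketch (not from the thread) that tallies the number of sequences, total bases, N bases and lowercase (soft-masked) bases in a FASTA file. The function names open_text and summarize_fasta are made up for illustration, and the path in the usage comment is only an example of where a download might live.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed file in text mode."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def summarize_fasta(path):
    """Count sequences, bases, N bases and lowercase bases in a FASTA file.

    Hypothetical helper for illustration; it streams the file line by line,
    so memory use stays small even for a very large genome file.
    """
    stats = {"sequences": 0, "bases": 0, "n_bases": 0, "lowercase_bases": 0}
    with open_text(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Each header line starts one sequence record.
                stats["sequences"] += 1
            else:
                stats["bases"] += len(line)
                # N (or n) marks unknown or padded positions.
                stats["n_bases"] += line.count("N") + line.count("n")
                # Soft-masked files write masked bases in lowercase.
                stats["lowercase_bases"] += sum(ch.islower() for ch in line)
    return stats


# Example usage (path is an assumption, adjust to your own download):
# print(summarize_fasta("Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"))
```

Running it on each file gives a sequence count to compare against the 639 and 194 reported above, and the n_bases and lowercase_bases fields show how much of a file is N or soft-masked sequence. It is pure Python, so expect it to be slow on a multi-gigabyte genome.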