I am trying to build a genome index for use with STAR, and I am a bit confused about which files I should use.
According to the STAR manual (§2.2.1):
It is strongly recommended to include major chromosomes (e.g., for human chr1-22,chrX,chrY,chrM,) as well as un-placed and un-localized scaffolds. Typically, un-placed/un-localized scaffolds add just a few MegaBases to the genome length, however, a substantial number of reads may map to ribosomal RNA (rRNA) repeats on these scaffolds. These reads would be reported as unmapped if the scaffolds are not included in the genome, or, even worse, may be aligned to wrong loci on the chromosomes. Generally, patches and alternative haplotypes should not be included in the genome.
I have downloaded the following:
wget ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.{1..22}.fa.gz
wget ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.{MT,X,Y}.fa.gz
I have not downloaded the masked genomes (_rm and _sm), but what about the following files?
Homo_sapiens.GRCh38.dna.nonchromosomal.fa.gz
: are these the scaffold sequences the STAR manual is talking about? The README file on the Ensembl FTP seems to imply scaffold sequences are in seqlevel files, but I cannot see any.
Homo_sapiens.GRCh38.dna.toplevel.fa.gz
: the README states this
contains all sequence regions flagged as toplevel in an Ensembl schema. This includes chromosomes, regions not assembled into chromosomes and N padded haplotype/patch regions.
So, according to the STAR manual I should not include this, is this correct?
Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
This contains
all toplevel sequence regions excluding haplotypes and patches.
So could I just use this instead of the chromosome files above? Or should I use it in addition?
Just use
Homo_sapiens.GRCh38.dna.primary_assembly.fa
as the reference; it doesn't make sense to concatenate all the other files to get the same file.
Thank you Benn. Just out of curiosity, could you confirm whether my understanding of the different files is correct?
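One way to check what each of these files contains is to list its FASTA header lines. The sketch below is only an illustration: the function name `list_fasta_names` is hypothetical, and it assumes Ensembl-style headers in which the first field is the sequence name and the second field describes the sequence type (for example `dna:chromosome` or `dna:scaffold`).

```shell
#!/usr/bin/env bash

# list_fasta_names FILE.fa.gz
# Hypothetical helper: print the name and the second header field of every
# sequence in a gzipped FASTA file, one sequence per line.
list_fasta_names() {
    # Decompress to stdout, keep only header lines (they start with '>'),
    # strip the leading '>' and print the first two whitespace-separated fields.
    gzip -dc < "$1" | awk '/^>/ { sub(/^>/, "", $1); print $1, $2 }'
}

# Example usage (assumes the file has already been downloaded):
#   list_fasta_names Homo_sapiens.GRCh38.dna.nonchromosomal.fa.gz | head
#   list_fasta_names Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz | cut -d' ' -f2 | sort | uniq -c
```

Comparing the output for the nonchromosomal, primary_assembly and toplevel files should show which of them contain scaffolds and which contain haplotype or patch regions.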
I don't know the answers to all your questions about what is in the different files. If you are interested, you can download them and see what's in them. The STAR manual tells us that
Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
is an acceptable file to use, which is why I recommended it. Good luck with the mapping.
You can get the reference genome here: https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/
Source: https://github.com/STAR-Fusion/STAR-Fusion/wiki (go to "Data Resource Required")
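If you would rather build the index yourself from the primary assembly, as suggested earlier in the thread, the STAR call could look roughly like the sketch below. The GTF file name, output directory, overhang value and thread count are assumptions that are not mentioned in the thread, and the helper `star_index_cmd` is hypothetical; it only prints the command so it can be reviewed before anything is run. STAR expects an uncompressed FASTA file, so the downloaded `.fa.gz` has to be gunzipped first.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed inputs and settings; adjust to your own setup.
FASTA="Homo_sapiens.GRCh38.dna.primary_assembly.fa"  # gunzipped primary assembly
GTF="Homo_sapiens.GRCh38.96.gtf"                     # assumed matching annotation file
GENOME_DIR="star_index"                              # hypothetical output directory
OVERHANG=100                                         # ideally read length minus 1
THREADS=8                                            # assumed number of threads

# Hypothetical helper: print the index-generation command instead of running it.
star_index_cmd() {
    printf 'STAR --runMode genomeGenerate --genomeDir %s --genomeFastaFiles %s --sjdbGTFfile %s --sjdbOverhang %s --runThreadN %s\n' \
        "$GENOME_DIR" "$FASTA" "$GTF" "$OVERHANG" "$THREADS"
}

star_index_cmd
```

Printing the command first makes it easy to confirm that a single primary_assembly FASTA is passed, rather than the per-chromosome files plus extra scaffolds.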