The top-level fasta file will include chromosomes, regions not assembled into chromosomes and N-padded haplotype/patch regions. See more here: ftp://ftp.ensembl.org/pub/release-92/fasta/mus_musculus/dna/README. If you are only looking for reference genome assembly chromosome-level sequences, then use the primary_assembly.fa file.
The files in the dna_index directory are genomic sequence files which are bgzipped and tabix indexed (for more details on what this means see: http://www.htslib.org/doc/tabix.html). These are downloaded by the Variant Effect Predictor (VEP) installer to allow quicker VEP'ing. The fasta file without the .fai or .gzi suffix, although stated to be a different size, is identical to the fasta file in the fasta/mus_musculus/dna/ folder so you can download either and you'd get the same data.
We'll update the README files, or 'hide' the dna_index folder to avoid confusion between these files in the two folders. Thanks for bringing it to our attention!
These files contain all sequence regions flagged as toplevel in an Ensembl schema. This includes chromosomes, regions not assembled into chromosomes and N-padded haplotype/patch regions.
From the STAR manual (emphasis mine)
2.2.1 Which chromosomes/scaffolds/patches to include?
It is strongly recommended to include major chromosomes (e.g., for human
chr1-22,chrX,chrY,chrM,) as well as un-placed and un-localized
scaffolds. Typically, un-placed/un-localized scaffolds add just a few
MegaBases to the genome length, however, a substantial number of reads
may map to ribosomal RNA (rRNA) repeats on these scaffolds. These
reads would be reported as unmapped if the scaffolds are not included
in the genome, or, even worse, may be aligned to wrong loci on the
chromosomes. Generally, patches and alternative haplotypes should not
be included in the genome. Examples of acceptable genome sequence
files:
• ENSEMBL: files marked with .dna.primary.assembly, such as:
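For anyone landing here who wants to see the index-building step itself, here is a minimal sketch in Python that assembles a STAR `genomeGenerate` command from the primary assembly FASTA and the matching GTF. The output directory name, the thread count and the read length of 100 are my assumptions, not something given in this thread; the FASTA and GTF are assumed to be uncompressed.

```python
import shlex


def star_index_command(genome_dir, fasta, gtf, read_length=100, threads=4):
    """Build the argument list for STAR's genomeGenerate run mode."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
        # Read length minus 1; read_length=100 is an assumed value.
        "--sjdbOverhang", str(read_length - 1),
    ]


cmd = star_index_command(
    "star_index",  # hypothetical output directory
    "Homo_sapiens.GRCh38.dna.primary_assembly.fa",
    "Homo_sapiens.GRCh38.104.gtf",
)
print(shlex.join(cmd))
# To actually run it (STAR must be installed and on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if a path contains spaces.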
Open question: if the OP wants to call variants, isn't it a bit risky to use only the "primary assembly"?
Let's say that chr6_FIXED is a fix for part of chr6 that will be incorporated in the next major release, and that this fix changes an A to a T. The primary assembly does not contain chr6_FIXED, but toplevel does.
Now suppose a read matches chr6_FIXED perfectly and matches chr6 with one mismatch. If you keep only the primary assembly, you could call a variant that you would never have called with toplevel, which gives a false positive.
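To make the chr6 / chr6_FIXED scenario concrete, here is a toy sketch. The sequences are made up for illustration, and a plain mismatch count stands in for a real aligner's scoring, so this only shows the logic of the argument, not how any actual aligner or variant caller behaves.

```python
def mismatches(read, ref):
    """Count positions where the read and an equally long reference differ."""
    return sum(a != b for a, b in zip(read, ref))


def best_hit(read, references):
    """Return (name, mismatch count) of the reference the read matches best."""
    name = min(references, key=lambda n: mismatches(read, references[n]))
    return name, mismatches(read, references[name])


# Hypothetical toy sequences: the fix changes an A to a T at one position.
primary = {"chr6": "GATTACAGGA"}
toplevel = {"chr6": "GATTACAGGA", "chr6_FIXED": "GATTTCAGGA"}
read = "GATTTCAGGA"

print(best_hit(read, toplevel))  # perfect match on the fixed sequence
print(best_hit(read, primary))   # one mismatch on chr6, looks like an A>T variant
```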
If you want to analyse haplotypes, you have the right fasta file.
The GTF is the right one.
Be careful: the chromosome names are not "standard" and some aligners may struggle with them. In your file chr1 is named 1, so you may have to rename each chromosome to chr1, chr2, etc.
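If you do need to rename, a rough sketch of the idea in Python is below. It assumes Ensembl-style names (1, 2, ..., X, Y, MT) and maps MT to chrM, which is the UCSC-style convention; scaffold names are left untouched here because their UCSC equivalents do not follow a simple prefix rule. The function names are my own, and whatever you do to the fasta you must also do to the GTF so the names still match.

```python
def to_ucsc_name(name):
    """Map an Ensembl-style sequence name to a UCSC-style one (simplified)."""
    if name == "MT":
        return "chrM"
    if name.isdigit() or name in ("X", "Y"):
        return "chr" + name
    return name  # scaffolds and anything else are left as-is in this sketch


def rename_fasta_line(line):
    """Rewrite the name on a FASTA header line; sequence lines pass through."""
    if not line.startswith(">"):
        return line
    ending = "\n" if line.endswith("\n") else ""
    name, sep, rest = line[1:].rstrip("\n").partition(" ")
    return ">" + to_ucsc_name(name) + sep + rest + ending


def rename_gtf_line(line):
    """Rewrite the first (seqname) column of a GTF line; comments pass through."""
    if line.startswith("#"):
        return line
    name, sep, rest = line.partition("\t")
    return to_ucsc_name(name) + sep + rest


print(rename_fasta_line(">1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF\n"), end="")
print(rename_gtf_line("MT\tensembl\tgene\t1\t100\t.\t+\t.\tgene_id \"x\";\n"), end="")
```

Applied line by line over the input files, this writes renamed copies and leaves the downloads untouched. Alternatively, downloading a fasta/GTF pair that already uses the naming you need avoids the renaming altogether.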
Thanks for your reply. I'm not sure if I understand what you mean. I'm hoping to do a differential expression analysis after the alignments. In the fasta files there were different options, cdna, cds, dna, dna index, ncrna and pep (ftp://ftp.ensembl.org/pub/release-92/fasta/mus_musculus/). I went with the fasta file in dna index but not really sure if this is what should be done. Anyone know how you decide about this?
tagging: Emily_Ensembl
Hello, I am also stuck on something very basic. I need to build an index before I run STAR and map the reads, and I downloaded these 3 files from Ensembl as a reference genome:
Homo_sapiens.GRCh38.dna.primary_assembly.fa
Homo_sapiens.GRCh38.104.gtf
Homo_sapiens.GRCh38.cdna.all.fa
Are these the correct files, please?
This file gave me an error: Homo_sapiens.GRCh38.dna.primary_assembly.fa
[W::sam_read1] Parse error at line 1
[main_samview] truncated file.
Please open a new question.