Question

Fasta file and GTF file for STAR alignment

6.5 years ago

snp87 ▴ 80

Hello, this is a very basic question, but could someone help me understand whether I've used the correct GTF and FASTA files for indexing the mouse genome (I'm using STAR)? I got both files from Ensembl: Mus_musculus.GRCm38.92.gtf.gz from ftp://ftp.ensembl.org/pub/release-92/gtf/mus_musculus/ and Mus_musculus.GRCm38.dna.toplevel.fa.gz from ftp://ftp.ensembl.org/pub/release-92/fasta/mus_musculus/dna_index/

Thank you so much!

STAR ensembl • 14k views

ADD COMMENT • link updated 21 months ago by Roman Luštrik ▴ 130 • written 6.5 years ago by snp87 ▴ 80

tagging: Emily_Ensembl

ADD REPLY • link 6.5 years ago by GenoMax 147k

Hello, I'm also stuck on something very basic. I need to build an index before I run STAR and map the reads, and I downloaded these 3 files from Ensembl as a reference genome:

Homo_sapiens.GRCh38.dna.primary_assembly.fa
Homo_sapiens.GRCh38.104.gtf
Homo_sapiens.GRCh38.cdna.all.fa

Are these the correct files, please? This file gave me an error: Homo_sapiens.GRCh38.dna.primary_assembly.fa [W::sam_read1] Parse error at line 1 [main_samview] truncated file.

ADD REPLY • link 3.5 years ago by lidiaryabova • 0

Please open a new question.

ADD REPLY • link 3.5 years ago by ATpoint 85k

score 8 · Answer 1 · 2018-06-06

Hello there,

The top-level FASTA file includes chromosomes, regions not assembled into chromosomes, and N-padded haplotype/patch regions. See more here: ftp://ftp.ensembl.org/pub/release-92/fasta/mus_musculus/dna/README. If you only want the chromosome-level sequences of the reference genome assembly, use the primary_assembly.fa file.

The files in the dna_index directory are genomic sequence files which are bgzipped and tabix indexed (for more details on what this means see: http://www.htslib.org/doc/tabix.html). These are downloaded by the Variant Effect Predictor (VEP) installer to allow quicker VEP'ing. The fasta file without the .fai or .gzi suffix, although stated to be a different size, is identical to the fasta file in the fasta/mus_musculus/dna/ folder so you can download either and you'd get the same data.

We'll update the README files, or 'hide' the dna_index folder to avoid confusion between these files in the two folders. Thanks for bringing it to our attention!
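If you want to see for yourself what a given FASTA file contains before indexing it, a small script can list the sequence names from the header lines. This is only a minimal sketch: the function names are made up for illustration, and the `CHR_` prefix check is an assumption about how Ensembl names patch/haplotype sequences, so please verify it against the README for your release.

```python
import gzip


def open_maybe_gzip(path):
    """Open a plain-text or gzip/bgzip-compressed file in text mode."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    # gzip (and bgzip, which is gzip-compatible) files start with 0x1f 0x8b.
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def fasta_sequence_names(path):
    """Return the first word of every '>' header line, in file order."""
    names = []
    with open_maybe_gzip(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.append(fields[0])
    return names


def count_patch_like(names):
    """Count names that look like patches/haplotypes.

    ASSUMPTION: such sequences carry a 'CHR_' prefix in Ensembl files.
    """
    return sum(1 for name in names if name.startswith("CHR_"))
```

Running `fasta_sequence_names()` on both the toplevel and the primary_assembly download and comparing the two lists should show which extra sequences the toplevel file carries.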

score 1 · Answer 2 · 2018-06-06

From ensembl (emphasis mine)

TOPLEVEL

These files contains all sequence regions flagged as toplevel in an Ensembl schema. This includes chromsomes, regions not assembled into chromosomes and N padded haplotype/patch regions.

From the STAR manual (emphasis mine)

2.2.1 Which chromosomes/scaffolds/patches to include?

It is strongly recommended to include major chromosomes (e.g., for human chr1-22,chrX,chrY,chrM,) as well as un-placed and un-localized scaffolds. Typically, un-placed/un-localized scaffolds add just a few MegaBases to the genome length, however, a substantial number of reads may map to ribosomal RNA (rRNA) repeats on these scaffolds. These reads would be reported as unmapped if the scaffolds are not included in the genome, or, even worse, may be aligned to wrong loci on the chromosomes. Generally, patches and alternative haplotypes should not be included in the genome.

Examples of acceptable genome sequence files:

• ENSEMBL: files marked with .dna.primary.assembly, such as: ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

So I'd say no, you don't have the right reference. Use "primary assembly" as recommended.
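For reference, here is a rough sketch of how the index command could be put together once you have the primary assembly. The STAR options used are the standard genome-generation ones; the helper function, the output directory, the read length and the primary_assembly file name (which just follows the Ensembl naming pattern) are assumptions for illustration. As far as I know STAR wants uncompressed FASTA and GTF files for this step, so the sketch refuses `.gz` inputs.

```python
import shlex


def star_index_command(genome_dir, fasta, gtf, read_length=100, threads=4):
    """Build the argument list for `STAR --runMode genomeGenerate`.

    Hypothetical helper: it only assembles the command, it does not run STAR.
    """
    for path in (fasta, gtf):
        # ASSUMPTION: inputs must be decompressed (e.g. with gunzip) first.
        if path.endswith(".gz"):
            raise ValueError(f"decompress {path} before building the index")
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
        # The STAR manual suggests read length minus 1 for this option.
        "--sjdbOverhang", str(read_length - 1),
    ]


cmd = star_index_command(
    "star_index",  # assumed output directory
    "Mus_musculus.GRCm38.dna.primary_assembly.fa",  # assumed file name
    "Mus_musculus.GRCm38.92.gtf",
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be handed to `subprocess.run(cmd, check=True)` if STAR is on your PATH.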

score 0 · Answer 3 · 2018-05-31

6.5 years ago

Bastien Hervé 5.9k

If you want to analyse haplotypes, you have the right FASTA file.

The GTF is the right one.

Be careful: the chromosome names are not "standard" and could cause trouble with some aligners. In your file chr1 is named 1, so you may have to rename each chromosome to chr1, chr2, etc.
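On the naming point: what STAR needs is for the chromosome names in the GTF to match the names in the FASTA, so a FASTA and GTF from the same Ensembl release (both using 1, 2, ...) should normally work without renaming. A quick way to check is to compare the two sets of names. This is just a sketch with hypothetical function names; it takes any iterable of lines, so open file handles work too.

```python
def fasta_names(lines):
    """Collect sequence names (first word after '>') from FASTA lines."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_names(lines):
    """Collect sequence names (first tab-separated column) from GTF lines."""
    names = set()
    for line in lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0].strip())
    return names


def compare_names(fasta_lines, gtf_lines):
    """Report which names are shared and which appear in only one file."""
    fa = fasta_names(fasta_lines)
    gt = gtf_names(gtf_lines)
    return {"shared": fa & gt, "gtf_only": gt - fa, "fasta_only": fa - gt}
```

If `gtf_only` comes back non-empty (for example `chr1` in the GTF against `1` in the FASTA), the annotations on those sequences will not line up with the genome, and that is the case where renaming is worth doing.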

ADD COMMENT • link 6.5 years ago by Bastien Hervé 5.9k

Thanks for your reply. I'm not sure I understand what you mean. I'm hoping to do a differential expression analysis after the alignments. There were several options for the FASTA files: cdna, cds, dna, dna_index, ncrna and pep (ftp://ftp.ensembl.org/pub/release-92/fasta/mus_musculus/). I went with the FASTA file in dna_index, but I'm not really sure that is what should be done. Does anyone know how to decide between them?

ADD REPLY • link 6.5 years ago by snp87 ▴ 80

Sorry for my late reply,

The file you need is Mus_musculus.GRCm38.dna.toplevel.fa.gz

Unfortunately, this file exists with two different file sizes: once in ftp://ftp.ensembl.org/pub/release-92/fasta/mus_musculus/dna_index/ and once in ftp://ftp.ensembl.org/pub/release-92/fasta/mus_musculus/dna/

Maybe try contacting Emily_Ensembl, who is the person to ask about Ensembl matters.

Try adding the ensembl tag to your post's tags. I bet she follows that tag.

ADD REPLY • link 6.5 years ago by Bastien Hervé 5.9k