I'd like to align 2x75 bp TruSeq RNA-seq data collected on an Illumina instrument to the rat reference genome using STAR, for downstream differential expression analysis. I obtained the reference genome through iGenome and ran the following command to generate the STAR indices:
STAR --runMode genomeGenerate \
--genomeDir STAR_indices \
--genomeFastaFiles iGenome/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/WholeGenomeFasta/genome.fa \
--sjdbGTFfile iGenome/Rattus_norvegicus/Ensembl/Rnor_6.0/Annotation/Genes/genes.gtf \
--sjdbOverhang 74 \
--runThreadN 8
I get warnings like the following:
WARNING: while processing sjdbGTFfile=iGenome/Rattus_norvegicus/Ensembl/Rnor_6.0/Annotation/Genes/genes.gtf: chromosome 'AABR07022620.1' not found in Genome fasta files for line:
AABR07022620.1 ensembl exon 122 427 . - . exon_id "ENSRNOE00000544043"; exon_number "1"; exon_version "1"; gene_biotype "protein_coding"; gene_id "ENSRNOG00000058846"; gene_name "AABR07022620.1"; gene_source "ensembl"; gene_version "1"; p_id "P25520"; transcript_biotype "protein_coding"; transcript_id "ENSRNOT00000091897"; transcript_name "AABR07022620.1-201"; transcript_source "ensembl"; transcript_version "1"; tss_id "TSS27633";
I get about 800 warnings of this type. It turns out the iGenome .fa file only contains chromosomes 1-20 + MT + X + Y and nothing else (so 23 sequences in total), while the iGenome .gtf has hundreds of entries for "chromosomes" in addition to those 23. One example is "chromosome" AABR07022620.1, which is found in the .gtf file but not in the .fa file.
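For anyone who wants to check the same mismatch on their own files, here is a rough sketch that lists sequence names used in a GTF but absent from a FASTA. It only uses awk and sort, not STAR; the function name gtf_only_seqnames is made up for illustration, and it assumes a tab-separated GTF whose first column is the sequence name and FASTA headers whose first word is the sequence name.

```shell
# Hypothetical helper: print sequence names that appear in column 1 of a GTF
# but have no matching header in a FASTA file.
# usage: gtf_only_seqnames genome.fa genes.gtf
gtf_only_seqnames() {
    awk -F'\t' '
        # First file (FASTA): record the first word of every ">" header line.
        FNR == NR {
            if (substr($0, 1, 1) == ">") {
                split(substr($0, 2), parts, /[ \t]/)
                seen[parts[1]] = 1
            }
            next
        }
        # Second file (GTF): skip comment lines and empty lines.
        /^#/ || $1 == "" { next }
        # Report each sequence name missing from the FASTA once.
        !($1 in seen) && !($1 in reported) { reported[$1] = 1; print $1 }
    ' "$1" "$2" | sort
}
```

With the paths from the command above, this would be called as gtf_only_seqnames followed by the genome.fa path and then the genes.gtf path. Piping the result to wc -l gives the number of sequences that are annotated but not present in the genome, which is a different number from the warning count if STAR warns per GTF line rather than per sequence.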
Should I be concerned about this? Or can I ignore these warnings, and be confident in the differential expression results I get while using the iGenome files?
Thank you, Devon. I switched over to these two reference files directly from Ensembl:
The only remaining warnings are the following 14 (which I get when using the above two files):
I'm guessing I can ignore these?
I think you can ignore those warnings.