Warnings while generating STAR indices
6.6 years ago
skhan ▴ 10

I'd like to align 2x75 bp TruSeq RNA-Seq data collected on an Illumina instrument to the rat reference genome using STAR, for downstream differential expression analysis. I obtained the reference genome through iGenome and ran the following command to generate the STAR indices:

STAR --runMode genomeGenerate \
--genomeDir STAR_indices \
--genomeFastaFiles iGenome/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/WholeGenomeFasta/genome.fa \
--sjdbGTFfile iGenome/Rattus_norvegicus/Ensembl/Rnor_6.0/Annotation/Genes/genes.gtf \
--sjdbOverhang 74 \
--runThreadN 8

I get the following warnings:

WARNING: while processing sjdbGTFfile=iGenome/Rattus_norvegicus/Ensembl/Rnor_6.0/Annotation/Genes/genes.gtf: chromosome 'AABR07022620.1' not found in Genome fasta files for line:
AABR07022620.1  ensembl exon    122 427 .   -   .   exon_id "ENSRNOE00000544043"; exon_number "1"; exon_version "1"; gene_biotype "protein_coding"; gene_id "ENSRNOG00000058846"; gene_name "AABR07022620.1"; gene_source "ensembl"; gene_version "1"; p_id "P25520"; transcript_biotype "protein_coding"; transcript_id "ENSRNOT00000091897"; transcript_name "AABR07022620.1-201"; transcript_source "ensembl"; transcript_version "1"; tss_id "TSS27633";

I get about 800 warnings of this type. It turns out the iGenome .fa file contains only chromosomes 1-20, X, Y, and MT (23 sequences in total), while the iGenome .gtf has entries on hundreds of "chromosomes" in addition to those 23. One example is "chromosome" AABR07022620.1, which appears in the .gtf file but not in the .fa file.

Should I be concerned about this? Or can I ignore these warnings, and be confident in the differential expression results I get while using the iGenome files?

STAR RNA-Seq • 3.5k views
6.6 years ago

If you want to be on the safe side you can download the genome and annotation file from Ensembl and rerun your analysis with that. Those won't produce the warnings that you've seen. I'd be surprised if there were any tangible change in the results, but "better safe than sorry" as the saying goes.

For reference, the primary issue with omitting those contigs from the reference genome is that it encourages false-positive alignments of reads originating from those contigs to other areas of the genome. In mouse and human this isn't a high risk, but it's >0 and I assume it's a higher risk still in the rat genome, which isn't going to be quite as high quality. So I would personally reprocess everything with a more comprehensive genome.
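To see how far a given FASTA/GTF pair disagrees before building an index, the sequence names in the two files can be compared directly. The following is a minimal sketch, not a vetted tool: the function name missing_in_fasta is made up for illustration, and it assumes that a sequence name is the FASTA header text up to the first whitespace and that both files are uncompressed.

```shell
# List sequence names that are used in a GTF but absent from a FASTA.
# Usage: missing_in_fasta genome.fa genes.gtf
# (missing_in_fasta is a hypothetical helper name, not part of STAR.)
missing_in_fasta() {
    fasta=$1
    gtf=$2
    fa_names=$(mktemp)
    gtf_names=$(mktemp)

    # FASTA: keep header text after '>' up to the first whitespace
    grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' \
        | LC_ALL=C sort -u > "$fa_names"

    # GTF: first tab-separated column, skipping '#' comment lines
    grep -v '^#' "$gtf" | cut -f1 | LC_ALL=C sort -u > "$gtf_names"

    # comm -13 prints only the lines unique to the second file
    LC_ALL=C comm -13 "$fa_names" "$gtf_names"

    rm -f "$fa_names" "$gtf_names"
}
```

Running `missing_in_fasta genome.fa genes.gtf | wc -l` gives the number of annotated sequences that have no counterpart in the FASTA; an empty result suggests the pair is consistent at the level of sequence names.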


Thank you, Devon. I switched over to these two reference files directly from Ensembl:

ftp://ftp.ensembl.org/pub/release-86/fasta/rattus_norvegicus/dna/Rattus_norvegicus.Rnor_6.0.dna_sm.toplevel.fa.gz
ftp://ftp.ensembl.org/pub/release-86/gtf/rattus_norvegicus/Rattus_norvegicus.Rnor_6.0.86.gtf.gz
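For anyone repeating this, here is a rough sketch of the steps, wrapped in a function so nothing runs until it is called. The function name rebuild_star_index and the output directory STAR_indices_ensembl are placeholders, and the STAR options are simply the ones from the original command with the inputs swapped. The files are decompressed first because STAR expects plain-text FASTA and GTF input for genome generation.

```shell
# Sketch only: download the two Ensembl release-86 files listed above,
# decompress them, and regenerate the STAR index.
# rebuild_star_index and STAR_indices_ensembl are placeholder names.
rebuild_star_index() {
    base=ftp://ftp.ensembl.org/pub/release-86
    fa=Rattus_norvegicus.Rnor_6.0.dna_sm.toplevel.fa
    gtf=Rattus_norvegicus.Rnor_6.0.86.gtf
    out_dir=${1:-STAR_indices_ensembl}

    wget "$base/fasta/rattus_norvegicus/dna/$fa.gz" \
         "$base/gtf/rattus_norvegicus/$gtf.gz" || return 1
    gunzip "$fa.gz" "$gtf.gz" || return 1
    mkdir -p "$out_dir"

    # Same options as the original command, with only the inputs swapped
    STAR --runMode genomeGenerate \
        --genomeDir "$out_dir" \
        --genomeFastaFiles "$fa" \
        --sjdbGTFfile "$gtf" \
        --sjdbOverhang 74 \
        --runThreadN 8
}
```

Calling `rebuild_star_index` with no argument writes the index to the placeholder directory; passing a directory name as the first argument overrides it.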

The only remaining warnings are the following 14 (which I get when using the above two files):

WARNING: long repeat for junction # 14982 : 1 197264488 197264790; left shift = 41; right shift = 255
WARNING: long repeat for junction # 43096 : 11 61465604 61510671; left shift = 255; right shift = 255
WARNING: long repeat for junction # 46600 : 12 5574666 5575140; left shift = 62; right shift = 255
WARNING: long repeat for junction # 56758 : 13 83101461 83101841; left shift = 255; right shift = 45
WARNING: long repeat for junction # 63259 : 14 72889778 72948202; left shift = 2; right shift = 255
WARNING: long repeat for junction # 105009 : 2 232136350 232136965; left shift = 255; right shift = 255
WARNING: long repeat for junction # 142601 : 5 69246706 69247220; left shift = 72; right shift = 255
WARNING: long repeat for junction # 144259 : 5 109501377 109501940; left shift = 52; right shift = 255
WARNING: long repeat for junction # 144261 : 5 109501888 109502451; left shift = 255; right shift = 2
WARNING: long repeat for junction # 169613 : 7 120763801 120764138; left shift = 31; right shift = 255
WARNING: long repeat for junction # 169876 : 7 122160547 122162660; left shift = 2; right shift = 255
WARNING: long repeat for junction # 195236 : X 22418971 22419698; left shift = 255; right shift = 31
WARNING: long repeat for junction # 197712 : X 84667620 84667985; left shift = 39; right shift = 255
WARNING: long repeat for junction # 199658 : X 153065750 153066229; left shift = 255; right shift = 86

I'm guessing I can ignore these?


I think you can ignore those warnings.
