Hello everybody,
I'm fairly new in the RNA-seq analysis sector and I'm facing some problems with STAR aligner. I'm trying to align a set of reads on the Saccharomyces cerevisiae genome (Ensembl release 74) using spike-in (ERCC92) as QC method.
I've downloaded the .fa file from Ensembl and appended the spike-in sequences at the end (file: https://www.dropbox.com/s/ge18ellc8dcxnit/SCerevisiae.cdna.all.fa?dl=0 )
I've also created the GTF with reads and spike-in annotation in a similar fashion (file: https://www.dropbox.com/s/d3zcv4gemddfmbp/Saccharomyces_cerevisiae.R64-1-1.87.gtf?dl=0 ) The ERCC92 annotation where downloaded from here: https://github.com/NCBI-Hackathons/HASSL_Homogeneous_Analysis_of_SRA_rnaSequencing_Libraries/blob/master/Spike-Ins/ERCC92/ERCC92.gtf
The genome index creation was performed with the following command:
STAR --runThreadN 10 --runMode genomeGenerate --genomeDir /homes/fgastaldello/STAR_test/S.Cer_db --genomeFastaFiles /homes/fgastaldello/SCerevisiae.cdna.all.fa
The mapping command was as follow:
STAR --sjdbGTFfile /homes/fgastaldello/Saccharomyces_cerevisiae.R64-1-1.87.gtf --quantMode GeneCounts --runThreadN 10 --genomeDir /homes/fgastaldello/STAR_test/S.Cer_db --genomeLoad NoSharedMemory --readFilesCommand gunzip -c --readFilesIn /homes/fgastaldello/dyeswap_data/Snf2/Snf2-${i}/${fastqfiles[0]} /homes/fgastaldello/dyeswap_data/Snf2/Snf2-${i}/${fastqfiles[1]} --outSAMmode Full --outSJfilterIntronMaxVsReadN 50 100 150 200 --outFilterType BySJout --outFilterMultimapNmax 2 --outFilterMismatchNmax 5 --outSAMunmapped Within
The bash script that runs this command iterates in 6 folders and substitute the variables ${fastqfiles[N]} with a sample files containing pair-end reads.
The command runs fine but the final result of the entire computation map only the spike-in. All the other genes are missing (file: https://www.dropbox.com/s/0e529ku9ub9z77y/ReadsPerGene.out.tab?dl=0 ).
Since I'm new I'm probably doing something silly but I can't get my head around it.
Thanks for any help.
Grab a few reads from your file and check them using BLAST at NCBI to confirm they hit yeast genome. Hopefully there are no surprises (e.g. data you were given does not belong to you).
I've done as you suggested and they map to the yeast genome.
Then I suggest that you start with simplest STAR options to see if that restores mapping. Perhaps you are using a filter that is removing the mapping (
--outFilterMultimapNmax 2
this may be one of those culprits).You could also give
bbmap.sh
from BBMap suite a try.Try to count with featureCounts, the star bam file (you may have to sort it) and your gtf annotation.
edit: also, check STAR log output, it should have information about %reads mapping to the genome, properly paired, multi-mappers, etc.
For future ref: featureCounts does not require a pre-sorted file.
I did not remember that, although it is an amazing software, I have been using only STAR for a while now.
The log files from STAR are showing an average of 82 % unique mapped reads and low (less than 0.1%) unmapped reads for each sample.
I'm struggling to make featureCounts work since I operate on two Windows machines.
I'm quite incline to think that the GTF is not properly formatted. For example:
...
Mito ensembl start_codon 85554 85556 . + 0 gene_id "Q0297"; transcript_id "Q0297"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; trans$
Mito ensembl stop_codon 85707 85709 . + 0 gene_id "Q0297"; transcript_id "Q0297"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; trans$ ERCC-00002 ERCC exon 1 1061 0.000000 + . gene_id "ERCC-00002"; transcript_id "DQ459430";
ERCC-00003 ERCC exon 1 1023 0.000000 + . gene_id "ERCC-00003"; transcript_id "DQ516784";
...
There is a clear difference between the two styles. What is the correct way to append the genome GTF and the ERCC one? is there a preferred source for the latter?