I'm sorry if this question comes off as naive or ignorant; I'm very new to bioinformatics. I'm trying to do an alignment with STAR and was wondering if I could access a pre-made STAR index for the mm10 genome. I was told I could get this from UCSC but have had no luck finding it there.
So my question is: are there pre-made STAR index files for the mm10 genome that I could download? And if so, where and how?
Thanks in advance for any help, and I'm sorry to ask such a trivial question! Let me know if there's any more detail I can give!
I'd suggest generating your own index using the mm10 genome as per the instructions below, and using the latest GENCODE mouse genes. To keep things consistent (major problem in bioinformatics!!!) I'd download BOTH the genome and the annotation GTF from here: http://www.gencodegenes.org/mouse_releases/current.html
You want the "Comprehensive gene annotation" PRI GTF and the "Genome sequence, primary assembly (GRCm38)" PRI FASTA (this is your genome).
@Alex has some pre-made indexes available at the STAR Genomes site. There does not appear to be a UCSC version for mouse, but there is Gencode Mouse, which you can use.
I have another newbie question though: when I follow your Gencode Mouse link I find a bunch of links. Would you be able to tell me which one I should use as the index when I'm doing the alignment?
Thanks again and sorry if this is a silly question!
Thank you so much for the response, this is very helpful!
I'd very strongly suggest you build your own index! To do this:
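Download the two files I suggested, change into the directory where you saved them, then run STAR in genome-generate mode (--runMode genomeGenerate) with the FASTA as the genome and the GTF as the splice junction annotation.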
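A minimal sketch of such an index build, assuming both files have been gunzipped (STAR reads plain-text FASTA and GTF when generating a genome). The file names, the star_index output directory, and the helper names run and build_star_index are placeholders for illustration; swap in whatever you actually downloaded.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Arguments: genome FASTA, annotation GTF, output directory,
# thread count (default 4), sjdbOverhang (default 100).
build_star_index() {
    # Create the output directory only for a real run.
    [ "${DRY_RUN:-0}" = 1 ] || mkdir -p "$3"
    run STAR --runThreadN "${4:-4}" --runMode genomeGenerate \
        --genomeDir "$3" --genomeFastaFiles "$1" \
        --sjdbGTFfile "$2" --sjdbOverhang "${5:-100}"
}

# Dry run: only prints the STAR command so you can check it first.
# Placeholder file names; use the ones you downloaded.
DRY_RUN=1 build_star_index \
    GRCm38.primary_assembly.genome.fa \
    gencode.vM25.primary_assembly.annotation.gtf \
    star_index 4 100
```

Drop the DRY_RUN=1 prefix to actually build the index once the printed command looks right.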
This will use 4 cores to generate a genome index together with a splice junction annotation (which you want!!!). The 100 (the --sjdbOverhang value) allows your reads to overhang each splice junction by at most 100 bp. If your reads are longer (150 bp, say), raise this parameter to match; the STAR manual suggests read length minus 1. Then map against this index.
NB: If you plan to do differential expression, use featureCounts or HTSeq to count reads against that same GENCODE GTF.
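As a rough sketch of those two follow-up steps (mapping, then counting with featureCounts), assuming single-end gzipped FASTQ input and the index directory from the build step. All file names, the sample_ prefix, and the helper names run, map_sample and count_genes are made up for illustration; paired-end data would need a second FASTQ and paired-end options for featureCounts.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Map one single-end sample against the index.
# Arguments: index directory, gzipped FASTQ, output file prefix.
map_sample() {
    run STAR --runThreadN 4 --genomeDir "$1" \
        --readFilesIn "$2" --readFilesCommand zcat \
        --outSAMtype BAM SortedByCoordinate --outFileNamePrefix "$3"
}

# Count reads per gene against the same GTF that went into the index.
# Arguments: annotation GTF, output table, BAM file.
count_genes() {
    run featureCounts -T 4 -t exon -g gene_id -a "$1" -o "$2" "$3"
}

# Dry runs: only print the commands. STAR names its sorted BAM
# <prefix>Aligned.sortedByCoord.out.bam.
DRY_RUN=1 map_sample star_index sample.fastq.gz sample_
DRY_RUN=1 count_genes gencode.vM25.primary_assembly.annotation.gtf \
    counts.txt sample_Aligned.sortedByCoord.out.bam
```

Using the same GTF for the index and for counting is what keeps gene IDs and coordinates consistent between the two steps.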
OK thank you very much! I will definitely try this out.
One more question. The CPU of the server I'm using has 40 cores. Does this change how many I should use to build the index?
Thanks again!
It will just be faster with more cores, but that will not influence the behavior of the index files. Essentially, the files will be the same regardless of the number of cores used.
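If the server is shared, one option is to derive the thread count from the machine but cap it. This is only a sketch: the helper name pick_threads and the cap of 16 are arbitrary choices for illustration, not a STAR recommendation.

```shell
# Print the number of online cores, capped at the given limit (default 8).
pick_threads() {
    cap=${1:-8}
    # getconf reports online cores on Linux and macOS; fall back to 1.
    cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    if [ "$cores" -gt "$cap" ]; then
        echo "$cap"
    else
        echo "$cores"
    fi
}

# Show the value you would pass to STAR's --runThreadN option.
echo "--runThreadN $(pick_threads 16)"
```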
OK Awesome, Thanks again
Just for clarity: the primary assembly GRCm38 is not the same genome as mm10 from UCSC, correct? So per the ENCODE data standards you would download mm10 as the genome (which is based on GRCm38) and then use the GENCODE comprehensive GTF for annotation?
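Whichever source the genome comes from, a quick sanity check before building the index is that the FASTA and the GTF use the same sequence names (for example chr1 versus 1). A minimal sketch, assuming uncompressed files; the function name missing_seqnames is made up for illustration.

```shell
# List sequence names that appear in the GTF but not in the FASTA.
# Arguments: genome FASTA, annotation GTF. Prints nothing if all match.
missing_seqnames() {
    awk '
        # First file (FASTA): remember each header name without the ">".
        NR == FNR { if (/^>/) seen[substr($1, 2)] = 1; next }
        # Second file (GTF): skip comment lines.
        /^#/ { next }
        # Report each unknown sequence name once.
        !($1 in seen) && !($1 in reported) { print $1; reported[$1] = 1 }
    ' "$1" "$2"
}

# Usage (placeholder file names):
#   missing_seqnames GRCm38.primary_assembly.genome.fa gencode.vM25.primary_assembly.annotation.gtf
```

Any names it prints are sequences the annotation refers to that are absent from the genome file under that name, which is worth resolving before running STAR.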