Question

Stuck creating reference genome with STAR

8.9 years ago

nash.claire ▴ 510

Hi again,

I want to use STAR to run my RNA-seq analysis; however, I'm having issues at the first hurdle, trying to generate a reference genome.

I want to use the newest rat rn6 build but keep getting errors with genomeGenerate. Here is my command:

--runMode genomeGenerate --genomeDir /path/to/directory --genomeFastaFiles ~/path/to/directory/rn6_chr1.fa rn6_chr2.fa rn6_chr3.fa rn6_chr4.fa rn6_chr5.fa rn6_chr6.fa rn6_chr7.fa rn6_chr8.fa rn6_chr9.fa rn6_chr10.fa rn6_chr11.fa rn6_chr12.fa rn6_chr13.fa rn6_chr14.fa rn6_chr15.fa rn6_chr16.fa rn6_chr17.fa rn6_chr18.fa rn6_chr19.fa rn6_chr20.fa rn6_chrMT.fa rn6_chrX.fa rn6_chrY.fa --sjdbGTFfile ~/path/to/directory/rn6.gtf --sjdbOverhang 49 --runThreadN 12 --outFileNamePrefix /path/to/directory/rn6

And here is my error:

EXITING because of INPUT ERROR: could not open genomeFastaFile: path/to/directory/rn6_chr1.fa

Here are the points and possible errors I've already covered after reading posts and forums:

- I'm using separate chromosome fasta files, as I read that using toplevel.dna files is not good and there isn't a primary.dna file for rn6 yet. I tried the toplevel fa file with no success.

- I've gone through and checked with chmod that every directory where my files are stored, as well as my output directories, is fully writable, readable and executable.

- My genomeDir is completely empty and is situated on a RAID with tons of free space.

- My fasta files and gtf file were downloaded from Ensembl and both look fine.

- I'm running this on a Mac Pro with a 12-core processor and 64 GB of RAM, and I have played with the thread settings, which had no effect.

- My reads are 50 bp in length and paired-end, hence the sjdbOverhang setting of 49.

I'm completely stuck and lost guys. The manual isn't helping and I've exhausted all the STAR google group and biostars posts relating to this. Can anyone help??

genome rna-seq • 8.1k views

ADD COMMENT • link updated 6.1 years ago by valizad2 ▴ 20 • written 8.9 years ago by nash.claire ▴ 510

Hi guys,

Thanks so much for the help. I'll try playing around with the file path later to see if that works, and I'll change the Overhang setting as suggested. The reason I have the separate chromosome files is that I started off with the toplevel.dna.fa file from Ensembl and genomeGenerate wasn't working. I read that we shouldn't use toplevel fasta files, as they contain all the haplotype data and so on, which can cause issues. Since there is no primary.dna.fasta file available on Ensembl, I went for the separate chromosome files instead. However, I'd appreciate your opinion on the matter.

ADD REPLY • link 8.9 years ago by nash.claire ▴ 510

Were you able to resolve this issue? I am having the same problem!

ADD REPLY • link 6.1 years ago by valizad2 ▴ 20

score 2 · Answer 1 · 2016-01-13

8.9 years ago

harold.smith.tarheel ★ 5.0k

That error is returned when the path is incorrect. Are the genomeFastaFiles nested in your home directory (~) as indicated, or should the path be from the top level like --genomeDir? You can check the path from the desired directory using 'pwd'.
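A quick way to test this before rerunning STAR is to check each path from the shell. The sketch below is only an illustration: `check_fasta_paths` is a made-up helper name, not part of STAR, and all it does is report which of the paths you pass are not readable regular files.

```shell
# Hypothetical helper (not part of STAR): report which paths are readable files.
check_fasta_paths() {
    missing=0
    for f in "$@"; do
        if [ -f "$f" ] && [ -r "$f" ]; then
            printf 'ok       %s\n' "$f"
        else
            printf 'MISSING  %s\n' "$f"
            missing=$((missing + 1))
        fi
    done
    # Succeed only if every path could be read.
    [ "$missing" -eq 0 ]
}
```

Calling it with exactly the arguments given to --genomeFastaFiles, for example `check_fasta_paths ~/path/to/directory/rn6_chr1.fa rn6_chr2.fa`, shows whether each path resolves from the directory you are currently in.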

ADD COMMENT • link 8.9 years ago by harold.smith.tarheel ★ 5.0k

Ram · Answer 2 · 2016-01-14

Check your home directory as harold.smith.tarheel said. If you are still experiencing problems then your fasta files might be corrupted.

Download the Illumina iGenome for rn6 here:

ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Rattus_norvegicus/UCSC/rn6/Rattus_norvegicus_UCSC_rn6.tar.gz

Then run on your cluster

STAR --runMode genomeGenerate \
--genomeDir /path/to/directory \
--genomeFastaFiles /path/to/directory/Rattus_norvegicus/UCSC/rn6/Sequence/WholeGenomeFasta/genome.fa \
--runThreadN 12 --outFileNamePrefix /path/to/directory/rn6

Ram · Answer 3 · 2016-01-14

8.9 years ago

Michael 55k

In addition:

my reads are 50 bp in length and paired end hence me using the 49 sjdbOverhang setting

sjdbOverhang should be 99, i.e. mate length - 1, where the mate length is 2 * read length for paired-end reads, AFAIK. Please check the documentation.

Why do you want to break down the full FASTA file? It just makes things more complicated. There are other ways to save memory, and I am not sure that splitting the file reduces memory requirements at all.
If you still want per-chromosome files, each one of them needs the correct path set, not just the first one, as in ~/path/to/directory/rn6_chr1.fa ~/path/to/directory/rn6_chr2.fa ... ~/path/to/directory/rn6_chrY.fa

not ~/path/to/directory/rn6_chr1.fa rn6_chr2.fa ... rn6_chrY.fa
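The shell expands `~` only at the start of each word, so the directory is never carried over to the following file names; they are looked up relative to the current working directory. One way to avoid typing the directory 23 times is to build the list in a loop. This is a sketch under assumptions: `FASTA_DIR` is a placeholder variable I made up, the paths are assumed to contain no whitespace, it is meant for sh or bash, and the command is only echoed so you can inspect it first.

```shell
# Assumption: FASTA_DIR is a placeholder for the directory that holds the
# per-chromosome files; change it to the real location.
FASTA_DIR="${FASTA_DIR:-$HOME/path/to/directory}"

# Prefix every chromosome file with the directory, not just the first one.
# Assumes the paths contain no whitespace.
FASTA_FILES=""
for chr in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 MT X Y; do
    FASTA_FILES="$FASTA_FILES $FASTA_DIR/rn6_chr${chr}.fa"
done

# Echo the command for inspection; drop the leading "echo" to actually run it.
# FASTA_FILES is left unquoted on purpose so it splits into one word per file.
echo STAR --runMode genomeGenerate \
    --genomeDir /path/to/directory \
    --genomeFastaFiles $FASTA_FILES \
    --sjdbGTFfile "$FASTA_DIR/rn6.gtf" \
    --sjdbOverhang 49 \
    --runThreadN 12
```

The other options are copied from the command in the question; only the file list is built differently.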

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by Michael 55k

For anyone else reading this thread, ... sjdbOverhang of 49 seems right to me. Here's a quote from the STAR manual:

--sjdbOverhang specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads. For instance, for Illumina 2x100b paired-end reads, the ideal value is 100-1=99. In case of reads of varying length, the ideal value is max(ReadLength)-1. In most cases, a generic value of 100 will work as well as the ideal value.
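To make the arithmetic from that quote concrete, here is a small sketch. `sjdb_overhang` is a hypothetical helper name of my own, not a STAR option; it just computes max(ReadLength) - 1 from whatever read lengths you pass in.

```shell
# Hypothetical helper: print max(ReadLength) - 1 for the read lengths given.
sjdb_overhang() {
    max=0
    for len in "$@"; do
        if [ "$len" -gt "$max" ]; then
            max=$len
        fi
    done
    echo $((max - 1))
}

# 50 bp paired-end reads: each mate is 50 bp long, so ReadLength is 50.
SJDB_OVERHANG=$(sjdb_overhang 50)
echo "--sjdbOverhang $SJDB_OVERHANG"
```

For the 2x100 example in the manual, `sjdb_overhang 100` gives 99, and for reads of varying length you would pass every length that occurs.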

ADD REPLY • link 6.6 years ago by skhan ▴ 10