Question

problem indexing genome and alignment with STAR aligner

0

7.0 years ago

pr.khavari • 0

Hi everyone,
I am trying to generate genome indexes with STAR to align my RNAseq data, with this command line:

 /data/software/STAR/source/STAR --runThreadN 16 --runMode genomeGenerate --genomeDir star_genome3/ --genomeFastaFiles Pvulgaris_442_v2.0.fa --sjdbGTFfile phavu.G19833.gnm2.ann1.PB8d.gene_exons.gff3 --sjdbGTFfeatureExon exon --sjdbGTFtagExonParentTranscript Parent --genomeChrBinNbits 18 --sjdbOverhang 100

but it finishes after only about 5 minutes, and I suspect something is wrong because it is so fast.

The list output of it is here:

Genome SAindex chrName.txt chrStart.txt exonInfo.tab genomeParameters.txt sjdbList.fromGTF.out.tab transcriptInfo.tab SA chrLength.txt chrNameLength.txt exonGeTrInfo.tab geneInfo.tab sjdbInfo.txt sjdbList.out.tab

Then I changed the run mode to alignment with this command:

 /data/software/STAR/source/STAR --runMode alignReads --genomeDir /data/mshoorooei/star_genome4/ --runThreadN 16 --outFilterMismatchNmax 2 --readFilesIn PE_27_F.fq.gz PE_27_R.fq.gz --readFilesCommand gunzip -c --outFileNamePrefix 27_ --outReadsUnmapped unmapped_27 --outSAMtype BAM SortedByCoordinate

Output is here:

27_Aligned.sortedByCoord.out.bam
27_Log.final.out
27_Log.out
27_Log.progress.out
27_SJ.out.tab

Unfortunately, this run shows the same problem: it also finishes very quickly.

Do you have any idea? Thanks for your suggestions.

RNA-Seq genome alignment • 3.7k views

ADD COMMENT • link updated 5.8 years ago by h.mon 35k • written 7.0 years ago by pr.khavari • 0

score 1 · Answer 1 · 2017-12-02

1

7.0 years ago

Michael 55k

This looks like a normal run, and your genome is probably rather small. Look at the contents of Log.progress.out and Log.final.out. The run is simply faster than you expected; nothing is wrong.

Edit: I see one minor issue. You should use:

--outReadsUnmapped Fastx

not --outReadsUnmapped some_filename. That may be why you don't get the Unmapped.out.mate1/2 files.
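As a rough sketch, here is the alignment command from the question with only that option changed, wrapped in a small function. The function name align_sample and the STAR_BIN variable are made up for this example; the STAR path, genome directory and PE_<id>_F/R.fq.gz naming are simply copied from the question, so adjust them to your setup.

```shell
# Hypothetical wrapper around the alignment command from the question.
# STAR_BIN may point to another STAR binary; the default is the path
# used in the question.
align_sample() {
    sample="$1"
    "${STAR_BIN:-/data/software/STAR/source/STAR}" \
        --runMode alignReads \
        --genomeDir /data/mshoorooei/star_genome4/ \
        --runThreadN 16 \
        --outFilterMismatchNmax 2 \
        --readFilesIn "PE_${sample}_F.fq.gz" "PE_${sample}_R.fq.gz" \
        --readFilesCommand gunzip -c \
        --outFileNamePrefix "${sample}_" \
        --outReadsUnmapped Fastx \
        --outSAMtype BAM SortedByCoordinate
}
```

Calling `align_sample 27` should then, as far as I know, also write 27_Unmapped.out.mate1 and 27_Unmapped.out.mate2 next to the BAM file.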

ADD COMMENT • link 7.0 years ago by Michael 55k

0

Thanks for your comment. My genome is nearly 600 Mb. How can I tell whether Log.progress.out and Log.final.out are right?

ADD REPLY • link 7.0 years ago by pr.khavari • 0

1

You should watch the output of STAR while it is running. During genome generation it should output:

Nov 22 10:01:37 ..... started STAR run
Nov 22 10:01:37 ... starting to generate Genome files
Nov 22 10:02:34 ... starting to sort Suffix Array. This may take a long time...
Nov 22 10:03:00 ... sorting Suffix Array chunks and saving them to disk...
Nov 22 10:06:54 ... loading chunks from disk, packing SA...
Nov 22 10:07:24 ... finished generating suffix array
Nov 22 10:07:24 ... generating Suffix Array index
Nov 22 10:09:48 ... completed Suffix Array index
Nov 22 10:09:48 ..... processing annotations GTF
Nov 22 10:09:48 ..... inserting junctions into the genome indices
Nov 22 10:10:26 ... writing Genome to disk ...
Nov 22 10:12:04 ... writing Suffix Array to disk ...
Nov 22 10:13:04 ... writing SAindex to disk
**Nov 22 10:13:26 ..... finished successfully**

This was for a 680 Mbase genome in 33,000 scaffolds, using 120 CPUs, but I don't think multiple cores help much during genome generation.

During alignment it should output something like:

Jun 06 21:12:47 ..... Started STAR run
Jun 06 21:12:47 ..... Loading genome
Jun 06 21:12:47 ..... Started mapping
Jun 06 21:14:09 ..... Finished successfully

Using Log.final.out, you can then compare the number of input reads with the number of sequences in your input file (they should of course be the same) and check the mapping rate (90%+ is common for good data).
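One way to make that comparison concrete, as a rough sketch: the two helper functions below (star_field and fastq_reads are names made up for this example, not part of STAR) pull one field out of Log.final.out and count the records in a gzipped FASTQ file. This assumes the usual "name |<TAB>value" layout of Log.final.out and plain four-line FASTQ records.

```shell
# Print the value of one field from a STAR Log.final.out file.
# Usage: star_field FILE "Field name"
# Lines are expected to look like "   Number of input reads |<TAB>13547152".
star_field() {
    awk -F'|' -v key="$2" '
        { name = $1; gsub(/^[ \t]+|[ \t]+$/, "", name) }
        name == key { val = $2; gsub(/[ \t]/, "", val); print val; exit }
    ' "$1"
}

# Count the records in a gzipped FASTQ file (assumes 4 lines per read).
fastq_reads() {
    lines=$(gzip -dc "$1" | wc -l)
    echo $((lines / 4))
}
```

For the run in the question, `star_field 27_Log.final.out "Number of input reads"` should match `fastq_reads PE_27_F.fq.gz` (for paired-end data STAR counts a pair as one read, as far as I know), and `star_field 27_Log.final.out "Uniquely mapped reads %"` gives the unique mapping rate.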

ADD REPLY • link 7.0 years ago by Michael 55k

0

Here is the output during genome generation. It looks the same.

Dec 02 09:58:02 ..... started STAR run
Dec 02 09:58:02 ... starting to generate Genome files
Dec 02 09:58:10 ... starting to sort Suffix Array. This may take a long time...
Dec 02 09:58:13 ... sorting Suffix Array chunks and saving them to disk...
Dec 02 10:00:26 ... loading chunks from disk, packing SA...
Dec 02 10:00:40 ... finished generating suffix array
Dec 02 10:00:40 ... generating Suffix Array index
Dec 02 10:01:58 ... completed Suffix Array index
Dec 02 10:01:58 ... writing Genome to disk ...
Dec 02 10:01:58 ... writing Suffix Array to disk ...
Dec 02 10:02:00 ... writing SAindex to disk
Dec 02 10:02:01 ..... finished successfully

And this is the STAR output during alignment:

Dec 02 10:28:24 ..... started STAR run
Dec 02 10:28:24 ..... loading genome
Dec 02 10:28:26 ..... started mapping
Dec 02 10:32:17 ..... started sorting BAM
Dec 02 10:33:38 ..... finished successfully

ADD REPLY • link 7.0 years ago by pr.khavari • 0

0

So there is no obvious error. Your genome generation is faster than ours, but this is probably I/O-related.
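If you want to compare run times like this without reading the timestamps by eye, something along these lines may help. It is only a sketch: it assumes you saved STAR's console output to a file, star_finished and star_elapsed are made-up helper names, and the elapsed time is wrong for runs that cross midnight.

```shell
# Succeed if the last line of a saved STAR console log reports success.
star_finished() {
    tail -n 1 "$1" | grep -q 'finished successfully'
}

# Print the seconds between the first and last timestamped line
# (lines like "Dec 02 09:58:02 ..... started STAR run").
star_elapsed() {
    awk '
        $3 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ {
            split($3, t, ":")
            s = t[1] * 3600 + t[2] * 60 + t[3]
            if (!seen) { first = s; seen = 1 }
            last = s
        }
        END { if (seen) print last - first }
    ' "$1"
}
```

Applied to the two genome generation logs posted above, this gives 709 seconds for our run and 239 seconds for yours.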