Question

STAR index generation for bacterial genome

3.0 years ago

Husain Poonawala ▴ 10

Hi,

I'm trying to analyze RNA-Seq data for a bacterium, Mycobacterium tuberculosis. I used the FASTA and GTF files from NCBI to create the index and set --genomeSAindexNbases to 8, based on this previous post. The bash script I used is:

# load modules
module load gcc/6.2.0 star/2.7.0a

# launch star
STAR --runThreadN 8 \
--runMode genomeGenerate \
--genomeDir /home/xyz/scratch/sanraffaele/indices/star/ \
--genomeFastaFiles ~/reference_data/NC000962_3.fasta \
--sjdbGTFfile ~/reference_data/NC000962_3.gtf \
--genomeSAindexNbases 8

Index generation takes ~15 seconds, and on reviewing the files in the folder it appears that the index has only 70 or so transcripts. Between the short time to generate the index (genome length is 4M bp) and the presence of so few transcripts, I suspect that something is wrong. Any suggestions about what I should do differently?
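
For reference, the STAR manual suggests scaling --genomeSAindexNbases down for small genomes to min(14, log2(GenomeLength)/2 - 1). Below is a minimal sketch of that arithmetic, assuming the rough 4M bp figure mentioned above. The variable names are illustrative only, and the real sequence length should be substituted before relying on the result.

```shell
# Rough genome length taken from the post (assumption: replace with the
# actual length of your FASTA sequence).
genome_length=4000000

# min(14, log2(GenomeLength)/2 - 1), truncated to an integer.
# awk's log() is the natural logarithm, so divide by log(2) to get log2.
sa_nbases=$(awk -v n="$genome_length" 'BEGIN {
    v = int(log(n) / log(2) / 2 - 1)
    if (v > 14) v = 14
    print v
}')

echo "Suggested --genomeSAindexNbases: $sa_nbases"
```

The printed value can then be compared with the 8 used in the script above.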

STAR bacteria index • 2.0k views


Since you don't need to worry about splicing, there is no specific advantage to using STAR. You could use any aligner.

"it appears that the index has only 70 or so transcripts"

Not sure what you mean by that. It is not unusual to have the index finish quickly. You have a small genome. You can try doing an alignment and see what you get.

3.0 years ago by GenoMax 147k

Thank you - I will try that.

3.0 years ago by Husain Poonawala ▴ 10

Update: I realized that generating the index needs only the FASTA file. The GTF file is necessary only if one wants to generate a read count matrix. For bacterial GTF files, Alex Dobin recommends changing column 3 to "exon" for all entries, as discussed in this post.
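
The column 3 change can be sketched with awk. This is only an illustration: the file names and the two annotation rows are made-up stand-ins so the snippet runs on its own, and the real NCBI GTF should be used in their place. It rewrites the feature field of every non-comment row and leaves header lines untouched.

```shell
# Hypothetical file names (assumption: substitute your own annotation).
in_gtf="example_input.gtf"
out_gtf="example_exon.gtf"

# Tiny made-up annotation so the sketch is self-contained.
printf '#!genome-build example\n' > "$in_gtf"
printf 'NC_000962.3\tRefSeq\tgene\t1\t1524\t.\t+\t.\tgene_id "Rv0001";\n' >> "$in_gtf"
printf 'NC_000962.3\tRefSeq\tCDS\t1\t1524\t.\t+\t0\tgene_id "Rv0001"; transcript_id "Rv0001_t";\n' >> "$in_gtf"

# Set column 3 (feature type) to "exon" on every data row;
# lines starting with '#' are copied through unchanged.
awk 'BEGIN { FS = OFS = "\t" }
     /^#/ { print; next }
     { $3 = "exon"; print }' "$in_gtf" > "$out_gtf"

cat "$out_gtf"
```

It is worth checking the converted file by eye before passing it to --sjdbGTFfile, since some rows in a real annotation may not be ones you want counted.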

3.0 years ago by Husain Poonawala ▴ 10