3.0 years ago
Husain Poonawala
Hi,
I'm trying to analyze RNA-Seq data for a bacterium, Mycobacterium tuberculosis. I used the FASTA and GTF files from NCBI to create the index, and set --genomeSAindexNbases to 8 based on this previous post. The bash script I used is:
# load modules
module load gcc/6.2.0 star/2.7.0a
# launch star
STAR --runThreadN 8 \
--runMode genomeGenerate \
--genomeDir /home/xyz/scratch/sanraffaele/indices/star/ \
--genomeFastaFiles ~/reference_data/NC000962_3.fasta \
--sjdbGTFfile ~/reference_data/NC000962_3.gtf \
--genomeSAindexNbases 8
The index generation is taking ~15 seconds, and on reviewing the files in the folder it appears that the index has only 70 or so transcripts. Between the short time to generate the index (the genome is about 4 Mbp long) and the presence of so few transcripts, I know that something is wrong. Any suggestions about what I should do differently?
Since you don't need to worry about splicing, there is no specific advantage to using STAR. You could use any aligner. Not sure what you mean by that. It is not unusual to have the index finish quickly. You have a small genome. You can try doing an alignment and see what you get.
Thank you - I will try that.
Update: I realized that generating the index needs only the FASTA file. The GTF file is necessary only if one is interested in generating a read count matrix. For bacterial GTF files, Alex Dobin recommends changing column 3 to "exon" for all entries, as discussed in this post.
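For anyone landing here later, the column 3 change can be sketched with awk roughly as below. This is a minimal sketch, not the exact command from the linked post: the function name `gtf_features_to_exon` and the output file name are made up for illustration, and it assumes a tab-separated, 9-column GTF in which every feature line should become "exon". If the NCBI file has both a gene line and a CDS line for the same locus, you may want to filter to one feature type first so each gene is not counted twice.

```shell
# Hypothetical helper (name is illustrative): set the feature column
# (column 3) of a GTF file to "exon" for every feature line.
# Assumes tab-separated GTF with 9 columns; "#" comment lines and any
# malformed short lines are passed through unchanged.
gtf_features_to_exon() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/    { print; next }     # keep header/comment lines as they are
         NF >= 9 { $3 = "exon" }     # overwrite the feature type
                 { print }' "$1"
}

# Example usage (the output file name is a placeholder):
# gtf_features_to_exon ~/reference_data/NC000962_3.gtf > NC000962_3.exon.gtf
```

The modified file can then be given to `--sjdbGTFfile` in place of the original. As far as I understand, STAR only picks up lines whose feature type is "exon" by default, which would be consistent with the index showing only ~70 transcripts when built from the unmodified file.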