2.4 years ago
Chris
Hello all,
I would like to run STAR instead of HISAT2. My current HISAT2 command is:
hisat2 -q --rna-strandness R -x HISAT2/grch38/genome -U data/demo_trimmed.fastq | samtools sort -o HISAT2/demo_trimmed.bam
For STAR, I tried:
STAR --runThreadN 6 \
--runMode genomeGenerate \
--genomeDir chr1_hg38_index \
--genomeFastaFiles /home/doanc2/data/demo_trimmed.fastq \
--sjdbGTFfile /home/doanc2/hg38/Homo_sapiens.GRCh38.92.gtf \
--sjdbOverhang 99
I got this error:
EXITING because of INPUT ERROR: the file format of the genomeFastaFile: /vcu_gpfs2/home/doanc2/data/demo_trimmed.fastq is not fasta: the first character is '@' (64), not '>'.
Solution: check formatting of the fasta file. Make sure the file is uncompressed (unzipped).
So I have to convert my FASTQ file to a FASTA file, right? If so, I used:
sed -n '1~4s/^@/>/p;2~4p' demo_trimmed.fastq > demo_trimmed.fasta
Is that correct?
I got a new error:
EXITING because of FATAL PARAMETER ERROR: limitGenomeGenerateRAM=31000000000is too small for your genome
SOLUTION: please specify --limitGenomeGenerateRAM not less than 873673523466 and make that much RAM available
So how can I change the parameter as the solution suggested above? Thank you so much!
Read the STAR manual, please. You should generate a genome using the actual reference FASTA file, not your FASTQ files.
As for your second question, STAR is literally giving you the solution.
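To make the first point concrete, here is a minimal sketch of index generation that points --genomeFastaFiles at a reference genome rather than at the reads. The FASTA name, the index directory name and the run helper are placeholders of mine, not something from the original post; the STAR options are the ones already used above plus the --limitGenomeGenerateRAM option named in the error message. The 873 GB request most likely came from STAR treating every read as its own reference sequence (my assumption, the error message does not say so), so with a real reference the requirement should be far lower.

```shell
#!/bin/sh
# Sketch: build a STAR index from a reference genome FASTA (not from reads).
# All paths are placeholders; adjust them to your own layout.
GENOME_FA="Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa"  # uncompressed reference FASTA (assumed name)
GTF="Homo_sapiens.GRCh38.92.gtf"                            # annotation file from the original post
INDEX_DIR="star_index_grch38"                               # hypothetical output directory
RAM_BYTES=31000000000                                       # value reported in the error message

# Hypothetical helper: print the command by default; set DRY_RUN=0 to execute it.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "$@"; fi
}

star_generate_index() {
    run mkdir -p "$INDEX_DIR"
    # --sjdbOverhang 99 is kept from the original post; the STAR manual
    # suggests read length minus 1 as the ideal value.
    run STAR --runThreadN 6 \
        --runMode genomeGenerate \
        --genomeDir "$INDEX_DIR" \
        --genomeFastaFiles "$GENOME_FA" \
        --sjdbGTFfile "$GTF" \
        --sjdbOverhang 99 \
        --limitGenomeGenerateRAM "$RAM_BYTES"
}

star_generate_index
```

Run as is, this only prints the commands so they can be checked; DRY_RUN=0 executes them. Raise RAM_BYTES only if STAR asks for more and the machine actually has that much memory.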
Thank you so much for your reply!
I see several hg38 files here:
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/
So which file should I download?
Read the README there. Which file do you think would be most useful to you?
this one?
hg38.fa.gz - "Soft-masked" assembly sequence in one file. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case. (again, the most current version of this file is latest/hg38.fa.gz)
Sure, again - it's the one most useful to you. Soft-masked assembly is a great choice. Personally I'd pick a reference genome from the Gencode project and not UCSC, but that's a personal choice because I like EnsEMBL's versioning system. As long as you record the source URLs, file versions and maybe the download dates, you're golden.
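One practical point when mixing sources: UCSC names sequences chr1, chr2, and so on, while Ensembl uses 1, 2, and so on, so a UCSC FASTA paired with an Ensembl GTF (like the one in the original command) will not line up. A minimal sketch of a sanity check follows; the function names are mine and the file names in the usage line are placeholders.

```shell
#!/bin/sh
# Sketch: check that the sequence names in a GTF also exist in a FASTA.

# Print the sequence names in a FASTA file (first word after '>').
fasta_names() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | sort -u
}

# Print the sequence names used in a GTF file (column 1, skipping '#' comment lines).
gtf_names() {
    grep -v '^#' "$1" | cut -f1 | sort -u
}

# Print GTF sequence names that are missing from the FASTA.
# Empty output means every annotated sequence is present in the genome.
missing_in_fasta() {
    tmp="${TMPDIR:-/tmp}/fa_names.$$"
    fasta_names "$1" > "$tmp"
    gtf_names "$2" | comm -13 "$tmp" -
    rm -f "$tmp"
}

# Usage (placeholder file names):
#   missing_in_fasta genome.fa annotation.gtf
```

If this prints anything, the FASTA and GTF probably come from sources with different naming conventions, and it is easier to fix that before building the index.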
I was confused by the UCSC page, so I downloaded this instead:
http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
So HISAT2 doesn't require a reference genome but STAR needs one, is that correct?
That's incorrect. You are passing HISAT2/grch38/genome to the -x option for HISAT2. The manual says that -x accepts an index prefix, and also says hisat2-build is used to generate index files. You already have prebuilt index files for HISAT2; you are now creating the equivalent for STAR using STAR --runMode genomeGenerate. For Pete's sake, read the manuals.
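For comparison, here is a rough sketch of the STAR alignment step that corresponds to the hisat2 command in the question: --genomeDir plays the role that -x plays for HISAT2, pointing at the generated index rather than at a FASTA file. The index directory, the output prefix and the run helper are placeholders of mine. As far as I know STAR has no direct counterpart to --rna-strandness at the alignment step, so it is left out here.

```shell
#!/bin/sh
# Sketch: align single-end reads with STAR against an existing STAR index.
INDEX_DIR="star_index_grch38"      # hypothetical directory produced by --runMode genomeGenerate
READS="data/demo_trimmed.fastq"    # reads file from the original post
OUT_PREFIX="STAR/demo_trimmed_"    # hypothetical output prefix

# Hypothetical helper: print the command by default; set DRY_RUN=0 to execute it.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "$@"; fi
}

star_align() {
    run mkdir -p "$(dirname "$OUT_PREFIX")"
    # STAR sorts and writes the BAM itself, so no samtools sort pipe is needed.
    run STAR --runThreadN 6 \
        --genomeDir "$INDEX_DIR" \
        --readFilesIn "$READS" \
        --outSAMtype BAM SortedByCoordinate \
        --outFileNamePrefix "$OUT_PREFIX"
}

star_align
```

With these settings the sorted alignments should end up in a file named after the prefix followed by Aligned.sortedByCoord.out.bam.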
Thank you so much for your answer!
Would you please explain why this reference genome is split into 8 files?
https://genome-idx.s3.amazonaws.com/hisat/grch38_genome.tar.gz
Also, I ran STAR on a cluster and it has been going for more than 120 minutes without finishing. HISAT2 on a personal computer took only 3 minutes, so I guess something is wrong.
I do not have the bandwidth to download a file, extract it and do a bunch of comparisons to figure out why it's been split - you can read the manual, the paper and the source code, browse forums to see if it has been addressed anywhere or even email the author. In all probability, the reference genome hasn't been split, the prepared index has 8 files.
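A quick way to see this for yourself is to list what sits behind the prefix passed to -x. This is a minimal sketch that assumes HISAT2's usual index naming of prefix.1.ht2 through prefix.8.ht2; the function name is mine.

```shell
#!/bin/sh
# Sketch: list the files that make up a HISAT2 index, given the prefix passed to -x.
# Assumes the usual naming <prefix>.1.ht2 ... <prefix>.8.ht2.
list_index_files() {
    prefix=$1
    for f in "$prefix".*.ht2; do
        # Skip the unexpanded pattern when no index files exist.
        if [ -e "$f" ]; then echo "$f"; fi
    done
}

# Usage with the prefix from the original hisat2 command:
#   list_index_files HISAT2/grch38/genome
```

If that prints eight .ht2 files, those are the parts of one index, not eight pieces of the genome sequence.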
As for STAR vs HISAT2, look into benchmarking papers and ensure you're comparing apples to apples. Also, Googling terms such as "STAR vs HISAT2" will point to past discussions such as this one: HISAT2 V.S. STAR
Thanks for a detailed answer! As you see from my screenshot, the genome is split into 8 files.
Again, no. The prepared index has 8 files. Chris, the manual is extremely clear on what's happening, read it.
From the manual: