EDIT: problem seems to be that 88% of reads seem to be 'too short' but not sure how to resolve this -- more details in my comment below.
Hi all, I've been trying to re-map some RNA-seq fasta files to mm39 using STAR. I was told by the sequencing facility who ran the requencing for me that when they mapped the reads onto mm10 using BWA MEM their mapping frequency for each file was around 95%. However when I do the alignment, my unique reads frequency is a little under 10%. Judging by the mapping frequency achieved using BWA MEM I doubt that rRNA or other contamination is a problem. Do you have any suggestions as to why I'm coming up against this problem? Could I be using incorrect genome/annotation files for the genome indexing?
For reference, I used the following code to generate my genome index (with genome and annotation files from Ensembl):
STAR --runMode genomeGenerate --genomeDir mm39starindex/index/ \
--genomeFastaFiles /fullpath/Mus_musculus.GRCm39.dna.primary_assembly.fasta \
--sjdbGTFfile /fullpath/Mus_musculus.GRCm39.104.gtf \
--runThreadN 2
And the format of the code I'm using for the alignment:
#!/bin/bash
#SBATCH --job-name=star
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --time=0-64:00:00
#SBATCH --output=outfile.%j
#SBATCH --error=errfile.%j
module load STAR/2.7.4
STAR --runThreadN 16 --genomeDir mm39starindex/mm39starindex/index \
--readFilesIn filename_1.fastq.gz filename_2.fastq.gz --readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate --quantMode GeneCounts \
--outFileNamePrefix alignments/filename
Any advice will be much appreciated!
fastqc your two fastq files before mapping might be helpful.
UPDATE: I extracted 25,000 reads from one of my fastq files and ran STAR just on that. I examined the output log and got the following:
According to this, 88% of my reads are not mapping because they are 'too short' (no clue how short that is). However, as far as I can understand STAR doesn't have a minimum requirement for read length. Nonetheless, I ran it again with
--outFilterMatchNmin 20
, and I got pretty much the same proportions of mapped vs. unmapped reads. This is confusing because my average input read length is 151 bases, which is very close to the average mapped length of 148 bases, which makes me think that the majority of reads are, in fact, of that approximate length, and should therefore be mappable. Again, any ideas are very welcome! Thanks :)Can you verify the following is not happening with your data? (LINK)
i.e. your R1/R2 reads are possibly not in sync. This could happen if you trimmed them individually.
Also see: https://groups.google.com/g/rna-star/c/jmszy_UjyKY/m/2drCEkG3CQAJ
"Too short" just means they didn't map. It has nothing to do with read length.
Thank you -- is there any reason that BWA MEM would be so good at mapping the fastq files (90+%) whereas STAR would perform much worse on the same files?
Well, are you sure that it's not paired reads getting out of sync? You've confirmed that you get the same bad results when aligning one end at a time?
Hi all, thank you for your many helpful suggestions. I have not been able to test some of these out yet as the server I have been using to run the alignments is currently down. Hopefully, though, it should be back up soon and I will be able to confirm my working hypothesis, which is the following: I re-downloaded the full mm39 genome assembly from Ensembl, and discovered that the file was about 30 times as large as the one I previously used to generate a STAR genome index. I re-generated an index with the new (apparently intact) file, and this time round it took me much longer (25 minutes) than it had previously taken me. Based on that, I am assuming that I used a partial/corrupted/incomplete genome assembly file to generate an index, which meant that most of my reads weren't aligning. I will test this out as soon as I can by attempting to re-map my fastq files with the new index and confirm whether mapping rates this time round are higher.
Honestly though I have no idea how the genome assembly file "shrank" -- I downloaded it a couple of weeks ago and other than copying it to my institutional server to run the alignment there, to my knowledge I didn't otherwise manipulate it.
You likely downloaded the
top level
assembly which includes haplotypes etc. For most analyses one does not need this (not for RNAseq anyway). You should downloadprimary
assembly which includes the main chromosomes and re-index this. You will find that this would be the correct size ~2.7G.Thank you very much -- as far as I could tell, the primary file was the one I went for initially, but I will keep this in mind for the future.