Hi all,
I know the question of using multiple reference genomes with STAR has been asked and answered several times, e.g. here https://groups.google.com/g/rna-star/c/HnQ3obe9Hqk?pli=1 and there https://groups.google.com/g/rna-star/c/ChvUy8jG6ts
For my analysis, I am using a custom genome made of the combination of human (hg38) and rat (rn6) genomes. In my downstream analysis, I see a human gene that appears as differentially expressed between a condition “human and rattus cells are physically separated” and a condition “human and rattus cells are cultivated together”. This gene has a function that is related to my experiment. However, one thing that struck me is that this gene has a 0 read count in the “separated cells” condition. When going back to the read alignment, extracting the reads mapping to the human version of this gene, and mapping them to a reference alignment (rat and human versions aligned using MAFFT, reads aligned to this alignment using MAFFT –addfragment option), it is clear that these reads come from the rat genome.
I used option --outFilterMultimapNmax 1
during the alignment step with STAR v. 2.7.11a, everything else being default options. I expected STAR to output the best-matching alignment, but I am now not quite sure it is the correct settings for that. In fact, the results that I describe above show clearly that this is not the case. Should I output more alignments using higher --outFilterMultimapNmax
and --winAnchorMultimapNmax
as suggested here: https://groups.google.com/g/rna-star/c/HnQ3obe9Hqk?pli=1 and filter output BAM/SAM file?
Cheers,
Theo
I assume this is bulk? If you only need quantification you would be better off using a method that estimates transcript abundances such as Salmon or Kallisto. These will generally resolve multimappers across species more definitively via their EM approach.
Thanks for your answer!
This is indeed bulk-RNA. Also, thanks for the suggestion of pivoting to transcript abundance, but I would rather keep it gene-level (for sake of comparison with previous results ...).
I can providesome more details. For example, I have the following read:
This is the BAM output by STAR (options
--winAnchorMultimapNmax 200 --outFilterMultimapNmax 2 --outFilterMultimapScoreRange 10 --seedSearchStartLmax 15 --seedSearchStartLmaxOverLread 0.2 --outFilterMismatchNmax 10 --outFilterMismatchNoverLmax 0.1 --outSAMattributes All --alignSJDBoverhangMin 2 --seedSplitMin 9
to increase sensitivity):And this is what I observe when I manually map read1 against transcripts NM_003182.3 (Homo sequence) and NM_012666.2 (Rattus sequence):
So, it seems that STAR gives a higher score to the alignment to the human version than to the rat version, but I cannot understand why. Indeed, the alignment to the rat version has much less mismatches than to the human version. I do not understand why the score for the rat alignment is lower than the score to the human version, except if the alignment is shorter, but my manual alignment shows that this should not be the case.
Do you have any clue of what I am missing?
It was a comment, not an answer. Similarly, you're replying to rpolicastro's comment, so please don't add answers - use
Add Reply
(orAdd Comment
when appropriate). I've moved your post to the right location but please be more mindful in the future.Sorry, I was a bit confused by the layout, thanks for editing and clarifying!
What is with all these options? Your seed search options are splitting your 75-bp read into chunks of 15, forcing the first 15-bp chunk (a fraction of which does not overlap Rattus_4 at all) to be completely soft-clipped.
I searched the different options to increase the sensitivity. As per
--seedSearchStartLmax
, I found several answer from Alexander Dobin on rthe rna-star google group indicating that reducing values of--seedSearchStartLmax
and/or--seedSearchStartLmaxOverLread
would achieve what I was looking for, e.g. https://groups.google.com/g/rna-star/c/_SXz370eyVA or https://groups.google.com/g/rna-star/c/E2eevlGWFbQ.I do not understand how this would be related to soft-clipping, could you please explain?
In that first link, Dr. Dobin states how such configurations will "split" the reads into pieces -- exactly what I alluded to in my comment.
If you have a 15-bp chunk at the 5' end of your read that cannot be aligned, it will be trimmed off (i.e. soft-clipped). You can see this for yourself in that CIGAR string: 15S36M1I22M1S (15S = soft-clip the first 15 bases). Even the human alignment is problematic for that chunk: it aligned found 4 bases somewhere random to align to and then skips over 408 bases.
I recommend not toying around with settings too much unless you know what you're doing. If you show all BAM alignments for that read under default STAR settings, I can try to help you figure out what's going on.
I have finally found out the reason behind this behavior.
As pointed out by dsull, the beginning of the alignment of read1 to the rattus genome is soft-clipped. My Rattus genome is assembly version 7.2 (downloaded from Ensembl), with annotations from Ensembl release 110. When looking at TAC1 in this release, it appears there is an annotation error for this gene, which is considered a pseudo-gene and is truncated at its 5' end, hence the bad alignment and soft clipping. The sequence is also different from NM_012666 at various places, resulting in better alignment scores to the human ortholog than to the rat one.