Question

STAR aligner options

0

Entering edit mode

12 months ago

theophile • 0

Hi all,

I know the question of using multiple reference genomes with STAR has been asked and answered several times, e.g. here https://groups.google.com/g/rna-star/c/HnQ3obe9Hqk?pli=1 and there https://groups.google.com/g/rna-star/c/ChvUy8jG6ts

For my analysis, I am using a custom genome made of the combination of human (hg38) and rat (rn6) genomes. In my downstream analysis, I see a human gene that appears as differentially expressed between a condition “human and rattus cells are physically separated” and a condition “human and rattus cells are cultivated together”. This gene has a function that is related to my experiment. However, one thing that struck me is that this gene has a 0 read count in the “separated cells” condition. When going back to the read alignment, extracting the reads mapping to the human version of this gene, and mapping them to a reference alignment (rat and human versions aligned using MAFFT, reads aligned to this alignment using MAFFT –addfragment option), it is clear that these reads come from the rat genome.

I used option --outFilterMultimapNmax 1 during the alignment step with STAR v. 2.7.11a, everything else being default options. I expected STAR to output the best-matching alignment, but I am now not quite sure it is the correct settings for that. In fact, the results that I describe above show clearly that this is not the case. Should I output more alignments using higher --outFilterMultimapNmaxand --winAnchorMultimapNmax as suggested here: https://groups.google.com/g/rna-star/c/HnQ3obe9Hqk?pli=1 and filter output BAM/SAM file?

Cheers,

Theo

STAR • 1.3k views

ADD COMMENT • link 12 months ago by theophile • 0

0

Entering edit mode

I assume this is bulk? If you only need quantification you would be better off using a method that estimates transcript abundances such as Salmon or Kallisto. These will generally resolve multimappers across species more definitively via their EM approach.

ADD REPLY • link 12 months ago by rpolicastro 13k

0

Entering edit mode

Thanks for your answer!

This is indeed bulk-RNA. Also, thanks for the suggestion of pivoting to transcript abundance, but I would rather keep it gene-level (for sake of comparison with previous results ...).

I can providesome more details. For example, I have the following read:

@read1
CGCAAAATCCAACATGAAAATCCTCGTGGCGGTGGCGGTCTTTTTTCTCGTTTCCACTCAACTGTTTGCAGAGGG
+
AEEEEEEEEEEEEAEEEEEAE6EEEAEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA

This is the BAM output by STAR (options --winAnchorMultimapNmax 200 --outFilterMultimapNmax 2 --outFilterMultimapScoreRange 10 --seedSearchStartLmax 15 --seedSearchStartLmaxOverLread 0.2 --outFilterMismatchNmax 10 --outFilterMismatchNoverLmax 0.1 --outSAMattributes All --alignSJDBoverhangMin 2 --seedSplitMin 9 to increase sensitivity):

read1      0       Homo_7  97732192        3       4M408N68M3S     *       0       0       CGCAAAATCCAACATGAAAATCCTCGTGGCGGTGGCGGTCTTTTTTCTCGTTTCCACTCAACTGTTTGCAGAGGG        AEEEEEEEEEEEEAEEEEEAE6EEEAEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA     NH:i:2  HI:i:1  AS:i:60 nM:i:6  NM:i:6  MD:Z:30C0T4A11T2C8G11   jM:B:c,21       jI:B:i,97732196,97732603
read1      256     Rattus_4        35679706        3       15S36M1I22M1S   *       0       0       CGCAAAATCCAACATGAAAATCCTCGTGGCGGTGGCGGTCTTTTTTCTCGTTTCCACTCAACTGTTTGCAGAGGG        AEEEEEEEEEEEEAEEEEEAE6EEEAEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA     NH:i:2  HI:i:2  AS:i:53 nM:i:0  NM:i:1  MD:Z:58 jM:B:c,-1       jI:B:i,-1

And this is what I observe when I manually map read1 against transcripts NM_003182.3 (Homo sequence) and NM_012666.2 (Rattus sequence): Alignment of a read to human and rattus transcripts of gene TAC1

So, it seems that STAR gives a higher score to the alignment to the human version than to the rat version, but I cannot understand why. Indeed, the alignment to the rat version has much less mismatches than to the human version. I do not understand why the score for the rat alignment is lower than the score to the human version, except if the alignment is shorter, but my manual alignment shows that this should not be the case.

Do you have any clue of what I am missing?

ADD REPLY • link 12 months ago by theophile • 0

0

Entering edit mode

It was a comment, not an answer. Similarly, you're replying to rpolicastro's comment, so please don't add answers - use Add Reply (or Add Comment when appropriate). I've moved your post to the right location but please be more mindful in the future.

ADD REPLY • link 12 months ago by Ram 45k

0

Entering edit mode

Sorry, I was a bit confused by the layout, thanks for editing and clarifying!

ADD REPLY • link 12 months ago by theophile • 0

0

Entering edit mode

What is with all these options? Your seed search options are splitting your 75-bp read into chunks of 15, forcing the first 15-bp chunk (a fraction of which does not overlap Rattus_4 at all) to be completely soft-clipped.

ADD REPLY • link 12 months ago by dsull ★ 7.3k

0

Entering edit mode

I searched the different options to increase the sensitivity. As per --seedSearchStartLmax, I found several answer from Alexander Dobin on rthe rna-star google group indicating that reducing values of --seedSearchStartLmax and/or --seedSearchStartLmaxOverLread would achieve what I was looking for, e.g. https://groups.google.com/g/rna-star/c/_SXz370eyVA or https://groups.google.com/g/rna-star/c/E2eevlGWFbQ.

I do not understand how this would be related to soft-clipping, could you please explain?

ADD REPLY • link 12 months ago by theophile • 0

0

Entering edit mode

In that first link, Dr. Dobin states how such configurations will "split" the reads into pieces -- exactly what I alluded to in my comment.

If you have a 15-bp chunk at the 5' end of your read that cannot be aligned, it will be trimmed off (i.e. soft-clipped). You can see this for yourself in that CIGAR string: 15S36M1I22M1S (15S = soft-clip the first 15 bases). Even the human alignment is problematic for that chunk: it aligned found 4 bases somewhere random to align to and then skips over 408 bases.

I recommend not toying around with settings too much unless you know what you're doing. If you show all BAM alignments for that read under default STAR settings, I can try to help you figure out what's going on.

ADD REPLY • link 12 months ago by dsull ★ 7.3k

0

Entering edit mode

I have finally found out the reason behind this behavior.

As pointed out by dsull, the beginning of the alignment of read1 to the rattus genome is soft-clipped. My Rattus genome is assembly version 7.2 (downloaded from Ensembl), with annotations from Ensembl release 110. When looking at TAC1 in this release, it appears there is an annotation error for this gene, which is considered a pseudo-gene and is truncated at its 5' end, hence the bad alignment and soft clipping. The sequence is also different from NM_012666 at various places, resulting in better alignment scores to the human ortholog than to the rat one.

ADD REPLY • link 12 months ago by theophile • 0