Question

What are the best set of options to use while aligning using STAR?

5

Entering edit mode

5.5 years ago

gprashant17 ▴ 110

Hi, I have a tumour RNA-seq data, which I have aligned using STAR with a human reference genome, using the basic default options. However, I am confused with the variety of options (chimeric junctions, 2-pass mapping, readCounts, etc) available in the STAR manual and I would like to know which of these options are more suitable for doing an analysis to identify somatic mutations.

STAR RNA-Seq ngs alignment • 4.5k views

ADD COMMENT • link updated 5.5 years ago by Charles Warden 8.3k • written 5.5 years ago by gprashant17 ▴ 110

score 3 · Answer 1 · 2019-06-17

3

Entering edit mode

5.5 years ago

sandersonrbt ▴ 40

In general, I use and recommend others use STAR with the default settings unless there is a specific reason to change them. In my experience many options are great for fine-tuning your analysis to suit the particulars of your sequencing runs (trying to account for low depth/short-reads/etc) but I rarely start out with these options, rather I use them as I encounter problems. In many other cases these options are a matter of preference for how you want your files to look (sorted/bam/unmapped-output/etc). Things like 2-pass are useful options if you are trying to do a splicing analysis where you wish to identify junctions that are unannotated, but I don't think it is recommended for other analysis. If you are looking for chromosomal rearrangements in your analysis chimSegmentMin (which seems likely since you are using RNA-seq to detect mutations) would be absolutely something you would want to use, and 2-pass mode would likely improve detection of these rearrangements as they are effectively unannotated junctions. I personally prefer some of the other methods for inferring read-counts (kallisto/HTSeq) but if you want them quantified by star then use the readCounts option.

ADD COMMENT • link 5.5 years ago by sandersonrbt ▴ 40

0

Entering edit mode

Thanks, I would like to separate the mapped and unmapped reads into two different SAM/BAM files and I did not notice a way to do that in the manual. Is there any other way or including the unmapped reads within the same alignment file is the only option?

ADD REPLY • link 5.5 years ago by gprashant17 ▴ 110

1

Entering edit mode

I believe the option you would want there is --outReadsUnmapped Fastx. This will make a separate file of all the unmapped reads, albeit in fasta/fastq format (which you could potentially check for fusions using different tools/parameters). I am not sure how useful a BAM file of unaligned reads would be, as much of the information pertains to the alignment of the reads, which in this case does not exist. If you really do need those unaligned fastx files in BAM format there appear to be tools to do this, although I've no experience with them myself.

ADD REPLY • link 5.5 years ago by sandersonrbt ▴ 40

score 3 · Answer 2 · 2019-06-17

3

Entering edit mode

5.5 years ago

Charles Warden 8.3k

I have to admit that I am mostly using what has been recommended, more so than testing with several projects.

I think I actually have a similar answer as has already been provided. However, to put it a slightly different way:

I would usually run STAR with --twopassMode Basic --outSAMstrandField intronMotif. I think the 2nd parameter is supposed to help with certain programs, like cufflinks, since you don't specify strandedness for the library.

If I want to output files for gene fusion analysis, then I would add --chimSegmentMin 12 --chimJunctionOverhangMin 12 --chimOutType Junctions.

For specific projects, I have probably made other changes. For example, last time I checked, PacBio had recommendations for aligning IsoSeq CCS data. However, I think the most important (extra) parameters are --outSAMattributes All --readNameSeparator space --seedPerReadNmax 10000.

ADD COMMENT • link 5.5 years ago by Charles Warden 8.3k

1

Entering edit mode

I just tried using --chimSegmentMin 20 --chimOutType WithinBAM. Most of my reads are 2*101 base pairs long and the minimum is around 40 (for one mate). I used WithinBAM option as that would allow me to use another option --peOverlapNbases, for detecting overlapping within read pairs. As this would result in the same BAM file having details of both normal and chimeric alignments, will it make a difference while doing the analysis?

ADD REPLY • link 5.5 years ago by gprashant17 ▴ 110

1

Entering edit mode

I haven't extensively tested all the parameters for STAR.

However, it sounds like you are working with processed reads (rather than raw reads). I'm not sure how much of a difference this makes.

That said, you did specifically mention somatic variant calling. For that, I would be further processing your sample to split reads between exons (not something that you would want to do for gene expression, but this should help with false positives around junctions and soft-clipped reads).

So, with or without an extra upstream filter, the recommendation is to run SplitNCigarReads in GATK.

I haven't looked into the code, but there was recently a paper describing use of MuTect2 for RNA-Seq variant calls (which they call RNA-MuTect).

My previous experience with MuTect2 would make me want to first consider VarScan variants and/or joint germline GATK variants (and possibly re-calculate allele fractions with a .pileup file). However, it sounds like they were conservative with their strategy (I think the extra filters remove ~93% of the raw MuTect2 variants).

So, that code is here:

https://zenodo.org/record/2620062

and my thoughts from reading the paper (but without testing the code) are here:

http://cdwscience.blogspot.com/2019/06/considerations-for-somatic-mutations.html

ADD REPLY • link 5.5 years ago by Charles Warden 8.3k

1

Entering edit mode

FYI, I was just taking a look at that RNA-MuTect code.

Even though the GATK RNA-Seq Best Practices used STAR (2-pass alignment), RNA-MuTect performs a HISAT re-alignment (although I do remember them mentioning use of 2 alignments per sample, and they cite the STAR paper). I think there are also some other factors that may make running those scripts exactly a little difficult. Plus, this also make me a little confused when they describe "The other two filtering steps (oxoG and NovoAlign re-alignment) were not initially designed for RNA-seq data." in the Supplemental Materials, but I did add this note in the blog post (essentially, the methods are described differently for "1. Mutation calling pipeline" versus "3. The RNA-MuTect pipeline").

However, after you have your set of variant calls (from your STAR alignment), perhaps you can look at some of the scripts for the filtering steps (even if you have to modify them for yourself)

ADD REPLY • link 5.5 years ago by Charles Warden 8.3k