Hi everyone,
Sorry if this question is rather philosophical, but I find it important.
As one can read in the original STAR aligner paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3530905/):
"Similarly to other RNA-seq aligners, STAR's default parameters are optimized for mammalian genomes. Other species may require significant modifications of some alignment parameters; in particular, the maximum and minimum intron sizes have to be reduced for organisms with smaller introns."
I was dealing with a pipeline I got from a collaborator that sets --alignIntronMax 500000 (500 kb). As I understand it, this value should be chosen according to the species we are working with (in this case Homo sapiens). Checking the paper https://pubmed.ncbi.nlm.nih.gov/26581719/ (Table 2), the maximum intron length reported is 1,160,411 bp; using my GFF file, R, and the GenomicFeatures package (https://support.bioconductor.org/p/103386/), I got the value 1,240,120 bp.
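For reference, I computed it along these lines (a minimal sketch based on the approach in that Bioconductor thread; "annotation.gff3" is just a placeholder for my actual file):

```r
library(GenomicFeatures)

# Build a transcript database from the annotation (placeholder file name)
txdb <- makeTxDbFromGFF("annotation.gff3", format = "gff3")

# Introns per transcript, flattened into a single GRanges object
introns <- unlist(intronsByTranscript(txdb, use.names = TRUE))

# Intron lengths in bp; the maximum is the value reported above
intron_len <- width(introns)
max(intron_len)
```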
Taking all of this into account, my main question is: should we adapt these default parameters in STAR to reflect the most recent values (~1.2 Mb), even though the default options were already optimized for mammalian genomes?
(Addition: I made a histogram of intron length frequencies using R.)
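The histogram was produced with something like the following, continuing from the sketch above (the log10 scale is a choice to keep the long tail visible):

```r
# Histogram of intron lengths on a log10 scale
hist(log10(intron_len), breaks = 100,
     xlab = "log10(intron length in bp)",
     main = "Intron length distribution")

# Mark the pipeline's current --alignIntronMax (500 kb) and the longest intron
abline(v = log10(5e5), col = "red", lty = 2)
abline(v = log10(max(intron_len)), col = "blue", lty = 2)
```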
Thanks in advance
My personal view is to never change settings in standard tools that have stood the test of time unless there is a good data-driven reason (i.e. you experience a clear problem). This histogram could be driven by a single intron of width 1,200,000 while 99.999% of the other widths are below the default threshold. Just leave it. The outlier could even be one of those obscure genes with fishy annotation that nobody knows or cares about. I would just do my analysis and move on. These strange corner cases are everywhere, at every analysis step, so it is best to ignore them unless there is a good reason not to.
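If you want to sanity-check how heavy that tail actually is, a couple of lines will do it (reusing the intron_len vector from the sketch above; the 500 kb cutoff is the value from your pipeline):

```r
# Fraction of annotated introns longer than the pipeline's 500 kb cutoff
mean(intron_len > 5e5)

# Upper tail of the intron length distribution
quantile(intron_len, probs = c(0.99, 0.999, 0.9999, 1))
```

If only a handful of introns exceed the cutoff, that supports leaving the default-style setting alone.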