I am having difficulty understanding the sjdbOverhang option in STAR. This option is set when using a splice junction database. The manual defines it as "the length of the donor/acceptor sequence on each side of the junctions, ideally = (mate_length - 1)". It seems to be a very important option, because if it is set to 0 (default), the splice junction database is not used.
I don't think it's the minimal alignment length for a read spanning the junction, because there's already the option alignSJDBoverhangMin for that, which is defined as "the minimum overhang (i.e. block size) for annotated (sjdb) spliced alignments".
Is it then perhaps the expected length?
This also means that a new genome suffix array (SA) needs to be generated for every different read length to be aligned. Otherwise, the number of aligned reads may drop.
If we have data from 2 batches with different read lengths, does that mean we are supposed to map the data separately with different indexes?
As I understand it, that would be optimal, but since v2.4 STAR can also take --sjdbOverhang and the other sjdb options on the fly during alignment: https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf
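To make the on-the-fly route concrete, here is a minimal sketch that only assembles the STAR mapping command for one batch, with --sjdbOverhang derived from that batch's read length. The helper name star_mapping_command and all paths are hypothetical; the STAR flags themselves (--genomeDir, --readFilesIn, --sjdbGTFfile, --sjdbOverhang, --outFileNamePrefix) are taken from the manual. As far as I know, this works most cleanly when the index was built without annotations, since STAR may complain if the overhang differs from the one used at genome generation.

```python
import shlex


def star_mapping_command(genome_dir, reads, gtf, read_length, prefix="sample_"):
    """Build (but do not run) a STAR mapping command that inserts
    annotated junctions on the fly. Hypothetical helper for illustration."""
    overhang = read_length - 1  # manual: ideally mate_length - 1
    return [
        "STAR",
        "--runMode", "alignReads",
        "--genomeDir", genome_dir,
        "--readFilesIn", *reads,
        "--sjdbGTFfile", gtf,
        "--sjdbOverhang", str(overhang),
        "--outFileNamePrefix", prefix,
    ]


# Two batches with different read lengths, same (hypothetical) index and GTF.
for name, length in [("batchA_", 100), ("batchB_", 75)]:
    cmd = star_mapping_command(
        "star_index", [name + "R1.fastq", name + "R2.fastq"],
        "annotation.gtf", length, prefix=name,
    )
    print(shlex.join(cmd))
```

The printed commands can be pasted into a shell or passed to subprocess.run; nothing is executed here.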
This is what I found: https://biocorecrg.github.io/RNAseq_course_2019/alnpractical.html It usually equals the minimum read size minus 1. This also means that a new STAR index needs to be generated for every different read length to be aligned. Otherwise, the number of aligned reads may drop.
Thanks for the follow up.
Here's how to find read length: How do I find out the read length of a fastq file?
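If it helps, here is a rough sketch for getting the read length straight from the FASTQ and turning it into an sjdbOverhang value. It assumes standard 4-line FASTQ records (plain or gzipped) and only looks at the first max_reads records; the function names are my own, not part of STAR. It uses the longest read minus 1, which is what the STAR manual suggests for reads of varying length.

```python
import gzip
from pathlib import Path


def open_fastq(path):
    """Open a FASTQ file as text, handling .gz transparently."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def max_read_length(path, max_reads=100_000):
    """Longest sequence among the first max_reads 4-line FASTQ records."""
    longest = 0
    with open_fastq(path) as handle:
        for line_no, line in enumerate(handle):
            if line_no // 4 >= max_reads:
                break
            if line_no % 4 == 1:  # second line of each record is the sequence
                longest = max(longest, len(line.rstrip("\r\n")))
    return longest


def suggested_sjdb_overhang(*fastq_paths):
    """max(ReadLength) - 1 across all given files (e.g. both mates)."""
    return max(max_read_length(p) for p in fastq_paths) - 1
```

For example, suggested_sjdb_overhang("sample_R1.fastq.gz", "sample_R2.fastq.gz") (hypothetical file names) would give 99 for 2x100 reads.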