Hi, I'm doing variant calling from my RNAseq data on a cell line (Paired End 2X100b, Illumina) I'm following GATK workflows and everything is working almost fine since I can find all the expected mutations already characterized in my cell line. The problem is that I see some additional mutations that are clearly the result of an alignment error (e.g. NM_001364837:exon8:c.727+1G>A, rs781065280) I'm using the latest release of STAR aligner 2.7.6a to perform the alignment with the following options
STAR --runThreadN 8 --genomeDir $GENOME_DIR \
--readFilesIn ... --readFilesCommand gunzip -c \
--outSAMattrRGline ... \
--outSAMtype BAM SortedByCoordinate \
--twopassMode Basic \
--outSAMmapqUnique 60 \
--outFilterType BySJout --alignIntronMin 20 \
--alignIntronMax 1000000 --alignMatesGapMax 1000000 \
--alignSJoverhangMin 8 --alignSJDBoverhangMin 3 --scoreGenomicLengthLog2scale 0 \
--outFileNamePrefix $OUTPUT_DIR/...
The problem resides in STAR mapping some reads with a mismatching overhang of 4 bp at beginning of the intron instead of picking a perfect match at the beginning of the next exon (see picture attached).
As you can see, --alignSJDBoverhangMin
is set to 3 and the length of the overhang is 4bp.
Also the splice junction should be canonical so a spliced mapping should carry no penalty.
Anybody has an explanation for that?