hello everyone,
I am analyzing some small RNA sequenced with an UMI and I use STAR to map them. I want to conserve the multimappers for further analysis. here is the command :
STAR --runThreadN 24 --genomeDir /path/to/reference-genome/Indexes --readFilesIn $i --readFilesCommand - --outSAMstrandField intronMotif --outFilterMultimapNmax 5000 --outFilterScoreMinOverLread 0 --outFilterMatchNmin 18 --outFilterMatchNminOverLread 0 --outFilterMismatchNoverLmax 0.04 --alignIntronMax 1 --alignEndsType EndToEnd --outFileNamePrefix /usr/users/path/to/file/starout/${i}/
I want a perfect mapping as my small RNA of interest are between 18 and 40nt, so no soft clipping. STAR give me an output and the sam file. I then sort and indexe the sam file with samtool
for i in *.sam; do samtools view -b ${i}| samtools sort -o /starout/${i}.bam && samtools index /starout/${i}.bam; done
but the number of reads in the bam file is superior to the number of mapped reads in the star output. for example, I got a star output as
"Uniquely mapped" 63295
"Mapped to multiple loci" 788900
"Unmapped: too many mismatches" 228470
"Unmapped: too short" 28801
"Unmapped: other" 181975
For a total of 1 291 441 reads and the bam file analyzed by fastqc give me
"Unique Reads" 342757
"Duplicate Reads" 10937674
total 11 280 431 reads
Where do these reads come from? Does the bam file include all the multi-mapping positions? Am I missing something?
Then the logical explanation is that the BAM file displays all the multimapping positions which cause this increase.
Thank you for your answer and the precision on BAM file alignements.