Question

[STAR] Bam file show more reads than the mapping log file after mapping

0

Entering edit mode

2.6 years ago

Caconym • 0

hello everyone,

I am analyzing some small RNA sequenced with an UMI and I use STAR to map them. I want to conserve the multimappers for further analysis. here is the command :

STAR --runThreadN 24 --genomeDir /path/to/reference-genome/Indexes --readFilesIn $i --readFilesCommand - --outSAMstrandField intronMotif --outFilterMultimapNmax 5000 --outFilterScoreMinOverLread 0 --outFilterMatchNmin 18 --outFilterMatchNminOverLread 0 --outFilterMismatchNoverLmax 0.04 --alignIntronMax 1 --alignEndsType EndToEnd --outFileNamePrefix /usr/users/path/to/file/starout/${i}/

I want a perfect mapping as my small RNA of interest are between 18 and 40nt, so no soft clipping. STAR give me an output and the sam file. I then sort and indexe the sam file with samtool

for i in *.sam; do samtools view -b ${i}| samtools sort -o /starout/${i}.bam && samtools index /starout/${i}.bam; done

but the number of reads in the bam file is superior to the number of mapped reads in the star output. for example, I got a star output as

"Uniquely mapped" 63295

"Mapped to multiple loci" 788900

"Unmapped: too many mismatches" 228470

"Unmapped: too short" 28801

"Unmapped: other" 181975

For a total of 1 291 441 reads and the bam file analyzed by fastqc give me

"Unique Reads" 342757

"Duplicate Reads" 10937674

total 11 280 431 reads

Where do these reads come from? Does the bam file include all the multi-mapping positions? Am I missing something?

mapping scRNAseq STAR smallRNA • 689 views

ADD COMMENT • link 2.6 years ago by Caconym • 0

score 1 · Answer 1 · 2022-04-13

1

Entering edit mode

2.6 years ago

Istvan Albert 102k

don't' worry about it,

it is probably just counting things differently, when a read partially matches to two locations, is that 1 alignment or 2, depends on how you report that.

note how the information above keeps talking about "reads" but bam files are not about reads,

BAM files represent alignments not reads

so it is just the overall sloppiness of bioinformatics lingo of reusing words with not the same meaning

ADD COMMENT • link 2.6 years ago by Istvan Albert 102k

0

Entering edit mode

Then the logical explanation is that the BAM file displays all the multimapping positions which cause this increase.

Thank you for your answer and the precision on BAM file alignements.

ADD REPLY • link 2.6 years ago by Caconym • 0