Hi! I'm having some trouble figuring out whether the mapping of my datasets was correct. I downloaded the datasets as SRA files, converted them to FASTQ, and mapped the FASTQ files to the mouse reference genome using STAR. To check the mapping, I used samtools. I counted the FASTQ reads by counting the lines and dividing by 4. For the SAM file, I counted the different FLAGs (samtools view SRR5315260.Aligned.out.sam | cut -f2 | sort | uniq -c) and also the number of lines (wc -l). The problem is that I can't explain why there are more FLAGs in the SAM file than lines in the SAM file, while there are more reads in the FASTQ file than lines in the SAM file. I expected the numbers to be the same. Can someone explain whether I am doing something wrong or whether this is normal? Thanks!
SAM/BAM files may or may not include unmapped reads, and you didn't tell us which STAR parameters you used. Also, SAM/BAM files may include multiple records for reads mapped to multiple locations, so samtools view will count these reads multiple times.
What are the numbers you found, and what commands did you use to get them? Did you check the BAM file with samtools flagstat?
Thank you all for your answers. I have checked the Log.final.out and used samtools view -c -F 2431 SRR5315260.bam, and the number of reads still doesn't match the number of reads in the FastQC analysis. I haven't done any filtering anywhere, and the difference is substantial (around 8 million reads in datasets of 45 million reads).
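To make the flag arithmetic above concrete, here is a minimal Python sketch, assuming single-end reads and a plain-text SAM. The function names (decode_flag, count_reads) and the toy records are made up for illustration; the bit meanings come from the SAM specification. It shows that the mask 2431 = 1 + 2 + 4 + 8 + 16 + 32 + 64 + 256 + 2048 also contains bit 16 (read on the reverse strand), so -F 2431 drops reverse-strand alignments as well as unmapped, secondary, and supplementary records. That may account for part of the discrepancy, though it cannot be confirmed without the actual counts.

```python
# SAM FLAG bits as defined in the SAM specification.
SAM_FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG or in a -f/-F mask."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


def count_reads(sam_lines, exclude_mask=0x4 | 0x100 | 0x800):
    """Count distinct read names among records with no exclude_mask bit set.

    The default mask (2308) skips unmapped, secondary, and supplementary
    records. For paired-end data both mates share a name, so the result
    would count fragments rather than individual reads.
    """
    names = set()
    for line in sam_lines:
        if line.startswith("@"):  # header lines are not alignment records
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & exclude_mask:  # column 2 is the FLAG
            continue
        names.add(fields[0])  # column 1 is the read name
    return len(names)


# Hypothetical toy SAM: r2 maps to the reverse strand, r3 is a multi-mapper
# with one primary record (FLAG 0) and one secondary record (FLAG 256).
toy_sam = [
    "@HD\tVN:1.4",
    "r1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t200\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr2\t300\t3\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t256\tchr3\t400\t3\t4M\t*\t0\t0\tACGT\tIIII",
]

print(decode_flag(2431))                       # includes "read reverse strand"
print(count_reads(toy_sam))                    # 3
print(count_reads(toy_sam, exclude_mask=2431)) # 2 (r2 is dropped)
```

The default mask 2308 matches what samtools view -c -F 2308 would exclude. Note also that STAR by default does not write unmapped reads to the alignment file (--outSAMunmapped None), so the mapped count is expected to be lower than the FASTQ read count. The "Number of input reads" line in Log.final.out is the figure to compare against the FASTQ.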
Please use ADD REPLY/ADD COMMENT when responding to existing posts (or add new information by editing your original post) to keep threads logically organized.