Fastq and sam number of reads
3
0
Entering edit mode
6.8 years ago
ceboral • 0

Hi! I'm having some little problems trying to figure out if the mapping of my datasets were correct. I have downloaded the datasets in SRA files, transform them into FASTQ and finally map the FASTQ to the mouse reference genome using STAR. In order to test if the mapping was correct, I used samtools. I counted the FASTQ reads by counting the lines and divide them for 4. For the SAM, I counted the different FLAGS (samtools view SRR5315260.Aligned.out.sam | cut -f2 | sort | uniq -c) in the sam file. Furthermore, I also counted the number of lines in the SAM file (wc -l). The problem comes when trying to explain than there are more FLAGS in the sam file than lines in the sam file, but there are more reads in the FASTQ file than lines in the SAM file. I expected to be the same number. Can someone explain to me if I am doing something wrong or if this is normal? Thanks!

SAM NGS FASTQ • 4.2k views
ADD COMMENT
0
Entering edit mode

SAM / BAM files may or may not include unmapped reads, and you didn't tell us which STAR parameters did you use. Also, SAM / BAM files may include multiple copies of reads mapped to multiple locations, so samtools view will count these reads multiple times.

What are the numbers you found, and what were the commands used to get to these numbers? Did you check the BAM file with samtools flagstat?

ADD REPLY
0
Entering edit mode

Thank you all of you for your answers. I have checked the Log.final.out and use samtools view -c -F 2431 SRR5315260.bam and the number of reads still don't match the number of reads in the fastqc analysis. I haven't done any filtering anywhere and the difference is quite important (around 8 million reads in datasets of 45 million reads).

ADD REPLY
0
Entering edit mode

Please use ADD REPLY/ADD COMMENT when responding to existing posts (or add new information by editing your original post) to keep threads logically organized..

ADD REPLY
1
Entering edit mode
6.8 years ago

Flags aren't meant to be unique, they indicate only information like alignment orientation. For your example, you have single-end data so you probably only have 3 different flags (forward, reverse, and unmapped). What you probably meant to do was count the unique number of reads in the file samtools view -c -F 2432 SRR5315260.bam, which will do that (the 2432 will ensure that mates/multimappers/etc. aren't counted more than once). That will presumably be the same number of entries as there are in the fastq file. If not, then you've done something wrong or done some filtering somewhere.

ADD COMMENT
1
Entering edit mode
6.8 years ago

To check the numbers between fastq and sam, you have to consider several factors. At least two come to mind: you might have multimapped reads, so you should have more lines in sam file than reads in fastq. On the other hand, you should check whether or not STAR by default outputs unmapped reads. But overall, if this analysis is not really important for you, I don't think it worth spending time on it, STAR is quite mature software. Anyway, one easy thing you can do is to run fastqc on your fastq file (this will give if number of reads), and then check number of input reads in STAR Log.final.out.

On your commands: samtools view SRR5315260.Aligned.out.sam | cut -f2 | sort | uniq -c - in context of your analysis this doesn't make sense to me. I don't think that each rad has its own unique flag.

wc -l - if you do this on sam file directly it will also count headers.

ADD COMMENT
0
Entering edit mode
6.8 years ago

Another way to count unique reads is to count how many read names appear (This assumes that the first and second reads have identical names, otherwise, do "samtools view -f 64" to only get read 1)

samtools view  SRR5315260.Aligned.out.bam | cut -f 1 | sort | uniq -c | wc -l

And as others have said, counting flags makes no sense.

ADD COMMENT

Login before adding your answer.

Traffic: 2365 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6