I have a total of 300 million reads in my R1 and R2 raw fastq files.
How come fastQC count about 50% duplicates in each .fastq files, but after alignment with STAR (default parameters), none of the reads have the 1024 flag (PCR or optical duplicates) when doing samtools view -c -f 1024 mydata.bam?
Because STAR with default settings doesn't mark duplicates, you have to use --bamRemoveDuplicatesType UniqueIdentical to have duplicates marked (or --bamRemoveDuplicatesType UniqueIdenticalNotMulti).
edit: also, to elaborate on the point genomax makes:
STAR (if you choose to mark duplicates with the correct parameter, which apparently you didn't) and FastQC use very different methods of detecting duplicates, so it shouldn't be surprising if the number os estimated duplicates is different. The documentation for FastQC states:
To cut down on the memory requirements for this module only sequences which first appear in the first 100,000 sequences in each file are analysed
And also:
Because the duplication detection requires an exact sequence match over the whole length of the sequence, any reads over 75bp in length are truncated to 50bp for the purposes of this analysis.
Whereas STAR uses mapping position:
UniqueIdentical
mark all multimappers, and duplicate unique mappers.
Thanks for your very complete answer !
I actually did use --bamRemoveDuplicatesType UniqueIdentical, but my version of STAR (2.5.1b) does not tag duplicates in the same time as aligning fastq reads.
I needed to (1) align the .fastq reads; (2) tag duplicates using star --runMode inputAlignmentsFromBAM --inputBAMfile input.bam --bamRemoveDuplicatesType UniqueIdentical.
You are correct, and current version (2.7.1a) works the same as the version you are using (2.5.1b). The pdf manual is incomplete and doesn't explain this, but the command-line help does:
### BAM processing
bamRemoveDuplicatesType -
string: mark duplicates in the BAM file, for now only works with (i) sorted BAM fed with inputBAMfile, and (ii) for paired-end alignments only
- ... no duplicate removal/marking
UniqueIdentical ... mark all multimappers, and duplicate unique mappers. The coordinates, FLAG, CIGAR must be identical
UniqueIdenticalNotMulti ... mark duplicate unique mappers but not multimappers.
In addition to the points made by @h.mon and @genomax, another point is that if I remember correctly, FastQC only gives you the single-ended duplication rate - that is from a given fastq file, how many reads have exactly the same sequence.
However, this is not how tools that function on alignments (like STAR's --bamRemoveDuplicatesType UniqueIdentical, or Picard's MarkDuplicates or samtools's markdup) function. They look at a pair of reads. If they have the same outer alignment co-ordinates, they are marked as duplicates. Thus in the following pairs:
fastqc would mark everything as a duplicate because (R1=R3, R2=R6, R5=R7 and R4=R8). This assumes no sequencing errors.
samtools/Picard/STAR would mark nothing as a duplicate because no pair has identical coordinates at both ends.
Thanks for your explanation. It's a bit out of the topic here, but I am just curious. According to you, if one have single-end reads, fastqc and samtools / picard / STAR should output the same number of duplicates then?
Well almost. FastQC tags on having the same sequence. samtools/picard/STAR tags on having the same alignment coordinates. The two might not be equivalent in cases where you have a sequencing error in a read that leads to it having a different sequence, but still aligning to the same location.
FastQC does not look at your entire dataset for some of the parameters it is testing. That would be memory/time consuming. It samples your data as it goes along.
So that 50% number is an estimate. It may vary depending how reads are distributed in your data.
Thanks for your very complete answer ! I actually did use
--bamRemoveDuplicatesType UniqueIdentical
, but my version of STAR (2.5.1b) does not tag duplicates in the same time as aligning fastq reads. I needed to (1) align the .fastq reads; (2) tag duplicates usingstar --runMode inputAlignmentsFromBAM --inputBAMfile input.bam --bamRemoveDuplicatesType UniqueIdentical
.You are correct, and current version (2.7.1a) works the same as the version you are using (2.5.1b). The pdf manual is incomplete and doesn't explain this, but the command-line help does: