Hi,
I do RNASeq data processing, performed trimming and did an alignment with BWA. My BAM is then sorted by coordinates.
The lab told me they had to increase the PCR cycles and I wanted to see how many duplicates there were, and eliminate them if necessary. I used the command:
"java -Xmx4g -jar $PICARD MarkDuplicates I={input.bam_sort} O={output.final_bam_MARKUP} M={output.metric} ASO=coordinate && samtools flagstat {output.final_bam_MARKUP} > {output.flag} "
and used samtools flagstat and got the file below and the duplicate part is huge.
174209662 + 0 in total (QC-passed reads + QC-failed reads)
3114348 + 0 secondary
0 + 0 supplementary
59374462 + 0 duplicates
112681507 + 0 mapped (64.68% : N/A)
171095314 + 0 paired in sequencing
85547657 + 0 read1
85547657 + 0 read2
94945264 + 0 properly paired (55.49% : N/A)
100496082 + 0 with itself and mate mapped
9071077 + 0 singletons (5.30% : N/A)
5415164 + 0 with mate mapped to a different chr
4365520 + 0 with mate mapped to a different chr (mapQ>=5)
I then wanted to filter with the samtools view command to keep only the proper pairs:
" samtools index -@ {threads} {input.final_bam_MARKUP} && samtools view -q 20 -f 2 -F 3840 --threads {threads} -b {input.final_bam_MARKUP} -> {output.final_bam} && samtools flagstat {output.final_bam} > {output.flag2} "
but it crashes and doesn't want to filter me and displays it this error:
[main_samview] region "-" specifies an invalid region or unknown reference. Continue anyway.
I tried with samtools markdup with this method but it gave me the same error than with picard, then I filtered before the proper pairs and I eliminated the duplicates directly with samtools markup, it works in this sense and gives me this flagstat:
Flagstat flitered for proper pairs and duplicates markup:
91372054 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
48231215 + 0 duplicates
91372054 + 0 mapped (100.00% : N/A)
91071666 + 0 paired in sequencing
45535833 + 0 read1
45535833 + 0 read2
91071666 + 0 properly paired (100.00% : N/A)
91071666 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Flagstat samtools markup removes duplicates in propers pairs:
43140839 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
43140839 + 0 mapped (100.00% : N/A)
43119758 + 0 paired in sequencing
21559879 + 0 read1
21559879 + 0 read2
43119758 + 0 properly paired (100.00% : N/A)
43119758 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
However, when I want to make my counts with htseq on this BAM filtered, the count is not done because it seems that my gff file does not correspond to my sam/bam. But when I use htseq on a BAM without picard, so without marking duplicates everything works.
I wonder if it could be due to the RG of my BAM which is not regulatory, or is it something else. I confess I don't understand.
I hope you can help me.
Thank you in advance,
Have a nice day,
Aka
I imagine you wanted
Oh! I am sorry I was sure that the error was in the BAM file ... Thank you very much it works!
It's a naive question but -> and > is not the same things?
>
is redirection ,->
is nothing known to me in bash....