I found that one of my samples had many more reads than the others (~113 million vs 30 million in the flagstat output for the BAM files). I ran samtools markdup after sorting the .bam file by name, tagging with fixmate, then sorting by position for markdup. Below is the markdup stats output, showing that I ended up with about 33 million reads in the markdup.bam file after duplicate removal.
COMMAND:
samtools markdup -r -s Gerson-54_paired_pec.positionsort.bam Gerson-54_paired_pec.markdup.bam
READ: 113282791
WRITTEN: 33036290
EXCLUDED: 0
EXAMINED: 113282791
PAIRED: 106910760
SINGLE: 6372031
DUPLICATE PAIR: 74415464
DUPLICATE SINGLE: 5831037
DUPLICATE PAIR OPTICAL: 0
DUPLICATE SINGLE OPTICAL: 0
DUPLICATE NON PRIMARY: 0
DUPLICATE NON PRIMARY OPTICAL: 0
DUPLICATE PRIMARY TOTAL: 80246501
DUPLICATE TOTAL: 80246501
ESTIMATED_LIBRARY_SIZE: 16975904
I then sorted this markdup.bam file by name and tried to run HTseq (htseq-count -f bam -r name --stranded=yes
), but I got the following message:
Error occured when processing BAM input (record #6 in file Gerson-54_paired_pec.markdup.sorted.bam): Sequence of paired-end alignments expected, but got single-end alignment.
I see that in the markdup stats that there are singles, but I'm not sure how to deal with these as they appear to be interfering with the HTseq-count. Am I doing something wrong? Is there a way to eliminate the singe-end alignments, or is there a better way to deal with the issue?
Thanks for the help!
Thank you! Seems like a simple answer, but it means a lot for this novice!