9.7 years ago
Ming Tommy Tang
Hi,
I am using samblaster to mark duplicates, and it requires the input reads to be sorted by read ID.
Can anyone explain what that means? I have read the SAM specification from here:
Dad's data:
@RG ID:FLOWCELL1.LANE1 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
@RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
@RG ID:FLOWCELL1.LANE3 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400
@RG ID:FLOWCELL1.LANE4 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400
Mom's data:
@RG ID:FLOWCELL1.LANE5 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200
@RG ID:FLOWCELL1.LANE6 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200
@RG ID:FLOWCELL1.LANE7 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400
@RG ID:FLOWCELL1.LANE8 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400
Kid's data:
@RG ID:FLOWCELL2.LANE1 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200
@RG ID:FLOWCELL2.LANE2 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200
@RG ID:FLOWCELL2.LANE3 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400
@RG ID:FLOWCELL2.LANE4 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400
The @RG ID identifies reads from a specific lane, and SM is the sample name. So, what is the read ID? I am a bit confused. My BAM file contains only one SM and one ID.
Thank you!
Ming
Posting as a comment because I'm not entirely sure it's correct...
From the context in the samblaster documentation, I suspect that "read-id" is what would normally be called "query name" or "read name" in the spec. In other words, use samtools sort -n. That would also make sense given that it explicitly mentions that "read-id" sorting is what aligners produce.
Thanks for your reply. My BAM files were sorted by coordinate, so I might have to sort them by name. I know HTSeq requires BAM files to be sorted by name (-n), but I am not sure whether samblaster has the same requirement.
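One way to see how a file is currently sorted is to look at the SO tag on the @HD header line, which the SAM spec defines with the values unknown, unsorted, queryname and coordinate. Below is a minimal sketch of that check, assuming the header actually has an @HD line with an SO tag (not every file does). The header text here is made up for illustration. On a real file it would come from samtools view -H on your BAM.

```shell
#!/bin/sh
# Hypothetical SAM header for illustration only; on a real file this text
# would come from: samtools view -H your.bam
header=$(printf '@HD\tVN:1.6\tSO:coordinate\n@RG\tID:FLOWCELL1.LANE1\tSM:DAD\n')

# Pull the value of the SO (sort order) tag from the @HD line.
sort_order=$(printf '%s\n' "$header" | awk -F'\t' '
  $1 == "@HD" {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^SO:/) print substr($i, 4)
  }')

echo "sort order: $sort_order"

# "queryname" is what name sorting (samtools sort -n) records in the header.
if [ "$sort_order" != "queryname" ]; then
  echo "not name sorted; consider: samtools sort -n"
fi
```

If this reports coordinate, as it does for the made-up header above, the file would need to be re-sorted by name before any tool that expects reads grouped by query name.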