Hey all :)
The SAM spec is pretty lenient on what is and is not allowed in a QNAME. The only really relevant part is:
QNAME: Query template NAME. Reads/segments having identical QNAME are regarded to come from the same template. A QNAME ‘*’ indicates the information is unavailable. In a SAM file, a read may occupy multiple alignment lines, when its alignment is chimeric or when multiple mappings are given.
So each template gets a unique QNAME, and the same QNAME can appear multiple times in the file (for multiple sequenced reads, secondary alignments, etc.).
The problem I'm facing right now is that there's obviously a lot more to QNAMEs than just that. First off, I thought a template was not a read but the whole fragment as it entered the sequencer. Paired-end reads are then two reads from the same template, so they should have the same QNAME. In other words, putting /1 and /2 or anything else at the end of the QNAME to denote the read pair should go against the standard?
QNAMEs often also encode the position in the flowcell that the template came from, which some programs use to detect optical duplicates. However, I suspect that duplicate detection works exactly the same without it; those duplicates would just be marked as PCR duplicates rather than optical ones (a distinction that is usually thrown away anyway).
So my question is: if I were to rename all the QNAMEs in a BAM to something unique to the template, but without the flowcell info or the /1 /2 mate info, would that cause any actual problems downstream? Are there any tools that just won't work if I rigorously follow the standard here? If so, what would a good 'fake' QNAME look like?
Yes, putting /1 and /2 on the QNAME is against the standard and will break tools. Don't do that. The mate information goes in FLAG (bits 0x40 and 0x80), not in the name.
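To make the "/1 and /2 go to FLAG" point concrete, here is a minimal Python sketch of moving a trailing mate suffix out of the QNAME and into the FLAG bits. The function name split_mate_suffix and the regex are made up for illustration and are not part of any library; the bit values (0x1 paired, 0x40 first segment, 0x80 last segment) are the ones defined in the SAM spec. It assumes the suffix is exactly /1 or /2 at the very end of the name.

```python
import re

# FLAG bits as defined in the SAM specification.
FLAG_PAIRED = 0x1   # template has multiple segments
FLAG_READ1 = 0x40   # first segment in the template
FLAG_READ2 = 0x80   # last segment in the template

# Hypothetical pattern: a literal "/1" or "/2" at the end of the name.
_MATE_SUFFIX = re.compile(r"/([12])$")


def split_mate_suffix(qname, flag=0):
    """Return (qname without /1 or /2, flag with the mate bits set).

    Names without a recognised suffix are returned unchanged.
    """
    match = _MATE_SUFFIX.search(qname)
    if match is None:
        return qname, flag
    mate_bit = FLAG_READ1 if match.group(1) == "1" else FLAG_READ2
    return qname[:match.start()], flag | FLAG_PAIRED | mate_bit


print(split_mate_suffix("read42/1"))
print(split_mate_suffix("read42/2"))
```

After this, both mates share the QNAME read42 and are told apart only by FLAG, which is what the spec describes.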
Awesome, hahah - I thought I was going mad for a moment, but glad the QNAMEs are (or at least should be) exactly what you said they should be when you wrote the spec :)