Question

How do tools parse the QNAME?

1

9.3 years ago

John 13k

Hey all :)

The SAM spec is pretty lenient on what is and is not allowed in a QNAME. The only really relevant part is:

QNAME: Query template NAME. Reads/segments having identical QNAME are regarded to come from the same template. A QNAME ‘*’ indicates the information is unavailable. In a SAM file, a read may occupy multiple alignment lines, when its alignment is chimeric or when multiple mappings are given.

So each template gets a unique QNAME, and the same QNAME can appear multiple times in the file (for multiple sequenced reads, secondary alignments, etc).

The problem I'm facing right now is that there's obviously a lot more to QNAMEs than just that. First off, I thought a template was not a read but the whole fragment as it entered the sequencer, so paired-end reads are two reads from the same template and should have the same QNAME. In other words, putting /1 and /2 or anything else at the ends of the QNAME to denote the read pair should go against the standard?

QNAMEs often also encode the position in the flowcell that the template came from, which some programs use to detect optical duplicates. However, I suspect that duplicate detection works exactly the same without it; those duplicates are just marked as PCR duplicates rather than optical ones (a distinction that is usually thrown away anyway).

So my question is: if I were to rename all the QNAMEs in a BAM to something unique to the template, but without the flowcell info or the /1 /2 mate info, would that cause any actual problems downstream? Are there any tools that just won't work if I rigorously follow the standard here? If so, what would a good 'fake' QNAME look like?
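To make the renaming idea concrete, here is a minimal sketch in plain Python that works on SAM text lines (for example, the output of `samtools view -h`). The function name `rename_qnames`, the `t` prefix, and the assumption that mate suffixes look like `/1` or `/2` are all illustrative choices, not anything the SAM spec or a particular tool prescribes.

```python
import re

# Assumption: mate suffixes, if present, look like "/1" or "/2" at the end.
_MATE_SUFFIX = re.compile(r"/[12]$")


def rename_qnames(sam_lines, prefix="t"):
    """Yield SAM lines (without trailing newline) with short per-template QNAMEs."""
    ids = {}  # original template name -> new name
    for line in sam_lines:
        line = line.rstrip("\n")
        if line.startswith("@"):  # header lines pass through unchanged
            yield line
            continue
        fields = line.split("\t")
        # Both mates collapse onto one template name once the suffix is gone.
        template = _MATE_SUFFIX.sub("", fields[0])
        if template not in ids:
            ids[template] = f"{prefix}{len(ids) + 1}"
        fields[0] = ids[template]
        yield "\t".join(fields)
```

The dictionary keeps one entry per template, so memory grows with the number of templates; on name-sorted input you could instead remember only the previous name. Records from the same template end up with the same new QNAME, and which mate is which has to come from FLAG.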

BAM • 3.5k views

ADD COMMENT • link updated 9.3 years ago by Devon Ryan 105k • written 9.3 years ago by John 13k

1

putting /1 and /2 or anything else at the ends of the QNAME to denote the read pair should go against the standard?

Yes, that is against the standard and will break tools. Don't do that. The /1 and /2 information belongs in FLAG (bits 0x40 and 0x80).
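As a small illustration of reading mate information from FLAG rather than from the QNAME, the sketch below decodes the relevant bits. The bit values come from the SAM spec; the helper name `mate_number` is just for this example.

```python
FLAG_PAIRED = 0x1   # template has multiple segments in sequencing
FLAG_READ1 = 0x40   # first segment in the template
FLAG_READ2 = 0x80   # last segment in the template


def mate_number(flag):
    """Return 1 or 2 for a paired read, or None if FLAG does not say."""
    if not flag & FLAG_PAIRED:
        return None
    if flag & FLAG_READ1 and not flag & FLAG_READ2:
        return 1
    if flag & FLAG_READ2 and not flag & FLAG_READ1:
        return 2
    # Both bits or neither bit set: the position in the template is not 1 or 2.
    return None
```

For example, `mate_number(99)` gives 1 and `mate_number(147)` gives 2, which are the FLAG values you typically see on a properly paired forward/reverse pair.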

ADD REPLY • link 9.3 years ago by lh3 33k

0

Awesome, hahah - I thought I was going mad for a moment, but glad the QNAMEs are (or at least should be) exactly what you said they should be when you wrote the spec :)

ADD REPLY • link 9.3 years ago by John 13k

score 3 · Accepted Answer · 2016-04-16

3

9.3 years ago

Devon Ryan 105k

Short answer: read names don't matter except for sorting (by read name, not coordinate), pairing (e.g., counting with featureCounts or htseq-count), and marking optical duplicates. Of course optical duplicates will get marked as duplicates regardless, so who really cares about that.

BTW, aligners typically strip off /1 and the like, though not always. One should generally not rely on QNAMEs for anything unless absolutely needed.

ADD COMMENT • link 9.3 years ago by Devon Ryan 105k

1

Of course optical duplicates will get marked as duplicates regardless, so who really cares about that.

This is an important consideration with patterned flowcells (to see if the duplicates are optical). I have been trying the Picard MarkDuplicates option for identifying these. With the limited number of samples I have looked at, this has not worked reliably: I get some duplicates, but none have been marked optical with the settings suggested by the GATK tutorials (unless the lab did a great job of loading the flowcells with just the right concentration of libraries).

ADD REPLY • link 9.3 years ago by GenoMax 153k

0

One method recommended by the GATK for determining sequencing efficiency is to mark duplicates per-lane, then mark duplicates again once you've merged the lanes.

It took me some time to figure out why anyone would do this, but then it "clicked": any extra duplicates marked in the second round of deduping must come solely from the PCR process. With that you should be able to estimate (although I'm not sure how you calculate it exactly) PCR duplication vs non-PCR (and therefore optical) duplication without having to deal with pixel distances, etc. I heard that the way Illumina reports pixel distances changed (or something like that), so tools like MarkDuplicates require a pixel distance threshold of either 10 or 1000 (or some other several-orders-of-magnitude difference like that), and there's no easy way to tell which you need. If you're getting literally 0 optical dupes, that could be why.
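To show why the pixel-distance threshold matters, here is a rough sketch of the kind of check an optical-duplicate finder makes. It assumes colon-delimited Illumina-style names whose last three fields are tile, x, and y (with lane just before them); the function names and the default `max_distance=100` are placeholders for illustration, and this is not Picard's actual implementation.

```python
def parse_position(qname):
    """Return (lane, tile, x, y) from a colon-delimited name, or None."""
    parts = qname.split(":")
    if len(parts) < 4:
        return None
    try:
        tile, x, y = (int(p) for p in parts[-3:])
    except ValueError:
        return None  # e.g. trailing "#0/1" debris or a renamed read
    return parts[-4], tile, x, y


def looks_optical(qname_a, qname_b, max_distance=100):
    """True if both reads sit on the same lane and tile within max_distance."""
    a, b = parse_position(qname_a), parse_position(qname_b)
    if a is None or b is None:
        return False  # no coordinates, so the pair can only be a plain duplicate
    same_tile = a[:2] == b[:2]
    close_x = abs(a[2] - b[2]) <= max_distance
    close_y = abs(a[3] - b[3]) <= max_distance
    return same_tile and close_x and close_y
```

If the coordinate scale changes by an order of magnitude and the threshold does not, `looks_optical` returns False for nearly every pair, which would match the "literally 0 optical dupes" symptom. It also returns False whenever the QNAME has no parseable coordinates, which is what you would expect after renaming reads to something without flowcell info.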

Anyway, it's no longer part of the GATK best practices to mark dupes twice, because it's not very exciting and FastQC has some pretty good metrics for that now.

ADD REPLY • link 9.3 years ago by John 13k

1

@John: This is an issue specific to patterned flowcells and is due to "pad-hopping", or contamination of nearby nanowells during ExAmp clustering. It is related to library characteristics and loading concentration. I saw that recently discussed here. Sounds like we will get a new tag that will hopefully be consumed by Picard in the near future.

ADD REPLY • link 9.3 years ago by GenoMax 153k

0

I hadn't realized that the optical duplicate rate had gotten high enough to matter on patterned flow cells. That would indeed be an issue for them then.