Question

a VERY BASIC question about "add or replace groups" for BAM files

1

9.8 years ago

CrazyB ▴ 280

Q: What should I put in the read groups in my BAM files?

Yes, I've read about read groups on Biostars (e.g. Picard provides a very useful tool to add or replace read groups). But I think I've missed something fundamental, because I still don't understand what exactly they are, where I can find them, and, if I cannot find them, what I should do so that downstream analyses can proceed. (I guess the answer is to add "some" read groups, but what exactly should I add?)

From what I've found, ID, SM, PL and LB seem to be the important read group tags (for GATK at least). If my BAM files don't have them and I am to add them, can I just assign dummy names to each? PL probably needs to be specific, e.g. illumina, solid, or others, but does it matter whether the value is all lowercase or all caps? What about the other tags?

For example, if I have only one BAM file to add/replace the read groups in, could I simply assign "A", "B", "illumina" and "D" for ID, SM, PL and LB respectively?

And if I have two BAM files, could I simply assign "A1", "B1", "illumina", "D1" for file 1 and "A2", "B2", "illumina", "D2" for file 2?

I found that the GATK forum mentions that dummy info is okay, so would A, B, C, D as in the examples above be fine? And what exactly is the purpose of these read groups? If they are so essential, why aren't they incorporated by default in the early steps (or even the first step, e.g. from FASTQ) of NGS data processing?

Any input on any of the issues in this question will be greatly appreciated. Thank you.

read-group picard • 4.4k views

updated 2.6 years ago by Ram 44k • written 9.8 years ago by CrazyB ▴ 280

1

9.8 years ago

Pierre Lindenbaum 164k

Q: What should I put in the read groups in my BAM files?

Groups are used when calling variants: the group/sample name is used by the callers to label the genotype column(s).
Groups can be used for QC: "how many reads for this lane/center/sample/etc.?"
Groups are used by Picard to remove optical duplicates.
(...)
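
As a rough illustration of the QC use above, here is a minimal Python sketch that counts alignment records per read group by looking for the `RG:Z:` tag in SAM-formatted text lines. The function name and the example records are made up for illustration; a real pipeline would more likely read the BAM with a dedicated library rather than parse text by hand.

```python
from collections import Counter


def count_reads_per_read_group(sam_lines):
    """Count alignment records per RG:Z: tag in SAM-formatted text lines."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Optional tags start at column 12 (index 11) of a SAM record.
        read_group = "(no read group)"
        for tag in fields[11:]:
            if tag.startswith("RG:Z:"):
                read_group = tag[len("RG:Z:"):]
                break
        counts[read_group] += 1
    return counts


# Made-up records, for illustration only.
example_sam = [
    "@RG\tID:A1\tSM:B1\tPL:ILLUMINA\tLB:D1",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:A1",
    "read2\t0\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:A1",
    "read3\t0\tchr1\t300\t60\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:A2",
    "read4\t0\tchr1\t400\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
read_group_counts = count_reads_per_read_group(example_sam)
```

Reads without any `RG:Z:` tag are counted under a placeholder key, which is one quick way to spot a BAM that still needs read groups added.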

updated 2.6 years ago by Ram 44k • written 9.8 years ago by Pierre Lindenbaum 164k

Ram · Accepted Answer · 2015-02-06

3

9.8 years ago

Devon Ryan 104k

Yes, you can assign dummy names for any and all of these. The read group tags are meant to enable grouping of alignments to account for biases due to things like the library preparation, the machine the samples were sequenced on, etc.

This is mostly useful where you have samples that were each sequenced multiple times, but from different libraries. So then you'd have alignments with the same SM but a different LB. In cases where you just have a single run of each sample, with all samples done in a single batch, then read groups aren't particularly useful.
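
To make the "same SM, different LB" case concrete, here is a small Python sketch that formats `@RG` header lines for one sample sequenced from two libraries. The helper name and the dummy values (taken from the question's A/B/D naming) are only illustrative. Upper-casing PL is an assumption on my part, based on the SAM specification listing platform values such as ILLUMINA and SOLID in upper case; check what your downstream tools actually accept.

```python
def format_rg_header(rg_id, sample, platform, library):
    """Return a SAM @RG header line for the ID, SM, PL and LB tags."""
    fields = [
        ("ID", rg_id),
        ("SM", sample),
        ("PL", platform.upper()),  # assumption: upper-case platform names
        ("LB", library),
    ]
    for key, value in fields:
        # Header fields are tab-separated, so values must be non-empty and tab-free.
        if not value or "\t" in value:
            raise ValueError(f"invalid value for {key}: {value!r}")
    return "@RG\t" + "\t".join(f"{key}:{value}" for key, value in fields)


# One sample ("B") sequenced from two libraries ("D1", "D2"):
# same SM, different LB, and a unique ID per read group.
rg_lines = [
    format_rg_header("A1", "B", "illumina", "D1"),
    format_rg_header("A2", "B", "illumina", "D2"),
]
```

With a single run per sample the same helper would simply be called once per file, and the dummy values would carry no extra information.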

updated 2.6 years ago by Ram 44k • written 9.8 years ago by Devon Ryan 104k

0

Thanks A LOT for the clarification!! From a non-techie's perspective, it's still a little odd that the info is NOT registered in the FASTQ output. Shouldn't it be generated automatically when the machines do the sequencing? If so, couldn't it be extracted automatically and directly from the FASTQ output (or whatever the raw sequencing output format is)?
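
For what it's worth, part of this can be sketched in code. The Python snippet below assumes the newer Illumina-style FASTQ header layout (`@instrument:run:flowcell:lane:tile:x:y` followed by a space and further fields) and pulls out the flowcell and lane, which are commonly combined into a read group ID. The function name and the example header are hypothetical, older or non-Illumina headers will not match this layout, and the sample name (SM) and library (LB) are not in the header at all, so they would still have to be supplied by hand.

```python
def read_group_from_fastq_header(header):
    """Derive flowcell/lane read-group hints from an Illumina-style FASTQ header.

    Assumes the layout '@instrument:run:flowcell:lane:tile:x:y <more fields>'.
    """
    tokens = header.lstrip("@").split()
    if not tokens:
        raise ValueError("empty FASTQ header")
    parts = tokens[0].split(":")
    if len(parts) != 7:
        raise ValueError(f"unexpected header layout: {header!r}")
    instrument, run, flowcell, lane = parts[:4]
    return {
        "ID": f"{flowcell}.{lane}",  # a common convention, not a requirement
        "PL": "ILLUMINA",            # assumed from the header layout
        "flowcell": flowcell,
        "lane": lane,
    }


# Hypothetical header, for illustration only.
hints = read_group_from_fastq_header(
    "@INSTR1:42:FLOWCELLX:3:1101:1000:2000 1:N:0:ACGT"
)
```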

updated 2.6 years ago by Ram 44k • written 9.8 years ago by CrazyB ▴ 280

0

Oh - I apologize for not doing a more comprehensive search (I thought I did) on the Biostars forum before I posted my question. Apparently a similar question was asked 2.8 years ago.

updated 2.6 years ago by Ram 44k • written 9.8 years ago by CrazyB ▴ 280