Question

bamUtils dedup and Library information from BAM header

6.6 years ago

skhan ▴ 10

I have 9 .bam files, produced from 2x75 bp paired-end Illumina reads (RNA-Seq) and aligned with STAR to the Ensembl rat reference genome. Each file has one @RG line with only two entries: ID and SM. So for sample s01, the @RG line looks as follows: @RG ID:s01 SM:s01. I have not included any library information (LB:) in the @RG line.

When I run bamUtil's dedup to mark duplicates, I get the following error for each of the 9 .bam files: WARNING: Cannot find library information in the header line @RG ID:s01 SM:s01 . Using empty string for library name

I'm a beginner here. As best as I can tell the duplication marking seems to have worked well.

Should I be concerned that the input .bam files did not have a library defined? If I need to define a library for each .bam file, could you point me to some insights on what to define as the library? e.g. Should I just set the library to the sample name, so that between the 9 .bam files I will have 9 different libraries?

Thanks,

skhan

bam @RG Library marking duplicates bamUtils • 1.9k views

updated 6.6 years ago by h.mon 35k • written 6.6 years ago by skhan ▴ 10

score 2 · Accepted Answer · 2018-04-15

First: a warning is not an error. With an error, you would get no output. With a warning, you do get output, but you may have to treat it with care and possibly even discard it.

The duplicate marking may have worked, but probably not optimally. The intention is to mark PCR and optical duplicates. PCR duplicates arise at the library preparation step, while optical duplicates arise at the cluster generation step. I don't know the innards of bamUtil's duplicate marking, but it likely uses library information to mark PCR duplicates, so the library field should matter.

If each library was loaded on only a single lane, then all is well. But if you loaded the same library on several lanes or sequencing runs, then the marking of duplicates will be non-optimal.
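If you decide to add the missing library field, here is a minimal sketch of how the header text could be patched. It assumes each sample came from a single library prep, so it falls back to the SM value as the library name. That is an assumption about your experiment, not something bamUtil requires. The function name add_library_tag and the optional library_by_id mapping are hypothetical, introduced only for illustration.

```python
def add_library_tag(header_text, library_by_id=None):
    """Return SAM header text with an LB: field added to @RG lines that lack one.

    header_text   : full SAM header as a string (tab-separated fields).
    library_by_id : optional dict mapping read group ID -> library name
                    (hypothetical helper argument). When an ID is not listed,
                    the sample name (SM) is used as the library name, which
                    assumes one library per sample.
    """
    library_by_id = library_by_id or {}
    patched = []
    for line in header_text.splitlines():
        if line.startswith("@RG"):
            fields = line.split("\t")
            # Parse TAG:value pairs; split only on the first colon.
            tags = dict(f.split(":", 1) for f in fields[1:])
            if "LB" not in tags:
                fallback = tags.get("SM", tags.get("ID", "unknown"))
                library = library_by_id.get(tags.get("ID"), fallback)
                fields.append("LB:" + library)
            line = "\t".join(fields)
        patched.append(line)
    return "\n".join(patched) + "\n"


if __name__ == "__main__":
    header = "@HD\tVN:1.4\tSO:coordinate\n@RG\tID:s01\tSM:s01\n"
    print(add_library_tag(header), end="")
```

One way to use this would be to dump the header with samtools view -H, patch it, and write it back with samtools reheader. Only the header changes, since LB lives on the @RG line and the reads refer to it through their RG tag. If the same library was spread over several lanes, give those read groups the same LB value through library_by_id rather than relying on the fallback.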

Some background at Read Group In Sam/Bam Files: What Do They Exactly Describe? and Read Groups (GATK forums).