Question

Preprocessing The Bam For Gatk

2

Entering edit mode

12.8 years ago

Pierre Lindenbaum 164k

The GATK often requires that the BAM contains some extra informations in the header ( Group ID, Group Library, SO:coordinate...).

Which commands I should invoke after bwa so my SAM/BAM files (1 sample/file) are valid for the GATK ?

For example, I know I can use picard AddOrReplaceReadGroups to fix some fields in the SAM/header, but I would rather like to insert a awk command in the pipeline after bwa sampe.

Thanks

gatk sam bam • 6.3k views

ADD COMMENT • link updated 12.8 years ago by lh3 33k • written 12.8 years ago by Pierre Lindenbaum 164k

1

Entering edit mode

Don't you trust the picard tools? They usually do a good job! Or do you want to speed up things?

ADD REPLY • link 12.8 years ago by Christof Winter ★ 1.1k

0

Entering edit mode

AddOrReplaceReadGroups just adds an header isn't it ? why should I use it when I can insert some headers on the fly ?

ADD REPLY • link 12.8 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

If I am not mistaken, you would need to change the RG tag of each read as well.

ADD REPLY • link 12.8 years ago by Christof Winter ★ 1.1k

score 4 · Answer 1 · 2012-03-05

4

Entering edit mode

12.8 years ago

lh3 33k

Try:

bwa sampe -r '@RG\tID:foo\tSM:bar\tLB:foobar'

For "SO:coordinate", you may sort with Picard, or sort with samtools and reheader later.

ADD COMMENT • link 12.8 years ago by lh3 33k

0

Entering edit mode

Hi, is actually workimg the '-r' option in the sampe command ? My cmdline is like this: bwa sampe -P ucsc.hg19 -r '@RG\tID:foo\tSM:bar' $FASTQPATH"MAP1.sai" $FASTQPATH"MAP2.sai" $TRIM1fq $TRIM2fq > $FASTQPATH"MAP.sam" but I always get: sampe: invalid option -- 'r'

ADD REPLY • link 11.8 years ago by ff.cc.cc ★ 1.3k

1

Entering edit mode

Use 0.6.2/0.5.10.

ADD REPLY • link 11.8 years ago by lh3 33k

0

Entering edit mode

thanks ! (+1) '-r' will be supported in future ?

ADD REPLY • link 11.8 years ago by ff.cc.cc ★ 1.3k

score 2 · Answer 2 · 2012-03-05

Using Picard is a good idea, the only problem it seems you need to create some temporary files since it does not support using unix pipes.

One of the possibilities is also writing a script to do it, but it is going to be more complex than just one liner in awk. It also depends on how you want to sort your reads, do you want all the reads in the same read group, or do you want to add multiple groups based on additional information (lane, clusters). Adding such information can prove useful in downstream analysis if you're using Genome Analysis Toolkit for quality score recalibration or something like that.

For the script part you first need to add an additional line to the header of the file. This line describes the read group like this:

@RG ID:R-0  PL:Illumina PU:0    LB:R    SM:GM12878_GABP

Than for each read belonging to this group you need to add to the end of line (be careful, cells are tab delimited):

RG:Z:R-0

Some easy script for this probably already exists somewhere, but also rewriting it anew shouldn't be a big difficulty.