Question

split reads for different lanes in BAM files

8.8 years ago

SOHAIL ▴ 410

Hi everyone, I have per-sample aligned BAM files for already published genomes of human populations. In the header section I saw an RG record like this:

               @RG     ID:LP6005441-DNA_A09    SM:LP6005441-DNA_A09

It contains only the RG ID and SM fields. But to reproduce the results with the GATK Best Practices, I have to assign the RG information correctly.

Each BAM I have represents a single sample from a single library prep, but the library was run on multiple lanes, as the read names indicate, e.g.:

               HS2000-630_102:4:2115:1889:70619
               HS2000-630_102:3:2311:13151:38215
               HS2000-630_102:2:2315:18670:41735

So, to assign RG information that is unique to the reads from each lane, I want to split each per-sample BAM file into multiple BAMs by flowcell lane. I can then replace the RG information and apply MarkDuplicates and BQSR correctly.
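
To illustrate the lane information carried by these read names, here is a minimal Python sketch. It assumes the older five-field Illumina layout (instrument and run, lane, tile, x, y) seen above; the helper names `lane_key` and `read_group_id` and the ID naming scheme are hypothetical and not taken from any tool.

```python
from collections import Counter


def lane_key(read_name):
    """Return (run_prefix, lane) from a name like 'HS2000-630_102:4:2115:1889:70619'."""
    fields = read_name.split(":")
    if len(fields) < 5:
        raise ValueError(f"unexpected read name layout: {read_name!r}")
    return fields[0], int(fields[1])


def read_group_id(sample, read_name):
    """Build a per-lane read group ID (hypothetical naming scheme)."""
    run_prefix, lane = lane_key(read_name)
    return f"{sample}.{run_prefix}.L{lane}"


names = [
    "HS2000-630_102:4:2115:1889:70619",
    "HS2000-630_102:3:2311:13151:38215",
    "HS2000-630_102:2:2315:18670:41735",
]

# Count reads per (run_prefix, lane) to see how many read groups are needed.
per_lane = Counter(lane_key(name) for name in names)
for (run_prefix, lane), n_reads in sorted(per_lane.items()):
    print(run_prefix, lane, n_reads)
```

Each distinct (run prefix, lane) pair would then get its own @RG entry, all sharing the same SM and LB values because the sample and library are the same.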

I am new to this. Could you please suggest a tool or script that can do this?

Thanks in advance!

ngs • 3.5k views

Hi Pierre, Does this work on BAM files?

ADD REPLY • link 8.8 years ago by SOHAIL ▴ 410

Yes, and it writes BAM too.

ADD REPLY • link 8.8 years ago by Pierre Lindenbaum 166k

score 0 · Answer 1 · 2016-07-19

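I wrote a tool for a similar job in the thread "Advice On Adding Readgroups". It adds read group information to each read on a per-lane basis, using an XML file that describes the groups (groups.xml in the example below).

See https://github.com/lindenb/jvarkit/wiki/Biostar78400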

$ cat input.sam 
@SQ SN:ref  LN:45
@SQ SN:ref2 LN:40
HS2000-1259_127:1:1210:15640:52255  163 ref 7   30  8M4I4M1D3M  =   37  39  TTAGATAAAGAGGATACTG *   XX:B:S,12561,2,20,112
HS2000-1259_128:2:1210:15640:52255  0   ref 9   30  1S2I6M1P1I1P1I4M2I  *   0   0   AAAAGATAAGGGATAAA   *

$ java -jar dist/biostar78400.jar \
    -x groups.xml \
    input.sam

@HD VN:1.4  SO:unsorted
@SQ SN:ref  LN:45
@SQ SN:ref2 LN:40
@RG ID:X1   PL:P1   PU:P1   LB:L1   DS:blabla   SM:S1   CN:C1
@RG ID:x2   PL:P2   PU:P2   LB:L2   DS:blabla   SM:S2   CN:C1
HS2000-1259_127:1:1210:15640:52255  163 ref 7   30  8M4I4M1D3M  =   37  39  TTAGATAAAGAGGATACTG *   RG:Z:X1 XX:B:S,12561,2,20,112
HS2000-1259_128:2:1210:15640:52255  0   ref 9   30  1S2I6M1P1I1P1I4M2I  *   0   0   AAAAGATAAGGGATAAA   *   RG:Z:x2
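
The transformation shown above can be approximated on SAM text with a few lines of Python. This is a rough sketch of the idea, not the tool's implementation: the function name `add_read_groups` and the layout of the `groups` mapping are hypothetical, the @RG field values other than the IDs are placeholders, and real BAM input would need a BAM library or the tool linked above.

```python
def add_read_groups(sam_lines, groups):
    """Tag each SAM record with the read group chosen by its name prefix and lane.

    groups maps (run_prefix, lane) -> dict of @RG fields; each dict must contain 'ID'.
    """
    header, records = [], []
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # Keep the header, but drop old @RG lines; they are rebuilt below.
            if not line.startswith("@RG"):
                header.append(line)
            continue
        cols = line.split("\t")
        run_prefix, lane = cols[0].split(":")[:2]
        rg = groups[(run_prefix, int(lane))]
        # Remove any existing RG tag, then put the new one first among the optional tags.
        tags = [tag for tag in cols[11:] if not tag.startswith("RG:Z:")]
        records.append("\t".join(cols[:11] + ["RG:Z:" + rg["ID"]] + tags))
    rg_lines = [
        "@RG\t" + "\t".join(f"{key}:{value}" for key, value in rg.items())
        for rg in groups.values()
    ]
    return header + rg_lines + records


# Placeholder read group definitions keyed by (run_prefix, lane).
groups = {
    ("HS2000-1259_127", 1): {"ID": "X1", "SM": "S1", "LB": "L1", "PU": "HS2000-1259_127.1"},
    ("HS2000-1259_128", 2): {"ID": "x2", "SM": "S1", "LB": "L1", "PU": "HS2000-1259_128.2"},
}

sam = [
    "@SQ\tSN:ref\tLN:45",
    "HS2000-1259_127:1:1210:15640:52255\t163\tref\t7\t30\t8M4I4M1D3M\t=\t37\t39\t"
    "TTAGATAAAGAGGATACTG\t*\tXX:B:S,12561,2,20,112",
    "HS2000-1259_128:2:1210:15640:52255\t0\tref\t9\t30\t1S2I6M1P1I1P1I4M2I\t*\t0\t0\t"
    "AAAAGATAAGGGATAAA\t*",
]

out = add_read_groups(sam, groups)
print("\n".join(out))
```

A read whose prefix and lane are missing from `groups` raises a `KeyError` here, which makes an incomplete mapping visible instead of leaving reads untagged.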

score 0 · Answer 2 · 2016-07-20

8.8 years ago

SOHAIL ▴ 410

Problem solved! Thanks Pierre for your support at GitHub and here as well. :)