split reads for different lanes in BAM files
8.4 years ago
SOHAIL ▴ 410

Hi everyone, I have per-sample aligned BAM files for already published genomes of human populations. In the header section I see an RG record like this:

               @RG     ID:LP6005441-DNA_A09    SM:LP6005441-DNA_A09

It carries only the read group ID (RGID) and sample (RGSM). To reproduce the results with the GATK Best Practices, I have to assign the RG information correctly.

Each BAM represents a single sample from a single library prep, but the library was run on multiple lanes, as the read names show, e.g.:

               HS2000-630_102:4:2115:1889:70619
               HS2000-630_102:3:2311:13151:38215
              HS2000-630_102:2:2315:18670:41735

So, to assign RG information that is unique to the group of reads from each lane, I want to split the per-sample BAM files into multiple BAMs by flowcell lane. Then I can replace the RG information and run MarkDuplicates and BQSR correctly.
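
To illustrate the grouping described above, here is a minimal Python sketch that pulls the lane out of each read name. It assumes the read-name layout seen in these reads, where the first colon-separated field identifies the instrument/run and the second field is the lane. `lane_key` is a hypothetical helper name, not part of any existing tool.

```python
def lane_key(read_name):
    """Return (run, lane) from a name like 'HS2000-630_102:4:2115:1889:70619'."""
    fields = read_name.split(":")
    if len(fields) < 5:
        # Assumption: run:lane:tile:x:y, as in the reads quoted in the question.
        raise ValueError(f"unexpected read name layout: {read_name!r}")
    return fields[0], fields[1]


names = [
    "HS2000-630_102:4:2115:1889:70619",
    "HS2000-630_102:3:2311:13151:38215",
    "HS2000-630_102:2:2315:18670:41735",
]

# Group read names by (run, lane); each group would become one read group.
groups = {}
for name in names:
    groups.setdefault(lane_key(name), []).append(name)

for (run, lane), members in sorted(groups.items()):
    print(f"{run}.{lane}\t{len(members)} read(s)")
```

Each distinct `(run, lane)` pair would then correspond to one read group, and therefore to one output BAM if the file is physically split.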

I am new to this. Could you please suggest a tool or script that can do this?

Thanks in advance!


Hi Pierre, does this work on BAM files?


Yes, and it writes BAM too.

8.4 years ago

I wrote a tool for a similar job in the thread "Advice On Adding Readgroups".

See https://github.com/lindenb/jvarkit/wiki/Biostar78400

$ cat input.sam 
@SQ SN:ref  LN:45
@SQ SN:ref2 LN:40
HS2000-1259_127:1:1210:15640:52255  163 ref 7   30  8M4I4M1D3M  =   37  39  TTAGATAAAGAGGATACTG *   XX:B:S,12561,2,20,112
HS2000-1259_128:2:1210:15640:52255  0   ref 9   30  1S2I6M1P1I1P1I4M2I  *   0   0   AAAAGATAAGGGATAAA   *

$ java -jar dist/biostar78400.jar \
    -x groups.xml \
    input.sam

@HD VN:1.4  SO:unsorted
@SQ SN:ref  LN:45
@SQ SN:ref2 LN:40
@RG ID:X1   PL:P1   PU:P1   LB:L1   DS:blabla   SM:S1   CN:C1
@RG ID:x2   PL:P2   PU:P2   LB:L2   DS:blabla   SM:S2   CN:C1
HS2000-1259_127:1:1210:15640:52255  163 ref 7   30  8M4I4M1D3M  =   37  39  TTAGATAAAGAGGATACTG *   RG:Z:X1 XX:B:S,12561,2,20,112
HS2000-1259_128:2:1210:15640:52255  0   ref 9   30  1S2I6M1P1I1P1I4M2I  *   0   0   AAAAGATAAGGGATAAA   *   RG:Z:x2
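
For readers who want to see the idea behind the output above, the following is a minimal Python sketch of per-lane read group tagging on SAM text. It is not how Biostar78400 is implemented and does not read `groups.xml`. The function name `tag_by_lane`, the library name `lib1`, and the use of `run.lane` as both ID and PU are assumptions made for illustration. The two alignment records are toy data built around read names from the question.

```python
def tag_by_lane(sam_lines, sample, library, platform="ILLUMINA"):
    """Return SAM lines with one @RG per (run, lane) and a matching RG:Z tag per read."""
    header, records = [], []
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # Drop the old, incomplete @RG lines; keep every other header line.
            if not line.startswith("@RG\t"):
                header.append(line)
        else:
            records.append(line.split("\t"))

    def rg_id(qname):
        # Assumption: read names look like run:lane:tile:x:y.
        run, lane = qname.split(":")[:2]
        return f"{run}.{lane}"

    ids = sorted({rg_id(r[0]) for r in records})
    rg_lines = [
        f"@RG\tID:{i}\tSM:{sample}\tLB:{library}\tPL:{platform}\tPU:{i}"
        for i in ids
    ]

    out = header + rg_lines
    for r in records:
        # Keep the 11 mandatory fields and all optional tags except a stale RG.
        fields = r[:11] + [t for t in r[11:] if not t.startswith("RG:Z:")]
        fields.append("RG:Z:" + rg_id(r[0]))
        out.append("\t".join(fields))
    return out


sam = [
    "@SQ\tSN:ref\tLN:45",
    "@RG\tID:LP6005441-DNA_A09\tSM:LP6005441-DNA_A09",
    "HS2000-630_102:4:2115:1889:70619\t0\tref\t7\t30\t8M\t*\t0\t0\tTTAGATAA\t*\tRG:Z:LP6005441-DNA_A09",
    "HS2000-630_102:3:2311:13151:38215\t0\tref\t9\t30\t8M\t*\t0\t0\tAAAAGATA\t*",
]

for line in tag_by_lane(sam, sample="LP6005441-DNA_A09", library="lib1"):
    print(line)
```

Because every read carries its own `RG:Z` tag after this step, downstream tools can tell the lanes apart without the BAM being physically split into one file per lane.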
8.4 years ago
SOHAIL ▴ 410

Problem solved! Thanks Pierre for your support at GitHub and here as well. :)
