Question

Split A Bam File Into Smaller Files By Tile Number

11.6 years ago

gaelgarcia05 ▴ 280

Hi all,

I would like to split a very big BAM file into smaller files so that I can annotate it in parallel. Someone suggested splitting it by tile number, which is a good idea since it guarantees that all the alignments for a given read end up in the same file.

However, I am stuck on how to phrase the awk command for this, since the tile number is contained within the read ID string in the first field of the alignment, where it is separated from the rest of the string by ":", while the first field itself is separated from the other fields by "\t".

HWI-ST975:104:C0W47ACXX:8:1101:8269:91631

Tile number (encoded) = 1101 (5th ":"-separated field).

How could I use awk to write each line to its corresponding new file based on its tile number?

Thanks, Carmen
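
For reference, a minimal sketch of the kind of awk command asked about here. It assumes the alignments arrive as SAM text on standard input (for example from samtools view), that the tile is always the fifth ":"-separated field of the read name as in the example above, and that per-tile SAM files without a header are acceptable. The function name split_by_tile and the tile_<N>.sam file names are made up for illustration.

```shell
# Hypothetical helper: split SAM text on stdin into one file per tile.
# Usage: split_by_tile OUTPUT_DIR
split_by_tile() {
    outdir=$1
    mkdir -p "$outdir"
    awk -F '\t' -v outdir="$outdir" '
        /^@/ { next }                      # skip SAM header lines, if any
        {
            split($1, name, ":")           # $1 is the read name; split it on ":"
            out = outdir "/tile_" name[5] ".sam"
            print > out                    # awk keeps each file open and appends
        }
    '
}
```

Possible usage: samtools view my.bam | split_by_tile tiles

If the BAM mixes several lanes, the same tile number will occur in each lane, so keying the file name on name[4] "_" name[5] may be safer. Some awk implementations also limit how many output files can be open at once.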

samtools tophat • 3.5k views

updated 11.6 years ago by Pierre Lindenbaum 164k • written 11.6 years ago by gaelgarcia05 ▴ 280

I think I may have a Perl solution to this, but I don't know exactly how to phrase the output. Can anybody help me out? :)

I have made a hash of hashes, where all the lines of a file are sorted into a key of the "master" hash depending on the value of their 5th field.

%Tiles has n keys, where each key is a different $Tile_Number.

Each $Tile_Number key points to a nested hash containing every line whose tile number matches that key. The value stored for each of these nested keys (the lines) is just 1.

$Tiles{$Tile_Number}{$Line}=1, where each $Tiles{$Tile_Number} has many $Line=1 entries.

I want to print each $Tiles{$Tile_Number} hash to a separate file, preferably creating the file when the $Tile_Number key is created and printing each new $Tiles{$Tile_Number}{$Line}=1 entry as it is added, to save memory. Ideally the final value (1) would not be printed, but I can live with it, I guess.

How can I tell perl to open a new file for each key in the "master" hash and print all of its keys?

Thank you, Carmen
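
The streaming idea described in this comment (create the file when a tile is first seen, then write each line as it arrives) does not need a hash of lines at all; remembering which tiles have been seen is enough. A rough sketch, written in awk rather than Perl, assuming SAM text with its header on standard input (samtools view -h). The name split_by_tile_with_header and the output file names are hypothetical.

```shell
# Hypothetical helper: stream SAM text (header included) into per-tile SAM
# files, each starting with a copy of the header.
# Usage: split_by_tile_with_header OUTPUT_DIR
split_by_tile_with_header() {
    outdir=$1
    mkdir -p "$outdir"
    awk -F '\t' -v outdir="$outdir" '
        /^@/ { header = header $0 "\n"; next }   # collect the header lines
        {
            split($1, name, ":")
            tile = name[5]                       # 5th ":"-separated field
            out = outdir "/tile_" tile ".sam"
            if (!(tile in seen)) {               # first read from this tile:
                seen[tile] = 1                   # remember the key only
                printf "%s", header > out        # create the file, write header
            }
            print > out                          # write the line immediately
        }
    '
}
```

Possible usage: samtools view -h my.bam | split_by_tile_with_header tiles

Only the header text and the set of tile numbers are held in memory. Each resulting file should be convertible back to BAM with something like samtools view -b -o tiles/tile_1101.bam tiles/tile_1101.sam (older samtools releases may also need -S for SAM input).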

written 11.6 years ago by gaelgarcia05 ▴ 280

score 1 · Answer 1 · 2013-05-06

11.6 years ago

Pierre Lindenbaum 164k

I just wrote a Java program to split a BAM by tile:

https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/splitbytitle/SplitByTile.java

It uses the Picard library to parse the BAM.

Compilation:

cd src/main/java
javac -cp path/to/picard.jar:path/to/sam.jar com/github/lindenb/jvarkit/tools/splitbytitle/SplitByTile.java

Execution:

java -cp path/to/picard.jar:path/to/sam.jar \
com.github.lindenb.jvarkit.tools.splitbytitle.SplitByTile \
I=my.bam O=tmp/TILE__TILE__/jeter.__TILE__.bam CREATE_INDEX=true