Question

Split a bam file per read

0

Entering edit mode

10.9 years ago

s.vdrzeeuw ▴ 20

So I have a full genome PACbio BAM file and this has a average coverage of 3-6.

Now I want to split the bam file per read. Since readgroup is not working and sometimes still gives me multiple reads inside my BAM.

The reason why I want this is because i want to run MPILEUP with only 1 read per bam, so that I can identify insertions/deletions per read on a given location. Anyone knows a program to do this? Or another approach to do it?

Thanks!

BAM manipulation • 3.3k views

ADD COMMENT • link updated 4.4 years ago by Ram 45k • written 10.9 years ago by s.vdrzeeuw ▴ 20

Ram · Answer 1 · 2014-10-14

1

Entering edit mode

10.9 years ago

Pierre Lindenbaum 166k

I wrote a tool named sam2tsv

It'll show you each base of each read for each position

$ java -jar dist/sam2tsv.jar -A  \
    -r samtools-0.1.18/examples/toy.fa
      samtools-0.1.18/examples/toy.sam
r001    163 ref 0   T   .   7   T   M
r001    163 ref 1   T   .   8   T   M
r001    163 ref 2   A   .   9   A   M
r001    163 ref 3   G   .   10  G   M
r001    163 ref 4   A   .   11  A   M
r001    163 ref 5   T   .   12  T   M
r001    163 ref 6   A   .   13  A   M
r001    163 ref 7   A   .   14  A   M
r001    163 ref 8   A   .   .   .   I
r001    163 ref 9   G   .   .   .   I
r001    163 ref 10  A   .   .   .   I
r001    163 ref 11  G   .   .   .   I
r001    163 ref 12  G   .   15  G   M
r001    163 ref 13  A   .   16  A   M
r001    163 ref 14  T   .   17  T   M
r001    163 ref 15  A   .   18  A   M

UPDATE: and generating one mpileup per read...

samtools view -F 4 ex1.bam | while read L; do (samtools view -H ex1.bam && echo "$L" ) | samtools view -bS - | samtools mpileup -f ex1.fa - ; done

ADD COMMENT • link updated 4.4 years ago by Ram 45k • written 10.9 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

@Pierre Thanks for your reply, can I also extract one mpileup per read with a filtering region II know try it like this:

while read LINE
do
    chr=$( printf "$LINE" | cut -f1 )
    start=$( printf "$LINE" | cut -f2 )
    stop=$( printf "$LINE" | cut -f3 )
    region=$chr:$start-$stop
    echo "Regions:" $region 
    echo "start:" $start
    echo "stop:" $stop
    echo "line:" $LINE

    mkdir -p results/${region/-/_}

    samtools view -F 4 $inBAM $region  | while read L; do (samtools view -H $inBAM && echo "$L" ) | samtools view -bS - | samtools mpileup -f $refGenome -r $region - > results/${region/-/_}/${FILE%.bam}Mpileup.txt; done
done < $BED

unfortunately this fails. I need to have the mpileup files inside a different region directory so that I can distinguish them. Later on and i will not be left over with 1000 files if I have 200 regions in my bed file. Hope this is also possible thanks for you help already!

this is the error I got:

[bam_header_read] EOF marker is absent. The input is probably truncated.
[samopen] SAM header is present: 93 sequences.
[bam_index_load] fail to load BAM index.
[mpileup] fail to load index for -

ADD REPLY • link updated 4.4 years ago by Ram 45k • written 10.9 years ago by s.vdrzeeuw ▴ 20

0

Entering edit mode

run samtools index your.bam before running the loop.

ADD REPLY • link updated 4.4 years ago by Ram 45k • written 10.9 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

I already have my bam file indexed. So this should not be the case or do you mean after splitting the bam per read.

ADD REPLY • link updated 4.4 years ago by Ram 45k • written 10.9 years ago by s.vdrzeeuw ▴ 20

0

Entering edit mode

There is no need to specify the region in mpileup. The reads are already extracted with view

ADD REPLY • link updated 4.4 years ago by Ram 45k • written 10.9 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Hmmm, maybe I should rephrase the question, I need to identify repeat regions for each read. Also I need to know there exact lengths and the lengths after counting of insertions and deletions. Therefore I want to have a separate bam file for each read on each repeat region. So that my follow up script can progress those files into a summary table like this:

Chr ::: startpos ::: norm_length ::: calc_length ::: longest repeat ::: shortest repeat ::: ins bases ::: deleted bases

Maybe this clarifies my needs more, still many thanks for investing your time ;)

ADD REPLY • link updated 4.4 years ago by Ram 45k • written 10.9 years ago by s.vdrzeeuw ▴ 20