Split a bam file per read
1
0
Entering edit mode
10.1 years ago
s.vdrzeeuw ▴ 20

So I have a full genome PACbio BAM file and this has a average coverage of 3-6.

Now I want to split the bam file per read. Since readgroup is not working and sometimes still gives me multiple reads inside my BAM.

The reason why I want this is because i want to run MPILEUP with only 1 read per bam, so that I can identify insertions/deletions per read on a given location. Anyone knows a program to do this? Or another approach to do it?

Thanks!

BAM manipulation • 3.0k views
ADD COMMENT
1
Entering edit mode
10.1 years ago

I wrote a tool named sam2tsv

It'll show you each base of each read for each position

$ java -jar dist/sam2tsv.jar -A  \
    -r samtools-0.1.18/examples/toy.fa
      samtools-0.1.18/examples/toy.sam
r001    163 ref 0   T   .   7   T   M
r001    163 ref 1   T   .   8   T   M
r001    163 ref 2   A   .   9   A   M
r001    163 ref 3   G   .   10  G   M
r001    163 ref 4   A   .   11  A   M
r001    163 ref 5   T   .   12  T   M
r001    163 ref 6   A   .   13  A   M
r001    163 ref 7   A   .   14  A   M
r001    163 ref 8   A   .   .   .   I
r001    163 ref 9   G   .   .   .   I
r001    163 ref 10  A   .   .   .   I
r001    163 ref 11  G   .   .   .   I
r001    163 ref 12  G   .   15  G   M
r001    163 ref 13  A   .   16  A   M
r001    163 ref 14  T   .   17  T   M
r001    163 ref 15  A   .   18  A   M

UPDATE: and generating one mpileup per read...

samtools view -F 4 ex1.bam | while read L; do (samtools view -H ex1.bam && echo "$L" ) | samtools view -bS - | samtools mpileup -f ex1.fa - ; done
ADD COMMENT
0
Entering edit mode

@Pierre Thanks for your reply, can I also extract one mpileup per read with a filtering region II know try it like this:

while read LINE
do
    chr=$( printf "$LINE" | cut -f1 )
    start=$( printf "$LINE" | cut -f2 )
    stop=$( printf "$LINE" | cut -f3 )
    region=$chr:$start-$stop
    echo "Regions:" $region 
    echo "start:" $start
    echo "stop:" $stop
    echo "line:" $LINE

    mkdir -p results/${region/-/_}

    samtools view -F 4 $inBAM $region  | while read L; do (samtools view -H $inBAM && echo "$L" ) | samtools view -bS - | samtools mpileup -f $refGenome -r $region - > results/${region/-/_}/${FILE%.bam}Mpileup.txt; done
done < $BED

unfortunately this fails. I need to have the mpileup files inside a different region directory so that I can distinguish them. Later on and i will not be left over with 1000 files if I have 200 regions in my bed file. Hope this is also possible thanks for you help already!

this is the error I got:

[bam_header_read] EOF marker is absent. The input is probably truncated.
[samopen] SAM header is present: 93 sequences.
[bam_index_load] fail to load BAM index.
[mpileup] fail to load index for -
ADD REPLY
0
Entering edit mode

run samtools index your.bam before running the loop.

ADD REPLY
0
Entering edit mode

I already have my bam file indexed. So this should not be the case or do you mean after splitting the bam per read.

ADD REPLY
0
Entering edit mode

There is no need to specify the region in mpileup. The reads are already extracted with view

ADD REPLY
0
Entering edit mode

Hmmm, maybe I should rephrase the question, I need to identify repeat regions for each read. Also I need to know there exact lengths and the lengths after counting of insertions and deletions. Therefore I want to have a separate bam file for each read on each repeat region. So that my follow up script can progress those files into a summary table like this:

Chr ::: startpos ::: norm_length ::: calc_length ::: longest repeat ::: shortest repeat ::: ins bases ::: deleted bases

Maybe this clarifies my needs more, still many thanks for investing your time ;)

ADD REPLY

Login before adding your answer.

Traffic: 1537 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6