Dear all, I have a problem: I need to count the number of rows (reads) and the GC content in a BAM file for individual segments, but I do not know how to separate it. The segment size is 60 kb. So, for reads whose start coordinate is up to 60000 on chr1, I would get the count of reads (rows) and the GC content. Could you please help me? I have source code to count NR in awk and to compute GC content, but only for all reads in the BAM file. I also have some source code to split a BED file into 60 kb segments, if that helps.
Count of GC
awk '{ n = length($10); print $10, gsub(/[GCgc]/, "", $10) / n; }' your.sam
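If the data are in a BAM rather than a SAM file, the same one-liner can read from samtools view (a sketch; your.bam is just a placeholder name):

samtools view your.bam | awk '{ n = length($10); print $10, gsub(/[GCgc]/, "", $10) / n; }'   # $10 is the SEQ field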
Count of reads (rows)
awk 'END {print NR}'
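This needs alignments on stdin, so pipe them in the same way (again a sketch with placeholder names):

samtools view your.bam | awk 'END {print NR}'    # number of alignment records
samtools view -c your.bam                        # samtools' built-in record counter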
Separate bed file:
$ awk '{
    regionChromosome = $1;
    regionStart = $2;
    regionStop = $3;
    val = 1;
    # walk the region in 60 kb steps, clipping the last window at the region end
    for (baseStart = regionStart; baseStart < regionStop; baseStart += 60000) {
        baseStop = baseStart + 60000;
        if (baseStop > regionStop) { baseStop = regionStop; }
        print regionChromosome"\t"baseStart"\t"baseStop"\t"val++;
    }
}' hg19.extents.bed \
    | bedmap --echo --echo-map-id --delim "\t" 1.bed - > answer.1.bed
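A possible follow-up (a sketch, not tested on your data): BEDOPS also ships a bam2bed converter, so the reads themselves can be turned into BED and counted per window with bedmap --count. Here windows.bed stands for the 60 kb windows produced by the awk step above, sort-bed-sorted as BEDOPS expects, and all file names are placeholders:

bam2bed < your.bam > reads.bed                   # convert alignments to BED (run sort-bed if not already sorted)
bedmap --echo --count --delim "\t" windows.bed reads.bed > reads.per.window.bed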
Thank you so much.
Yes, I understand you, but I need to do it for the whole BAM file. That means I split the BAM file into separate BAM files by chromosome (I get 25 files), and then I have to split every file into parts: 0-60kb, 60kb-120kb, 120kb-180kb, and so on to the end, and for these partitions count GC and fragments.
You can specify the chromosome you want:
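For example (assuming the BAM is coordinate-sorted and indexed so samtools can take region arguments; your.bam is a placeholder):

samtools view your.bam chr1 | awk 'END {print NR}'
samtools view your.bam chr2 | awk 'END {print NR}'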
etc.
Of course, I'd recommend creating a small script to do it, but you don't need to pull apart the bam file to operate on different chromosomes.
OK, I understand this, but I split it because a for loop will be easier for me than having all chromosomes together; I am a total beginner. I don't have any other idea how to do this, because writing out all the segments by hand is impossible.
This is a great opportunity to learn some programming and scripting. I would suggest identifying a local "mentor" who can work with you; that will be much more efficient than using Biostars. If this is not something you feel comfortable with, then I think you might consider identifying a collaborator who can help you with your data, as data analysis is absolutely FULL of small tasks like this that require scripting.
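For what it is worth, here is a rough, untested sketch of the kind of loop meant above, assuming a coordinate-sorted and indexed your.bam, samtools on the PATH, and chromosome names and lengths taken from the BAM header (every file name is a placeholder):

#!/bin/bash
# For every chromosome in the BAM header, walk 60 kb windows and report
# the number of reads starting in each window and their GC fraction.
samtools view -H your.bam | awk '$1 == "@SQ" {
    for (i = 2; i <= NF; i++) {
        if ($i ~ /^SN:/) chrom = substr($i, 4);
        if ($i ~ /^LN:/) len = substr($i, 4);
    }
    print chrom, len;
}' | while read chrom len; do
    start=0
    while [ "$start" -lt "$len" ]; do
        stop=$((start + 60000))
        [ "$stop" -gt "$len" ] && stop=$len
        # samtools region queries are 1-based and inclusive
        samtools view your.bam "${chrom}:$((start + 1))-${stop}" \
            | awk -v wstart="$start" -v wstop="$stop" -v region="${chrom}:${start}-${stop}" '
                $4 > wstart && $4 <= wstop {      # keep reads that START in this window
                    reads++;
                    n += length($10);
                    gc += gsub(/[GCgc]/, "", $10);
                }
                END { printf "%s\t%d\t%s\n", region, reads, (n ? gc / n : "NA") }'
        start=$stop
    done
done > per.window.counts.txt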