Counting summary info (count of reads and average GC) for segment with specific length
1
0
Entering edit mode
8.1 years ago
Korsocius ▴ 260

Dear all, I would like to create table from bam file. Where will be information about GC and for read_count in lenght range.

INPUT: tsv file

CHROM           Start   stop   length  GC
chr1             56971  57065   94  0.287234
chr1            565460  565601  141 0.411348
chr1            754342  754488  146 0.520548
chr1            754392  754544  152 0.532895
chr1            754392  754544  152 0.532895
chr1            767020  767159  139 0.345324

Output:

chrom       region        0-10 lenght          10-20   .................190-200 
chr1        0-60000      Read count,mean GC

So, output will be some summarization in 60kb region. where will be information about count of reads and average GC for this count from 5th column and it will be for each length.

Now I have for binarization:

  for z in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y


do

export $z

for i in {0..249480000..60000}
    do

u=$i
let "u +=60000"

export $i 
export $u 

samtools view *.bam  chr$z:$i-$u | 

done 
done

Thank you Fedik

next-gen genome • 2.7k views
ADD COMMENT
0
Entering edit mode
8.1 years ago
venu 7.1k

I would do something like following. You can loop this over a set of regions or write a small script

paste <(samtools view file.bam 1:0-60000 | wc -l) <(samtools view file.bam 1:0-60000 | awk -F'\t' '{print $10}' | tr -d '\n' | grep -o "GC" | wc -l) <(samtools view file.bam 1:0-60000 | awk -F'\t' '{print $10}' | tr -d '\n' | awk '{print length($1)}') | awk '{print "1:0-60000\t"$1"\t"$2/$3*100}' | sed '1s/^/region\treadCount\tGCper\n/'

Result

region  readCount       GCper
1:0-60000       18237   5.27966
ADD COMMENT
0
Entering edit mode

I ve got this. But I need separate by length in region.... This is biggest problem for me. And for you the easiest way :

for z in *.bam
do

bedtools map -a /home/fil/Desktop/binary_NIFTY.bed -b $z -c 10,10 -o count,concat | awk -v OFS="\t" 'BEGIN {print "CHR\tSTART\tSTOP\tCount_READS\tGCcontent"} {n=length($5); gc=gsub("[gcGC]", "", $5); print $1,$2,$3,$4,gc/n}' > "$z".tsv

done
ADD REPLY
0
Entering edit mode

its great if your problem is solved. I'm experimenting a lot with process substitutions. This solution came as a result of one of experiments. Anyways the problem is solved :)

ADD REPLY
0
Entering edit mode

No it is not :-( I need summary for length distribution. F.e. : for chr1:0-60000 GC and Read count for length 0-10 and GC and Read count for length 10-20 etc to 190-200bp...and for another region...

ADD REPLY
0
Entering edit mode

Change regions and follow above mentioned method. It will work.

ADD REPLY
0
Entering edit mode

I am really lost in space with this :-(

ADD REPLY

Login before adding your answer.

Traffic: 2694 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6