Hey guys!
I'm trying to calculate one thing and I'm sure someone else has also done this. So maybe someone have a script ready!
I have a draft de novo genome, and augustus predicted too many genes (54 thousand), so I've aligned my RNA-seq to the exons predicted to discharge the exons with bad (5 or less) bases covered by rna-seq reads.
So, I have submitted my bam.sorted file to genomeCoverageBed script with the flag -bga, so I can have the info of 0 coverage.
My output is like this:
itr6_2569_pilon_pilon_pilon_pi.g4.t1.cds1 0 62 0
itr6_2569_pilon_pilon_pilon_pi.g4.t1.cds1 89 90 0
itr6_2569_pilon_pilon_pilon_pi.g4.t1.cds1 224 225 0
itr6_2569_pilon_pilon_pilon_pi.g4.t1.cds1 252 253 0
Which means, for example for line 1, the exon 'tr6_2569_pilon_pilon_pilon_pi.g4.t1.cds1' has zero coverage in the interval 0 to 62 bases. Next line shows another interval for the same exon...
I also have a file with the exon IDs and their total length. Like this:
tr6_2575_pilon_pilon_pilon_pi.g61.t1.cds5 105
itr6_2575_pilon_pilon_pilon_pi.g61.t1.cds6 69
itr6_2575_pilon_pilon_pilon_pi.g61.t1.cds7 114
itr6_2575_pilon_pilon_pilon_pi.g61.t1.cds8 90
So, the info I would like to extract is: to select the exons that have 70% or more of its total sequence covered by 0 bases.. So I can take them out of my genome prediction.
Does anyone has done that and can help me?
Thank you so much!
Have you considered just using the TransRate package?
Thanks, Devon! I'll have a look!
I was also thinking in a simple solution: maybe just using 'samtools indxstat' with my bam file. I wont have information of coverage per base, but I'll have size of contig and number of reads mapped, so a raw estimate I can do from there.
I'll play around with different things today.. Thanks!