Hi guys,
I'm looking for an existing tool to see the read distribution over genes in an RNA-seq sample. Just to be clearer, I want something like the output of the script read_distribution.py
into the RseQC toolset but into every gene, not just a resume of the whole alignment file. Something like:
GENE_ID,3'UTR/CDS
gene_1, n
gene_2, n
What I've tried so far is to count specifically for 3' UTR and for CDS using htseq-count
: then I've normalized the count value to the feature length (so a value count/kb
) and after I've done the relationship between the UTR and CDS values.
Now, could this approach considered legit to have a similar information, considering that I'm working with a poorly annotated genome? If no, is there a tool out there which does the task? Coding it by scratch in Python requires time and I don't want to reinvent the weel, so it's my last choice.
Ok but take for example genomecov. It gives you the output of the whole genome, or by each position. It is not designed to give stats about each gene by default giving an annotation file and an alignment one. That's the reason why I've tried the approach with
htseq-count
. The output results, while normalized by feature lengths, seems so different from theread_distribution.py
one.ah indeed
genomecov
does not do that, I meant to writebedtools coverage
instead, for that you need to transform the alignment to bed file first though.when it comes to computing coverages there could be substantial differences in what each tool considers an overlapping read/feature pair (must be full, partial to a threshold, any overlap etc) and also what to do in situations where a read overlaps with multiple features (take both, take one, take none etc), and what to do when a read aligns in multiple regions but only one such region overlaps with a feature etc.
Often the various choices lead to minor differences, in other times there may substantial changes.
bedtools coverage
seems doing what I was looking for, after some adjustment on output. Thanks!