Entering edit mode
8.5 years ago
sumithrasank75
▴
140
I have a sorted bam file. I was wondering how to get the read counts which are near refseq genes. I used bedtools to get a bed file : bedtools bamtobed -i sorted.bam > sorted.bed I am wondering how to merge this with Refseq gene coordinates.
While not totally clear there is this requirement in original question
hi genomax2,
I noted "near refseq genes" but discounted it assuming a matter of semantics. Nonetheless if the user actually wants to get read counts "near refseq genes" and not "within refseq genes", then a threshold is needed, lets say a flank of 1kb. Then a custom GTF file can be made which would have instead of gene coordinates, flank regions. Like this, for 1kb upstream region -
sample of the file (Ensembl GTF)
Replacing gene coordinates with upstream flank (1kb) -
Working only with gene definition lines of the GTF. The resulting custom GTF can then be fed to HTSeq-count/ featureCounts.
Sinji has already suggested use of
bedtools slop
. That would be okay I think when gene + Flank count is needed. Nonetheless there must be easier way of getting counts from gene flanks only using BED file gene anno. & bedtools or bedops. The only advantage of sticking with HTSeq-count/ featureCounts, I find, is the handling of overlapping reads as noted in the image here.(Have tried editing the GTF snippets for clarity)
If we're interested in only the Flank count then you can go ahead and use
bedtools flank
which will only return flank information and not gene information. Might need to do a bit of awk here, I haven't usedflank
in a while, but it's fairly simple. Of course your way is probably much easier assuming we are using the same flank region on both sides. If not strand information must be taken into account.In that case he could use
bedtools slop
to extend RefSeq genes X amount of bp and then use the above answer i'd assume.