Question

Non-overlapping, de-duplicated BED file download with exons as separate records

8.9 years ago

abaluapuri • 0

I would like to have a table of all genes (hg19 assembly, RefSeq) with their respective exons as separate records, but without isoforms, duplications or alternative splice variants. How do I download such a BED file?

Secondly, I would like to calculate the percentage of tags falling in exons, introns and intergenic regions of the hg19 assembly using the above BED file (and similar files for introns, etc.). Should I use coverageBed or BEDOPS?

Thanks in advance,

ChIP-Seq Bedtools UCSC Browser Table • 2.3k views

updated 8.9 years ago by tiago211287 ★ 1.5k • written 8.9 years ago by abaluapuri • 0

score 5 · Accepted Answer · 2016-10-02

From a Linux command line, you can use this mysql command to fetch the start and end positions of all exons in hg19, together with the strand:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19  -N -e 'select chrom,exonStarts,exonEnds,name2,strand from refGene ' > h19.genes

After that, run this awk command to split the comma-separated exon fields into one row per exon:

awk '{ n = split($2, a, ","); split($3, b, ","); for(i=1; i<n; ++i) print $1, a[i], b[i], $4, $5 }' h19.genes > h19.genes.bed
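To see what the split does without querying UCSC, here is a minimal sketch on one invented refGene-style row (the gene name GENEA and the coordinates are made up, and the function name split_exons is just for illustration). It uses the same loop as the command above, with tab output added. The exonStarts and exonEnds fields end with a trailing comma, so split() returns one extra empty element, which is why the loop stops at i < n.

```shell
# Same idea as the awk one-liner above, wrapped in a function (hypothetical name).
# Input columns: chrom, exonStarts, exonEnds, name2, strand.
split_exons() {
  awk -v OFS='\t' '{
    n = split($2, starts, ",")   # trailing comma gives one empty last element
    split($3, ends, ",")
    for (i = 1; i < n; ++i)      # i < n skips that empty element
      print $1, starts[i], ends[i], $4, $5
  }'
}

# One invented gene with two exons: prints two tab-separated rows.
printf 'chr1\t100,300,\t200,400,\tGENEA\t+\n' | split_exons
```

Each exon ends up on its own row with the gene name and strand repeated.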

You will need to sort the BED file by chromosome and then by start position, because bedtools merge expects position-sorted input:

sort -k1,1V -k2,2n h19.genes.bed > h19.genes.sorted.bed

The awk output is separated by spaces instead of tabs, and this will crash bedtools. To fix it, use:

sed -i 's/ \+/\t/g' h19.genes.sorted.bed

Then copy the 5th column into the 6th, as bedtools merge -s expects the strand in the 6th column:

awk -v OFS="\t" '{print $1,$2,$3,$4,$5,$5}' h19.genes.sorted.bed > h19.genes.bed

Run bedtools merge to collapse all isoform variants into one:

bedtools merge -s -c 4 -o distinct -i h19.genes.bed > tmp && mv tmp h19.genes.bed

Then use bedtools multicov to count the number of tags overlapping each BED record.
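For the percentage asked about in the question, a rough sketch: assuming the multicov output has the tag count as its last column, you can sum that column and divide by the total number of mapped tags. The function name percent_in_regions, the total of 200 tags and the counts below are all invented for illustration. Note that a tag overlapping two records would be counted twice, so treat the result as an approximation.

```shell
# percent_in_regions TOTAL_TAGS
# Reads rows on stdin whose last column is a per-interval tag count
# and prints the summed count as a percentage of TOTAL_TAGS.
percent_in_regions() {
  awk -v total="$1" '
    { sum += $NF }
    END { printf "%.2f\n", (total > 0 ? 100 * sum / total : 0) }
  '
}

# Invented example: 30 + 20 tags in exons out of 200 mapped tags.
printf 'chr1\t100\t200\tGENEA\t+\t+\t30\nchr1\t300\t400\tGENEA\t+\t+\t20\n' \
  | percent_in_regions 200
```

The same function can be pointed at counts for an intron or intergenic BED file to get the other two percentages.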

Alternatively, you can use the Table Browser at http://tinyurl.com/jcd3ftx and set the options you need, once for exons and once for introns.

The caveat here is that you do not get the gene names, only UCSC IDs (at least as far as I know).

For intergenic regions, I do not know of any simple way to get this. I would construct a BED file simply by taking the regions between genes, using the coordinates of each gene.
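One way that idea could look in awk, as a sketch only: given gene intervals sorted by chromosome and start, plus a two-column file of chromosome sizes, print the gaps between genes. The function name intergenic_gaps, the chromosome size and the coordinates are invented. Chromosomes with no genes at all are not reported by this version. bedtools also has a complement subcommand that takes a BED file and a genome file, which may be the simpler route.

```shell
# intergenic_gaps SIZES_FILE
# SIZES_FILE: chrom <tab> length. stdin: BED intervals sorted by chrom, then start.
intergenic_gaps() {
  awk -v OFS='\t' '
    NR == FNR { size[$1] = $2 + 0; next }     # first file: chromosome sizes
    $1 != chrom {                             # reached a new chromosome
      if (chrom != "" && pos < size[chrom]) print chrom, pos, size[chrom]
      chrom = $1; pos = 0
    }
    {
      if ($2 + 0 > pos) print chrom, pos, $2  # gap before this interval
      if ($3 + 0 > pos) pos = $3 + 0          # furthest end seen so far
    }
    END { if (chrom != "" && pos < size[chrom]) print chrom, pos, size[chrom] }
  ' "$1" -
}

# Invented example: three genes on a 1000 bp chromosome.
sizes=$(mktemp)
printf 'chr1\t1000\n' > "$sizes"
printf 'chr1\t100\t200\nchr1\t150\t400\nchr1\t600\t700\n' | intergenic_gaps "$sizes"
rm -f "$sizes"
```

Tracking the furthest end seen so far means overlapping genes do not produce spurious gaps.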

Maybe a more senior bioinformatician here could help more.