Non-overlapping, de-duplicated BED file download with exons as separate records
8.2 years ago
abaluapuri • 0

I would like to have a table of all genes (hg19 assembly, RefSeq) with their respective exons as separate records, but without isoforms, duplicates or alternative splice variants. How do I download such a BED file?

Secondly, I would like to calculate the percentage of tags falling in exons, introns and intergenic regions of the hg19 assembly using the above BED file (and similar files for introns etc.). Should I use coverageBed or BEDOPS?

Thanks in advance,

ChIP-Seq Bedtools UCSC Browser Table • 2.1k views
8.2 years ago
tiago211287 ★ 1.5k

From the Linux command line you can use this mysql command to get the start and end positions of all exons in hg19, together with the gene name and strand:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19  -N -e 'select chrom,exonStarts,exonEnds,name2,strand from refGene ' > h19.genes

After that, run this awk command to split the comma-separated exon start and end fields into one row per exon (tab-separated output):

awk -v OFS='\t' '{ n = split($2, a, ","); split($3, b, ","); for(i=1; i<n; ++i) print $1, a[i], b[i], $4, $5 }' h19.genes > h19.genes.bed
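To see what the split does, here is a minimal sketch on a single made-up row. The gene name GENE1, the coordinates and the example.* file names are invented for illustration and do not come from refGene.

```shell
# One refGene-style row: chrom, exonStarts, exonEnds, name2, strand (tab-separated).
# GENE1 and all coordinates are hypothetical.
printf 'chr1\t100,300,500,\t200,400,600,\tGENE1\t+\n' > example.genes

# exonStarts ends with a trailing comma, so split() returns 4 for 3 exons;
# looping while i < n therefore visits only the 3 real exons.
awk -v OFS='\t' '{ n = split($2, a, ","); split($3, b, ","); for (i = 1; i < n; ++i) print $1, a[i], b[i], $4, $5 }' example.genes > example.exons.bed

cat example.exons.bed
```

Each exon of the toy gene ends up on its own line with the gene name and strand repeated.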

You will need to sort the BED file by chromosome and then by start position, because bedtools merge expects position-sorted input:

sort -k1,1V -k2,2n h19.genes.bed > h19.genes.sorted.bed

Sometimes the file has spaces instead of tabs, which makes bedtools fail. To fix it, use:

sed -i 's/ \+/\t/g' h19.genes.sorted.bed

Then copy the 5th column (strand) into the 6th, because bedtools merge expects the strand in the 6th column:

awk -v OFS="\t" '{print $1,$2,$3,$4,$5,$5}' h19.genes.sorted.bed > h19.genes.bed

Run bedtools merge to collapse the overlapping exons of all isoforms into single records:

bedtools merge -s -c 4 -o distinct -i h19.genes.bed > tmp && mv tmp h19.genes.bed
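Because -s merges each strand separately, records on opposite strands can still overlap after this step. A quick check along these lines can list any remaining overlaps. This is only a sketch on an invented four-column file (chrom, start, end, name); the example.* file names and coordinates are hypothetical.

```shell
# Toy stand-in for merged output: chrom, start, end, name (all values invented).
printf 'chr1\t100\t200\tGENE1\nchr1\t150\t250\tGENE2\nchr1\t300\t400\tGENE1\n' > example.merged.bed

# Sort by chromosome, then start. Print every record that starts before the
# furthest end seen so far on the same chromosome (BED intervals are half-open,
# so start == previous end is not an overlap).
sort -k1,1 -k2,2n example.merged.bed |
  awk -F'\t' '$1 == chrom && $2 + 0 < end { print "overlap:", $0 }
              { if ($1 != chrom || $3 + 0 > end) end = $3 + 0; chrom = $1 }' > example.overlaps.txt

cat example.overlaps.txt
```

If the check prints nothing, the file is non-overlapping. If strand does not matter for the analysis, running the merge without -s is one way to remove the remaining overlaps.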

Then use bedtools multicov to count the number of reads (tags) overlapping each BED record.
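Turning those counts into a percentage could look roughly like this. The sketch assumes a single BAM file, so that the count is the last column of the multicov output. The example.* files, the counts and the total of 100 tags are made up.

```shell
# Toy stand-in for multicov output with one BAM: BED columns plus a count in the last column.
printf 'chr1\t100\t200\tGENE1\t12\nchr1\t300\t400\tGENE1\t8\n' > example.counts.bed

# Hypothetical total number of mapped tags in the library; in practice this
# would be counted from the alignment file itself.
total_tags=100

# Sum the per-record counts and express them as a percentage of all tags.
awk -v total="$total_tags" '{ sum += $NF } END { printf "%.1f\n", 100 * sum / total }' example.counts.bed > example.percent.txt

cat example.percent.txt
```

A tag that overlaps two records is counted twice this way, so the result is an upper estimate unless the records are far enough apart that no read spans two of them.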

Alternatively, you can use the Table Browser at http://tinyurl.com/jcd3ftx and set the options you need, running one query for exons and another for introns: [two screenshots of the Table Browser settings]

The caveat here is that you do not get the gene names, only UCSC IDs (at least as far as I know).

For intergenic regions I do not know of any simple way to get this. I would construct a BED file by taking the regions between genes, using the coordinates of each gene.
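One way to sketch that idea in awk is to walk through the sorted gene intervals and print the gaps between them. This assumes whole-gene intervals (not exons, since the gaps between exons are introns) sorted by chromosome and start, plus a two-column chromosome sizes file. Both toy files and their values are invented here. bedtools also has a complement subcommand (bedtools complement -i genes.bed -g genome.sizes) that should give the same kind of output.

```shell
# Toy gene intervals (sorted by chromosome, then start) and a toy chromosome size.
printf 'chr1\t100\t200\nchr1\t150\t300\nchr1\t500\t600\n' > example.genes.bed
printf 'chr1\t1000\n' > example.chrom.sizes

# First file: chromosome sizes. Second file: gene intervals.
# pos[chrom] holds the furthest gene end seen so far on that chromosome;
# a gene starting beyond it leaves an intergenic gap, which is printed.
awk -F'\t' -v OFS='\t' '
  NR == FNR { size[$1] = $2; next }
  { if ($2 + 0 > pos[$1] + 0) print $1, pos[$1] + 0, $2
    if ($3 + 0 > pos[$1] + 0) pos[$1] = $3 }
  END { for (c in size) if (pos[c] + 0 < size[c] + 0) print c, pos[c] + 0, size[c] }
' example.chrom.sizes example.genes.bed > example.intergenic.bed

cat example.intergenic.bed
```

The trailing gap of each chromosome is written in the END block in no particular order, so with more than one chromosome the output should be re-sorted before further use.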

Maybe a more senior bioinformatician here could help more.
