Question

Locus Reference Genomic (LRG) GTF

0

Entering edit mode

7.7 years ago

Robert Sicko ▴ 640

Is anyone aware of a source of LRG data in GTF format? I've tried searching "convert (any of the forms available on their download page) to GTF" but I've found nothing straight forward.

I'm trying to annotate regions of missing coverage from a clinical targeted NGS panel with a label of intron, exon or utr and using the LRG, I've found, simplifies things. I tried something similar to this to generate GTF files from GENCODE data, but the alternative transcripts for many genes are causing problems.

LRG clinical genomics NGS GTF bed • 2.1k views

ADD COMMENT • link 7.7 years ago by Robert Sicko ▴ 640

score 0 · Answer 1 · 2017-11-03

For anyone who comes across this problem, I figured out a solution using the UCSC table browser and extract-transcript-regions

I obtained the refseq ID for each LRG I was interested in using the xref file provided on LRG's download page. Then I used the UCSC table browser to download a GTF file using the refseq IDs. Finally I Stephen Floor's extract-transcript-regions script to convert this GTF to a bed file.

Note: this script converts to bed12 or blockbed. You can use bedtools bed12ToBed6 to convert the bed6 file. I used the -n option to use score field to track feature (exon) number; downstream I used awk to bring this number back into a label in my final bed file.

python extract_transcript_regions.py -i genes.gtf -o genes --gtf

now convert this blockbed (bed12) to bed6

cat genes_cds.bed | bed12ToBed6 -i stdin -n > genes_cds_bed6.bed

cat genes_3utr.bed | bed12ToBed6 -i stdin -n > genes_3utr_bed6.bed

cat genes_5utr.bed | bed12ToBed6 -i stdin -n > genes_5utr_bed6.bed

cat genes_introns.bed | bed12ToBed6 -i stdin -n > genes_introns_bed6.bed

I wanted my final bed to have 10bp of each intron (5' and 3').

awk '{OFS="\t"} {print $1,$2,$2+10,$4"_"$5"_L",0,$6"\n"$1,$3-10,$3,$4"_"$5"_R",0,$6}' genes_introns_bed6.bed > genes_introns_10bp_bed6.bed

I also wanted to trim UTRs to 200bp. for 3':

awk '{OFS="\t"} {if($3-$2 >200 && $6 =="-") {print $1,$3-200,$3,$4"_"$5,0,$6;} else if($3-$2 >200 && $6 =="+") {print $1,$2,$2+200,$4"_"$5,0,$6;} else {print $1,$2,$3,$4"_"$5,0,$6;}}' genes_3utr_bed6.bed > genes_3utr_200bp_bed6.bed

for 5':

awk '{OFS="\t"} {if($3-$2 >200 && $6 =="-") {print $1,$2,$2+200,$4"_"$5,0,$6;} else if($3-$2 >200 && $6 =="+") {print $1,$3-200,$3,$4"_"$5,0,$6;} else {print $1,$2,$3,$4"_"$5,0,$6;}}' genes_5utr_bed6.bed > genes_5utr_200bp_bed6.bed

now clean up the labels on the CDS bed

awk '{OFS="\t"} {print $1,$2,$3,$4"_"$5,0,$6}' genes_cds_bed6.bed > genes_cds_cleaned_bed6.bed

finally, combine and sort all of the bed files to create my final region of interest bed file

cat genes_introns_10bp_bed6.bed genes_cds_cleaned_bed6.bed genes_5utr_200bp_bed6.bed genes_3utr_200bp_bed6.bed | sortBed -i - > genes_final_roi.sorted.bed