Question

Gencode Gtf To Bed12 Or Psl

4

Entering edit mode

13.1 years ago

Geparada ★ 1.5k

Hi!

I has been using gene models available in UCSC Table Browser for a long time. Those are in BED12 or PSL and I made many scripts to process this kind of format, but now I want to work with the last released version of Gencode (v9) which are only in GTF format.

Do you know a script to convert GTF in BED12 or PSL?

Thanks for you help!

ucsc annotation gtf bed • 17k views

ADD COMMENT • link updated 9.0 years ago by Czh3 ▴ 190 • written 13.1 years ago by Geparada ★ 1.5k

1

Entering edit mode

 

#!/usr/PATH/TO/python3

'''

gtf2bed.py converts GTF file to BED file.

Usage: gtf2bed.py {OPTIONS} [.GTF file]

History

Nov.5th 2012:

1. Allow conversion from general GTF files (instead of only Cufflinks supports).

2. If multiple identical transcript_id exist, transcript_id will be appended a string like "_DUP#" to separate.

'''



import sys;

import re;



if len(sys.argv)<2:

print('This script converts .GTF into .BED annotations.\n');

print('Usage: gtf2bed {OPTIONS} [.GTF file]\n');

print('Options:');

print('-c color\tSpecify the color of the track. This is a RGB value represented as "r,g,b". Default 255,0,0 (red)');

print('\nNote:');

print('1\tOnly "exon" and "transcript" are recognized in the feature field (3rd field).');

print('2\tIn the attribute list of .GTF file, the script tries to find "gene_id", "transcript_id" and "FPKM" attribute, and convert them as name and score field in .BED file.');

print('Author: Wei Li (li.david.wei AT gmail.com)');

sys.exit();



color='255,0,0'





for i in range(len(sys.argv)):

if sys.argv[i]=='-c':

color=sys.argv[i+1];





allids={};



def printbedline(estart,eend,field,nline):

try:

estp=estart[0]-1;

eedp=eend[-1];

# use regular expression to get transcript_id, gene_id and expression level

geneid=re.findall(r'gene_id \"([\w\.]+)\"',field[8])

transid=re.findall(r'transcript_id \"([\w\.]+)\"',field[8])

fpkmval=re.findall(r'FPKM \"([\d\.]+)\"',field[8])

if len(geneid)==0:

print('Warning: no gene_id field ',file=sys.stderr);

else:

geneid=geneid[0];

if len(transid)==0:

print('Warning: no transcript_id field',file=sys.stderr);

transid='Trans_'+str(nline);

else:

transid=transid[0];

if transid in allids.keys():

transid2=transid+'_DUP'+str(allids[transid]);

allids[transid]=allids[transid]+1;

transid=transid2;

else:

allids[transid]=1;

if len(fpkmval)==0:

#print('Warning: no FPKM field',file=sys.stderr);

fpkmval='100';

else:

fpkmval=fpkmval[0];

fpkmint=round(float(fpkmval));

print(field[0]+'\t'+str(estp)+'\t'+str(eedp)+'\t'+transid+'\t'+str(fpkmint)+'\t'+field[6]+'\t'+str(estp)+'\t'+str(eedp)+'\t'+color+'\t'+str(len(estart))+'\t',end='');

seglen=[eend[i]-estart[i]+1 for i in range(len(estart))];

segstart=[estart[i]-estart[0] for i in range(len(estart))];

strl=str(seglen[0]);

for i in range(1,len(seglen)):

strl+=','+str(seglen[i]);

strs=str(segstart[0]);

for i in range(1,len(segstart)):

strs+=','+str(segstart[i]);

print(strl+'\t'+strs);

except ValueError:

print('Error: non-number fields at line '+str(nline),file=sys.stderr);













estart=[];

eend=[];

# read lines one to one

nline=0;

prevfield=[];

prevtransid='';

for lines in open(sys.argv[-1]):

field=lines.strip().split('\t');

nline=nline+1;

if len(field)<9:

print('Error: the GTF should has at least 9 fields at line '+str(nline),file=sys.stderr);

continue;

if field[1]!='Cufflinks':

pass;

#print('Warning: the second field is expected to be \'Cufflinks\' at line '+str(nline),file=sys.stderr);

if field[2]!='exon' and field[2] !='transcript':

#print('Error: the third filed is expected to be \'exon\' or \'transcript\' at line '+str(nline),file=sys.stderr);

continue;

transid=re.findall(r'transcript_id \"([\w\.]+)\"',field[8]);

if len(transid)>0:

transid=transid[0];

else:

transid='';

if field[2]=='transcript' or (prevtransid != '' and transid!='' and transid != prevtransid):

#print('prev:'+prevtransid+', current:'+transid);

# A new transcript record, write

if len(estart)!=0:

printbedline(estart,eend,prevfield,nline);

estart=[];

eend=[];

prevfield=field;

prevtransid=transid;

if field[2]=='exon':

try:

est=int(field[3]);

eed=int(field[4]);

estart+=[est];

eend+=[eed];

except ValueError:

print('Error: non-number fields at line '+str(nline),file=sys.stderr);

# the last record

if len(estart)!=0:

printbedline(estart,eend,field,nline);

ADD REPLY • link updated 9.5 years ago by Gjain 5.8k • written 9.5 years ago by xiachongjing ▴ 10

score 3 · Answer 1 · 2013-02-16

3

Entering edit mode

11.8 years ago

sjneph ▴ 690

Take a look at gtf2bed. As with all conversion utilities available in BEDOPS, there is no loss of information in the conversions to a BED-like format supported by that tool suite.

ADD COMMENT • link 11.8 years ago by sjneph ▴ 690

score 2 · Answer 2 · 2011-11-21

The UCSC source tree has utilities for gtfToGenePred (convert a GTF file to a genePred) and genePredToPsl (Program to create fake psl alignments from genePred records). The latter can output BED with the -bedFormat option. These (like all source tree utilities) can be piped together do the conversion you are looking for. The source tree and instructions for compilation are here: http://hgdownload.cse.ucsc.edu/admin/jksrc.zip

score 2 · Answer 3 · 2015-12-09

2

Entering edit mode

9.0 years ago

Czh3 ▴ 190

This script may help you.

https://github.com/Czh3/NGSTools/blob/master/script/gtf2bed12.sh

ADD COMMENT • link 9.0 years ago by Czh3 ▴ 190

1

Entering edit mode

Hi,

just a note that the script is a really nice idea based on UCSC tools. However, currently it's not converting to the correct format.

Your Column 11: exon start site relative to chromosome start

Bed12 Column 11: exon length

Your Column 12: exon end site relative to chromosome start

Bed12 Column 12: exon start site relative to feature start

Best wishes, Jakub

ADD REPLY • link 8.7 years ago by jakub ▴ 40

0

Entering edit mode

Thank you for pointing this out! Furthermore, there is also an UCSC tool genePredToBed: GitHub link. Thus you could use the output of gtfToGenePred as input of genePredToBed. Something like

gtfToGenePred input.gtf output.tmp; genePredToBed output.tmp output.bed; rm output.tmp

ADD REPLY • link 7.7 years ago by e.rempel ★ 1.1k

score 1 · Answer 4 · 2011-11-21

1

Entering edit mode

13.1 years ago

Gjain 5.8k

The gencode V9 GTF format is similar to their V3 format.

this link should help you with everything what you are looking for.

http://www.sanger.ac.uk/resources/databases/encode/gencodeformat.html

hope this helps.

ADD COMMENT • link 13.1 years ago by Gjain 5.8k

0

Entering edit mode

Thanks for the link, now I understand the fields meaning, but I asked for some script to convert this format to BED12. Well, I think I have to write it myself...

ADD REPLY • link 13.1 years ago by Geparada ★ 1.5k

0

Entering edit mode

yup. all you need to do is parse the fields and then just rearrange it for the format you want. let me know if you need help in that.

ADD REPLY • link 13.1 years ago by Gjain 5.8k

score 1 · Answer 5 · 2012-02-07

1

Entering edit mode

12.9 years ago

Statler ▴ 10

Actually, use the ucsc table browser in the Preview version of the genome browser, you select the genes/gene prediction table, then the appropriate dataset.

Select to download as bed format and you are done.

ADD COMMENT • link 12.9 years ago by Statler ▴ 10

0

Entering edit mode

Since not much time ago I'm using the Preview version of the genome browser... I didn't know about it when I did this question!

ADD REPLY • link 12.8 years ago by Geparada ★ 1.5k

score 0 · Answer 6 · 2011-11-23

0

Entering edit mode

13.1 years ago

Anna ▴ 140

hi, I am having the same problem to convert from gff to bed12 - since the latter is a requirement for coverageBed -split. any ideas? or should I write my own thing?

thanks!

Anna

ADD COMMENT • link 13.1 years ago by Anna ▴ 140

2

Entering edit mode

Sure, take a look at gff2bed, and the bedmap program from the same tool suite. No reason to convert to bed12 - the gff2bed script will fully convert all information for you to a BED-like format that bedmap can use to give you the coverage information you want.

ADD REPLY • link 11.8 years ago by sjneph ▴ 690

0

Entering edit mode

@anna, you should delete this answer and re-post as a new question.

ADD REPLY • link 13.1 years ago by Casey Bergman 18k

0

Entering edit mode

Could you please tell in more detail what problems you are facing? I have a script i wrote which does the same.

ADD REPLY • link 13.1 years ago by Gjain 5.8k