Gencode Gtf To Bed12 Or Psl
6
4
Entering edit mode
13.2 years ago
Geparada ★ 1.5k

Hi!

I has been using gene models available in UCSC Table Browser for a long time. Those are in BED12 or PSL and I made many scripts to process this kind of format, but now I want to work with the last released version of Gencode (v9) which are only in GTF format.

Do you know a script to convert GTF in BED12 or PSL?

Thanks for you help!

ucsc annotation gtf bed • 17k views
ADD COMMENT
1
Entering edit mode
 

#!/usr/PATH/TO/python3

'''

gtf2bed.py converts GTF file to BED file.

Usage: gtf2bed.py {OPTIONS} [.GTF file]

History

Nov.5th 2012:

1. Allow conversion from general GTF files (instead of only Cufflinks supports).

2. If multiple identical transcript_id exist, transcript_id will be appended a string like "_DUP#" to separate.

'''



import sys;

import re;



if len(sys.argv)<2:

print('This script converts .GTF into .BED annotations.\n');

print('Usage: gtf2bed {OPTIONS} [.GTF file]\n');

print('Options:');

print('-c color\tSpecify the color of the track. This is a RGB value represented as "r,g,b". Default 255,0,0 (red)');

print('\nNote:');

print('1\tOnly "exon" and "transcript" are recognized in the feature field (3rd field).');

print('2\tIn the attribute list of .GTF file, the script tries to find "gene_id", "transcript_id" and "FPKM" attribute, and convert them as name and score field in .BED file.');

print('Author: Wei Li (li.david.wei AT gmail.com)');

sys.exit();



color='255,0,0'





for i in range(len(sys.argv)):

if sys.argv[i]=='-c':

color=sys.argv[i+1];





allids={};



def printbedline(estart,eend,field,nline):

try:

estp=estart[0]-1;

eedp=eend[-1];

# use regular expression to get transcript_id, gene_id and expression level

geneid=re.findall(r'gene_id \"([\w\.]+)\"',field[8])

transid=re.findall(r'transcript_id \"([\w\.]+)\"',field[8])

fpkmval=re.findall(r'FPKM \"([\d\.]+)\"',field[8])

if len(geneid)==0:

print('Warning: no gene_id field ',file=sys.stderr);

else:

geneid=geneid[0];

if len(transid)==0:

print('Warning: no transcript_id field',file=sys.stderr);

transid='Trans_'+str(nline);

else:

transid=transid[0];

if transid in allids.keys():

transid2=transid+'_DUP'+str(allids[transid]);

allids[transid]=allids[transid]+1;

transid=transid2;

else:

allids[transid]=1;

if len(fpkmval)==0:

#print('Warning: no FPKM field',file=sys.stderr);

fpkmval='100';

else:

fpkmval=fpkmval[0];

fpkmint=round(float(fpkmval));

print(field[0]+'\t'+str(estp)+'\t'+str(eedp)+'\t'+transid+'\t'+str(fpkmint)+'\t'+field[6]+'\t'+str(estp)+'\t'+str(eedp)+'\t'+color+'\t'+str(len(estart))+'\t',end='');

seglen=[eend[i]-estart[i]+1 for i in range(len(estart))];

segstart=[estart[i]-estart[0] for i in range(len(estart))];

strl=str(seglen[0]);

for i in range(1,len(seglen)):

strl+=','+str(seglen[i]);

strs=str(segstart[0]);

for i in range(1,len(segstart)):

strs+=','+str(segstart[i]);

print(strl+'\t'+strs);

except ValueError:

print('Error: non-number fields at line '+str(nline),file=sys.stderr);













estart=[];

eend=[];

# read lines one to one

nline=0;

prevfield=[];

prevtransid='';

for lines in open(sys.argv[-1]):

field=lines.strip().split('\t');

nline=nline+1;

if len(field)<9:

print('Error: the GTF should has at least 9 fields at line '+str(nline),file=sys.stderr);

continue;

if field[1]!='Cufflinks':

pass;

#print('Warning: the second field is expected to be \'Cufflinks\' at line '+str(nline),file=sys.stderr);

if field[2]!='exon' and field[2] !='transcript':

#print('Error: the third filed is expected to be \'exon\' or \'transcript\' at line '+str(nline),file=sys.stderr);

continue;

transid=re.findall(r'transcript_id \"([\w\.]+)\"',field[8]);

if len(transid)>0:

transid=transid[0];

else:

transid='';

if field[2]=='transcript' or (prevtransid != '' and transid!='' and transid != prevtransid):

#print('prev:'+prevtransid+', current:'+transid);

# A new transcript record, write

if len(estart)!=0:

printbedline(estart,eend,prevfield,nline);

estart=[];

eend=[];

prevfield=field;

prevtransid=transid;

if field[2]=='exon':

try:

est=int(field[3]);

eed=int(field[4]);

estart+=[est];

eend+=[eed];

except ValueError:

print('Error: non-number fields at line '+str(nline),file=sys.stderr);

# the last record

if len(estart)!=0:

printbedline(estart,eend,field,nline);
ADD REPLY
3
Entering edit mode
11.9 years ago
sjneph ▴ 690

Take a look at gtf2bed. As with all conversion utilities available in BEDOPS, there is no loss of information in the conversions to a BED-like format supported by that tool suite.

ADD COMMENT
2
Entering edit mode
13.2 years ago

The UCSC source tree has utilities for gtfToGenePred (convert a GTF file to a genePred) and genePredToPsl (Program to create fake psl alignments from genePred records). The latter can output BED with the -bedFormat option. These (like all source tree utilities) can be piped together do the conversion you are looking for. The source tree and instructions for compilation are here: http://hgdownload.cse.ucsc.edu/admin/jksrc.zip

ADD COMMENT
2
Entering edit mode
9.1 years ago
Czh3 ▴ 190

This script may help you.

https://github.com/Czh3/NGSTools/blob/master/script/gtf2bed12.sh

ADD COMMENT
1
Entering edit mode

Hi,

just a note that the script is a really nice idea based on UCSC tools. However, currently it's not converting to the correct format.

Your Column 11: exon start site relative to chromosome start

Bed12 Column 11: exon length

Your Column 12: exon end site relative to chromosome start

Bed12 Column 12: exon start site relative to feature start

Best wishes, Jakub

ADD REPLY
0
Entering edit mode

Thank you for pointing this out! Furthermore, there is also an UCSC tool genePredToBed: GitHub link. Thus you could use the output of gtfToGenePred as input of genePredToBed. Something like

gtfToGenePred input.gtf output.tmp; genePredToBed output.tmp output.bed; rm output.tmp

ADD REPLY
1
Entering edit mode
13.2 years ago
Gjain 5.8k

The gencode V9 GTF format is similar to their V3 format.

this link should help you with everything what you are looking for.

http://www.sanger.ac.uk/resources/databases/encode/gencodeformat.html

hope this helps.

ADD COMMENT
0
Entering edit mode

Thanks for the link, now I understand the fields meaning, but I asked for some script to convert this format to BED12. Well, I think I have to write it myself...

ADD REPLY
0
Entering edit mode

yup. all you need to do is parse the fields and then just rearrange it for the format you want. let me know if you need help in that.

ADD REPLY
1
Entering edit mode
12.9 years ago
Statler ▴ 10

Actually, use the ucsc table browser in the Preview version of the genome browser, you select the genes/gene prediction table, then the appropriate dataset.

Select to download as bed format and you are done.

ADD COMMENT
0
Entering edit mode

Since not much time ago I'm using the Preview version of the genome browser... I didn't know about it when I did this question!

ADD REPLY
0
Entering edit mode
13.2 years ago
Anna ▴ 140

hi, I am having the same problem to convert from gff to bed12 - since the latter is a requirement for coverageBed -split. any ideas? or should I write my own thing?

thanks!

Anna

ADD COMMENT
2
Entering edit mode

Sure, take a look at gff2bed, and the bedmap program from the same tool suite. No reason to convert to bed12 - the gff2bed script will fully convert all information for you to a BED-like format that bedmap can use to give you the coverage information you want.

ADD REPLY
0
Entering edit mode

@anna, you should delete this answer and re-post as a new question.

ADD REPLY
0
Entering edit mode

Could you please tell in more detail what problems you are facing? I have a script i wrote which does the same.

ADD REPLY

Login before adding your answer.

Traffic: 3314 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6