10.2 years ago
jack
Hi,
I need to parse the human GTF file for my work. I downloaded it from Ensembl.
Basically, I don't know what the number "1" at the beginning of some lines means.
Also, if one gene codes for more than one transcript, how can I find them?
Here are the first few lines of it:
1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript";
1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002234944"; exon_version "1";
1 havana exon 12613 12721 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00003582793"; exon_version "1";
I see, but it seems a bit messy. I want to read this file to create a table with three columns: gene name, corresponding transcript, and transcription start site. But I can't make sense of this file.
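For reference, here is a minimal sketch of how such a three-column table could be built directly from the GTF. In GTF, the first column is the sequence (chromosome) name, so the leading "1" in the lines above means chromosome 1. The sketch assumes the file on disk is tab-delimited (the lines pasted above are shown with spaces) and takes the transcription start site to be the start coordinate on the "+" strand and the end coordinate on the "-" strand. The function names are made up for illustration.

```python
import re

# Matches attribute pairs such as: gene_id "ENSG00000223972";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(field):
    """Turn the ninth GTF column into a dict of attribute name -> value."""
    return dict(ATTR_RE.findall(field))


def transcript_tss_rows(lines):
    """Return (gene name, transcript ID, TSS) for every 'transcript' line."""
    rows = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue  # only transcript features carry one row per transcript
        start, end, strand = int(cols[3]), int(cols[4]), cols[6]
        attrs = parse_attributes(cols[8])
        # Assumption: TSS is the 5' end of the transcript on its own strand.
        tss = start if strand == "+" else end
        gene = attrs.get("gene_name", attrs.get("gene_id"))
        rows.append((gene, attrs.get("transcript_id"), tss))
    return rows


if __name__ == "__main__":
    example = [
        '1\thavana\ttranscript\t11869\t14409\t.\t+\t.\t'
        'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
        'gene_name "DDX11L1";',
    ]
    for row in transcript_tss_rows(example):
        print("\t".join(str(value) for value in row))
```

A gene with several transcripts simply produces several rows that share the same gene name, which also answers the question of how to find all transcripts of one gene.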
Welcome to bioinformatics - we feel your pain, this is your support group
There's no reason to bother parsing a GTF file for that, just use Biomart. Just click on the results on that link and save them to a file, since I already created the query for you.
N.B., that link in my last comment had the gene_id rather than the name. Just switch that in the "attributes" section and then get the results of that. Mea culpa!
I got it. Thanks. The other thing I'm looking for is a way to add the 3' UTR sequence of each transcript to the file that I export from Biomart. How can I do this? I looked through the attributes, but I couldn't find such an option.
I found it, and done :-)
I should have refreshed before replying :)
When you extract sequences, you get the results in a fasta file. The other attributes you asked for (e.g., transcript ID) are placed in the header line of each record, with multiple attributes separated by a pipe ("|"). That's convenient enough to parse (really easy in Biopython) if you really do want things in columns.
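As a rough illustration of that parsing step, here is a small sketch that uses only the Python standard library (Biopython's `SeqIO.parse` would do the record reading for you). The order of the fields in the header depends on which attributes were selected in Biomart, so the `field_names` list is an assumption to adjust, and the sequences in the example are invented.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_rows(lines, field_names):
    """Split each pipe-delimited header into named columns plus the sequence."""
    for header, sequence in read_fasta(lines):
        row = dict(zip(field_names, header.split("|")))
        row["sequence"] = sequence
        yield row


if __name__ == "__main__":
    # Invented example records; real headers follow the attribute order
    # chosen in the Biomart query.
    example = [
        ">DDX11L1|ENST00000456328",
        "ACGTACGT",
        "TTGA",
    ]
    for row in fasta_to_rows(example, ["gene_name", "transcript_id"]):
        print(row["gene_name"], row["transcript_id"], row["sequence"], sep="\t")
```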
I see, but how can I find the transcripts of a given gene that have different 3' UTR sequences? For example, if a given gene encodes 4 transcripts and 2 of them have different 3' UTR sequences, I want to group together those two transcripts that differ in their 3' UTR. I want to do this for the whole transcriptome.
Either post-process the biomart output to aggregate sequences by gene and compare them or go back to parsing the GTF file and compare the coordinates of everything past the stop codon (be sure to take strand into account!). I expect that the former is easier.
What you say is too general. The point is, how can I compare the sequences? Also, if the transcripts of a given gene have different coordinates, does that necessarily imply that they have different 3' UTR sequences?
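One way to make the post-processing suggestion concrete is sketched below: group the exported records by gene, bucket the transcripts of each gene by their exact 3' UTR sequence, and keep only the genes that end up with more than one bucket. Comparing the UTR sequences themselves sidesteps the coordinate question, since two transcripts can differ in their overall coordinates (for example in their 5' exons) and still share an identical 3' UTR. The function name and record layout are hypothetical, and the "sequence unavailable" placeholder for transcripts without a UTR is an assumption about the export that should be checked against the actual file.

```python
from collections import defaultdict


def genes_with_distinct_utrs(records, placeholder="sequence unavailable"):
    """Group transcripts per gene by 3' UTR sequence.

    records: iterable of (gene, transcript, utr_sequence) tuples.
    Returns {gene: [[transcripts sharing UTR 1], [transcripts sharing UTR 2], ...]}
    for genes that have at least two different 3' UTR sequences.
    """
    by_gene = defaultdict(lambda: defaultdict(list))
    for gene, transcript, utr in records:
        # Assumption: records without a UTR are empty or carry a placeholder.
        if not utr or utr.lower().startswith(placeholder):
            continue
        by_gene[gene][utr.upper()].append(transcript)
    return {
        gene: sorted(sorted(group) for group in groups.values())
        for gene, groups in by_gene.items()
        if len(groups) > 1
    }


if __name__ == "__main__":
    # Invented gene names, transcript names, and sequences.
    example = [
        ("geneA", "tx1", "AATAAA"),
        ("geneA", "tx2", "AATAAA"),
        ("geneA", "tx3", "AATAAAGG"),
        ("geneB", "tx4", "CCC"),
        ("geneB", "tx5", "CCC"),
    ]
    for gene, groups in genes_with_distinct_utrs(example).items():
        print(gene, groups)
```

Exact string comparison treats a UTR that is merely a longer or shorter version of another as different; whether that counts as "different" for your purpose is a choice the sketch does not make for you.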