Question

Collecting 3' UTR and neighboring exonic sequence from GENCODE GTF files


8.6 years ago

mdwain.uw • 0

Hi,

I performed RNA sequencing using a poly(A) 3' tagging/sequencing approach, so I expect the sensible reads to map only to the 3' end of the transcripts in my sample. I want to subset the GENCODE gene definitions to include only the 3' UTR plus 1 kb of neighboring exonic sequence. What is the best way to do this? I had the following ideas:

1) "grep" the GENCODE definition files for "UTR" lines, then find exons whose coordinates are immediately adjacent, and keep going backwards (to get more neighboring exons) until I get my 1 kb.

2) "grep" the GENCODE files for "stop_codon" lines, then keep collecting exons whose coordinates are immediately adjacent to the "stop_codon" coordinates, until I get my 1 kb.

3) Find the "transcript" lines of the GENCODE file, try to match them to the "UTR" lines, then select the last 1 kb of the "transcript" definitions (and add on the coordinates for the UTR).

Besides figuring out the best way to get these 3' end coordinates, I also have the following questions:

1) Should all transcript definitions have a "UTR" line in the GENCODE definition files?

2) Should all "UTR" definitions have adjacent "exons"?

Thanks!

GENCODE GTF UTR • 2.9k views

ADD COMMENT • link updated 4.7 years ago by Biostar 20 • written 8.6 years ago by mdwain.uw • 0


My experience is that UTR annotation is not as good as you would hope. Are you using Lexogen Quantseq by any chance?

ADD REPLY • link 8.6 years ago by WouterDeCoster 47k


Nope, new tech. What would you suggest?

ADD REPLY • link 8.6 years ago by mdwain.uw • 0


Nothing conclusive yet, but I'm trying things like extending my UTR sequences (1kb) starting from the stop codon... (my sequencing is stranded so that's quite safe).

ADD REPLY • link 8.6 years ago by WouterDeCoster 47k

score 0 · Answer 1 · 2016-04-18

Not sure about the best tactics, but I can tell you about GENCODE genes:

1) should all transcript definitions have a "UTR" line in the GENCODE definition files?

No. UTRs are only annotated if there is evidence for them for that transcript. Many transcripts are annotated based only on protein data, so they have no UTRs. There are also loads of non-coding transcripts, which of course have no UTRs.

2) should all "UTR" definitions have adjacent "exons"?

The UTRs are part of the exons. If there is a UTR, it should have an adjacent CDS.
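
If you want to check point 1 on the release you are using, something like the following should do. It is a rough sketch that assumes the standard 9-column GTF layout with a `transcript_id "..."` attribute and a feature type literally called "UTR" in column 3 (check your file, the name may differ between releases); the helper names and the example lines are made up for illustration.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def features_by_transcript(gtf_lines):
    """Map each transcript_id to the set of feature types (column 3) seen for it."""
    seen = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = TRANSCRIPT_ID.search(fields[8])
        if match:  # "gene" lines carry no transcript_id
            seen[match.group(1)].add(fields[2])
    return seen


def transcripts_without(seen, feature="UTR"):
    """Sorted transcript ids that never appear with `feature`."""
    return sorted(tid for tid, feats in seen.items() if feature not in feats)


# Tiny made-up example; for a real file, pass an open file handle instead.
example = [
    "##description: toy example\n",
    'chr1\tX\tgene\t1\t900\t.\t+\t.\tgene_id "G1";\n',
    'chr1\tX\ttranscript\t1\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tX\texon\t1\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tX\tUTR\t700\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tX\ttranscript\t1\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\n',
    'chr1\tX\texon\t1\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\n',
]
seen = features_by_transcript(example)
missing = transcripts_without(seen)
print(missing)
```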