Intronless genes in the human genome annotation
4.0 years ago

Hi everyone,

I was wondering whether anyone knows of an annotation term in the human genome annotation (e.g. from GENCODE or Ensembl) that would let me extract intronless genes and separate them from genes containing introns.

There is probably an automated way to extract intronless genes from the exon annotations in the GTF files. Any initial thoughts on this would be much appreciated.

Thanks in advance,

Sergio

intronless genes hg38 GRCh38 annotation
4.0 years ago

A simple Perl one-liner that counts exon lines per gene and prints the genes with exactly one:

perl -lne 'if (/.+\texon\t.+gene_id "([^"]+)/) { $g{$1}++ }
END { foreach $i (sort keys %g) { print $i unless $g{$i} > 1 } }
' gencode.v26.annotation.gtf \
> intronless.txt

To verify that the code works, you can write both the intermediate results (exoncounts.txt) and the final results (intronless.txt), then check the exon counts manually:

perl -lne 'if (/.+\texon\t.+gene_id "([^"]+)/) { $g{$1}++ }
END { foreach $i (sort keys %g) { print "$i\t$g{$i}" } }
' gencode.v26.annotation.gtf | tee exoncounts.txt \
| perl -lane 'print $F[0] if $F[1]==1' > intronless.txt

Applying the same exon-counting rationale with this grep | cut | sort | uniq | perl combination is even faster:

grep exon gencode.v26.annotation.gtf | cut -d'"' -f2 | sort | uniq -c \
| tee exoncounts.txt | perl -lane '$F[0] == 1 and print $F[1]' > intronless.txt

Important note: I just realized that CDS lines also pass the grep exon filter (in the GENCODE GTF their attribute column carries fields such as exon_number and exon_id, so they match the string "exon"), and that HAVANA and ENSEMBL annotations may be redundant, so the same exon could be counted twice. The code should account for both issues in order to generate the proper output:

awk '{ if ($3=="exon") print $1, $4, $5, $10 }' gencode.v26.annotation.gtf | sort -u \
| cut -d'"' -f2 | sort | uniq -c | perl -lane '$F[0] == 1 and print $F[1]' > intronless.txt

Explanation: awk selects the exon lines and prints only the chromosome, start, end and gene ID; sort -u collapses redundant exons; cut -d'"' -f2 reduces the output to gene IDs only; sort | uniq -c counts the occurrences of each gene ID; and perl prints the gene IDs that have exactly one exon.
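
The same steps can also be folded into a single awk pass, which may be easier to test on a small file before running it on the full annotation. This is only a sketch, not part of the original answer: the function name intronless_genes and the toy GTF (made-up IDs and coordinates) are assumptions for illustration. It parses gene_id out of column 9 rather than relying on it being field $10, and it counts each (chromosome, start, end) interval once per gene.

```shell
# Hypothetical helper (not from the original answer): print the gene_ids whose
# exon lines collapse to exactly one unique (chrom, start, end) interval.
intronless_genes() {
    awk -F'\t' '
        # Keep only true exon features (column 3), so CDS/UTR lines are ignored.
        $3 == "exon" && match($9, /gene_id "[^"]+"/) {
            # Skip the 9 characters of `gene_id "` and drop the closing quote.
            gene = substr($9, RSTART + 9, RLENGTH - 10)
            key = gene SUBSEP $1 SUBSEP $4 SUBSEP $5
            if (!(key in seen)) {   # count each interval once per gene
                seen[key] = 1
                n[gene]++
            }
        }
        END { for (g in n) if (n[g] == 1) print g }
    ' "$1" | sort
}

# Toy GTF with made-up IDs and coordinates: G1 has one exon annotated twice
# (HAVANA and ENSEMBL) plus a CDS line; G2 has two distinct exons.
toy=$(mktemp)
printf 'chr1\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' >  "$toy"
printf 'chr1\tENSEMBL\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\n' >> "$toy"
printf 'chr1\tHAVANA\tCDS\t120\t180\t.\t+\t0\tgene_id "G1"; transcript_id "T1";\n'  >> "$toy"
printf 'chr1\tHAVANA\texon\t500\t600\t.\t+\t.\tgene_id "G2"; transcript_id "T3";\n' >> "$toy"
printf 'chr1\tHAVANA\texon\t700\t800\t.\t+\t.\tgene_id "G2"; transcript_id "T3";\n' >> "$toy"

intronless_genes "$toy"   # prints G1
rm -f "$toy"
```

On the real file the call would be intronless_genes gencode.v26.annotation.gtf > intronless.txt. Like the pipeline above, this counts distinct exon intervals per gene, so a gene whose transcripts each have a single exon but with different coordinates would still be reported as having more than one exon.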

Counting exons, smart one!

Thank you. Love considering different approaches to the same issue.

Lovely approach Jorge, many thanks!

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.