A simple Perl one-liner that counts exon lines per gene and prints the genes with only one:
perl -lne 'if (/.+\texon\t.+gene_id "([^"]+)/) { $g{$1}++ }
END { foreach $i (sort keys %g) { print $i unless $g{$i} > 1 } }
' gencode.v26.annotation.gtf \
> intronless.txt
If you want to make sure that the code works, you can write both the intermediate result (exoncounts.txt) and the final result (intronless.txt), so that the exon counts can be checked manually:
perl -lne 'if (/.+\texon\t.+gene_id "([^"]+)/) { $g{$1}++ }
END { foreach $i (sort keys %g) { print "$i\t$g{$i}" } }
' gencode.v26.annotation.gtf | tee exoncounts.txt \
| perl -lane 'print $F[0] if $F[1]==1' > intronless.txt
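One way to do that manual check is to count the exon lines of a single gene independently and compare the number with its row in the counts file. A minimal sketch on a made-up file (check.gtf, check_counts.txt and the gene names are hypothetical, not part of the GENCODE data):

```shell
# Hypothetical toy file: geneA has two exon records, geneB has one.
printf 'chr1\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id "geneA";\n'  > check.gtf
printf 'chr1\tHAVANA\texon\t300\t400\t.\t+\t.\tgene_id "geneA";\n' >> check.gtf
printf 'chr2\tHAVANA\texon\t500\t900\t.\t-\t.\tgene_id "geneB";\n' >> check.gtf

# Same counting step as above: one line per gene with ID and exon count.
perl -lne 'if (/.+\texon\t.+gene_id "([^"]+)/) { $g{$1}++ }
END { foreach $i (sort keys %g) { print "$i\t$g{$i}" } }
' check.gtf > check_counts.txt

# Independent count of exon lines for one gene ...
awk -F'\t' '$3 == "exon" && $9 ~ /gene_id "geneA"/' check.gtf | wc -l
# ... to compare with that gene's row in the counts file.
grep -w geneA check_counts.txt
```

On the real data the same comparison would use gencode.v26.annotation.gtf and exoncounts.txt with an actual gene ID in place of geneA.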
Taking the same exon counting rationale into this grep | cut | sort | uniq | perl combination is even faster:
grep exon gencode.v26.annotation.gtf | cut -d'"' -f2 | sort | uniq -c \
| tee exoncounts.txt | perl -lane '$F[0] == 1 and print $F[1]' > intronless.txt
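Since grep exon matches the word anywhere on the line, it may be worth checking which feature types it actually lets through before trusting the counts. A minimal sketch on a made-up two-line file (sample.gtf and its contents are hypothetical; the exon_number attribute on the CDS line is an assumption about what such records look like):

```shell
# Hypothetical sample: one exon record and one CDS record whose
# attribute column also contains the word "exon" (exon_number).
printf 'chr1\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; exon_number 1;\n'  > sample.gtf
printf 'chr1\tHAVANA\tCDS\t120\t180\t.\t+\t0\tgene_id "geneA"; exon_number 1;\n' >> sample.gtf

# Tabulate the feature type (column 3) of every line that grep keeps.
grep exon sample.gtf | cut -f3 | sort | uniq -c > feature_types.txt
cat feature_types.txt
```

If anything other than exon shows up in that table, the plain grep is counting lines that are not exons. On the real annotation the same tabulation would be grep exon gencode.v26.annotation.gtf | cut -f3 | sort | uniq -c.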
Important note: I just realized that CDS lines also get through the previous grep, and that HAVANA and ENSEMBL annotations may be redundant (so the same exon could be counted twice). The code should account for both issues in order to generate the proper output:
awk '{ if ($3=="exon") print $1, $4, $5, $10 }' gencode.v26.annotation.gtf | sort -u \
| cut -d'"' -f2 | sort | uniq -c | perl -lane '$F[0] == 1 and print $F[1]' > intronless.txt
Explanation: awk selects exon lines and prints only chromosome, start, end and geneid, sort -u collapses redundant exons, cut -d'"' -f2 reduces output to geneids only, sort | uniq -c collapses same geneids while counting them, and perl prints geneids containing 1 exon only.
Counting exons, smart one!
Thank you. Love considering different approaches to the same issue.
Lovely approach Jorge, many thanks!
If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.