I'm creating a series of reports based off the Ensembl FTP (GTG) download (Homo_sapiens.GRCh38.89.gtf.gz).
I have a script that allows me to create a report with the number of features, total size of features, and average size of features (for each feature type), my script is below:
$ gunzip -c Homo_sapiens.GRCh38.89.gtf.gz |grep -v ^#|awk 'BEGIN{print "Number\tTotal\tAverage"} $3~/gene/{x=x+($5-$4);n++} END{print n"\t"x"\t"x/n}'
Number Total Average 58233 1775475403 30489.2
However, now I am trying to write a report that identifies the longest gene, the longest CDS within the Homo_sapiens.GRCh38.89.gtf.gz file. I know I'll feel embarrassed once I see how it is written but I'm really stuck. Any help with creating a script in bash is much appreciated.
Do you want the longest transcript in terms of DNA length or RNA length? There's likely a large difference between the two. Note that you should just use something like python to make your life easier.
Longest gene:
Output (gene symbol and length):
Gene with longest transcript and exons within transcript:
output (by gene symbol, transcripts and number of exons in each transcript)
datamash is gnu statistics library and available in all linux distro repos.
Many thanks, as you can see in my post below, I also came up with the 'CNTNAP2' as the longest gene.
Very respectfully,