Locating longest gene, CDS in Ensembl GTF
2
2
Entering edit mode
7.4 years ago
oars ▴ 200

I'm creating a series of reports based off the Ensembl FTP (GTG) download (Homo_sapiens.GRCh38.89.gtf.gz).

I have a script that allows me to create a report with the number of features, total size of features, and average size of features (for each feature type), my script is below:

$ gunzip -c Homo_sapiens.GRCh38.89.gtf.gz |grep -v ^#|awk 'BEGIN{print "Number\tTotal\tAverage"} $3~/gene/{x=x+($5-$4);n++} END{print n"\t"x"\t"x/n}'


Number Total Average
 58233 1775475403 30489.2

However, now I am trying to write a report that identifies the longest gene, the longest CDS within the Homo_sapiens.GRCh38.89.gtf.gz file. I know I'll feel embarrassed once I see how it is written but I'm really stuck. Any help with creating a script in bash is much appreciated.

Ensembl GTF longest gene CDS • 3.3k views
ADD COMMENT
1
Entering edit mode

Do you want the longest transcript in terms of DNA length or RNA length? There's likely a large difference between the two. Note that you should just use something like python to make your life easier.

ADD REPLY
1
Entering edit mode

Longest gene:

 gzip -dc Homo_sapiens.GRCh38.86.gtf.gz | awk 'OFS="\t" {if ($3=="gene") {print $14,($5-$4),$3,$2}}' | tr -d '";' | datamash -s -g 1 max 2 | sort -nr -k2 | head

Output (gene symbol and length):

CNTNAP2 2304996
PTPRD   2298477
DMD 2241764
DLG2    2172910
CSMD1   2059619
MACROD2 2057828
EYS 1987245
LRP1B   1900278
PCDH15  1825171
CTNNA3  1783651

Gene with longest transcript and exons within transcript:

gzip -dc Homo_sapiens.GRCh38.86.gtf.gz | awk 'OFS="\t" {if ($3=="exon") {print $20,$14,$3,$22}}' | tr -d '";"'| grep -w "ensembl" | datamash -g 1,2 count 3 | sort -k3 -nr | head

output (by gene symbol, transcripts and number of exons in each transcript)

TTN ENST00000615779 312
TTN ENST00000359218 191
TTN ENST00000342175 191
WDFY4   ENST00000325239 61
AC026348.1  ENST00000393845 34
AC096949.1  ENST00000361522 32
AC005338.1  ENST00000353284 29
AC013264.1  ENST00000282272 28
AC090094.1  ENST00000541428 25
AL356053.1  ENST00000375025 23

datamash is gnu statistics library and available in all linux distro repos.

ADD REPLY
0
Entering edit mode

Many thanks, as you can see in my post below, I also came up with the 'CNTNAP2' as the longest gene.

Very respectfully,

ADD REPLY
2
Entering edit mode
7.4 years ago
aka001 ▴ 190

Very pragmatic approach that will do what you wanted:

zcat Homo_sapiens.GRCh38.89.gtf.gz | awk -F'\t' -v OFS='\t' '$3~/gene/{print $0, $5-$4}' | sort -t $'\t' -k10,10nr | head -n 1

You can change 'gene' to 'CDS' to get the longest CDS and modify as you like for your report.

ADD COMMENT
1
Entering edit mode

Thank you for this. This is what I used for my command:

gunzip -c /Homo_sapiens.GRCh38.89.gtf.gz' |awk -F'\t' -v OFS='\t' '$3~/gene/{print $0, $5-$4}'|sort -t $'\t' -k10,10nr | head -n 1

AND THE OUTPUT 7 ensembl_havana gene 146116002 148420998 . + . gene_id "ENSG00000174469"; gene_version "19"; gene_name "CNTNAP2"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; 2304996

I think I understand most of your code, below is my understanding:

zcat is basically same as gunzip --c

awk -F'\t' -v OFS ==> split files on tabs, OFS is for output field

$3~/gene/ ==> 3rd column number (i.e., feature type), we specify /gene/

{print $0, $5-$4} ==> $0 is for the entire input line, $5 is the end cord. position, $4 is the start cord. position

sort -t ==> sort considering a delimiter (in this case a pipe)

$'\t' -k10,10nr | ==> I'm really lost on this section. I believe the \t' is to split files on tabs? The nr is for number of rows?

head -n 1 ==> head outputting text?

How did I do?

ADD REPLY
0
Entering edit mode

You got most of the concepts right.

As you said, -t flag in sort command is used to specify the delimiter, but in this case it's not a pipe, but rather a $'\t', which would be considered as tab (since gtf file has some characters that can be considered as delimiters). For -nr flag, it's basically telling us to sort numerically (-n) and reverse it (-r). Head -n 1 is giving you the first line of the output (which in this case has the longest gene, i.e. having the largest $5-$4 result) and if you want for example take 20 longest genes, you can just specify -n 20 instead.

You can also add cut -f9 to get only the 9th column and then extract it as you want (i.e. if you want to take the gene_id instead of the gene_name).

ADD REPLY
0
Entering edit mode
7.4 years ago
oars ▴ 200

Awesome, thank you for this explanation! I'm also attempting to locate 'exon’ and ‘CDS’ features with exactly the same coordinates. I know what I'm looking for but cannot get the command to execute properly. In summary, I want to retrieve from field 3 ($3) the exon and CDS feature-types with the same coordinates, these are found in column 4 (start location) and column 5 (end location), syntax is $4 and $5. My hypothetically code looks something like this:

gunzip -c Homo_sapiens.GRCh38.80.gtf.gz |grep -v ^#|awk 'BEGIN{print '$3~/exon/CDS/{x=("exon" $4 == "CDS" $4) | x=("exon" $5 == "CDS" $5)}'

But of course this will not execute, do you know why?

ADD COMMENT
0
Entering edit mode

I think you should have put this as a follow up comment or even a new question.

Nonetheless, for that problem I would actually do a very pragmatic move, i.e. extract all the CDS and the exons to two separate files, then compare the 4th and 5th columns of the two files.

You can then just use this command to get the matches:

awk 'NR==FNR{a[$4,$5];next} ($4,$5) in a' file_1 file_2
ADD REPLY
0
Entering edit mode

Thanks again. After reading your feedback and thinking more about it I came up with the following solution:

$ gunzip -c /Homo_sapiens.GRCh38.89.gtf.gz' |grep -v ^#|awk '$3~/exon/{print$0}'|cut -f4|sort > output_exon.txt

$ gunzip -c /Homo_sapiens.GRCh38.89.gtf.gz' |grep -v ^#|awk '$3~/CDS/{print$0}'|cut -f4|sort > output_CDS.txt

$ cat output_exon.txt output_CDS.txt|sort|uniq -d|wc -l

260396

$ gunzip -c /Homo_sapiens.GRCh38.89.gtf.gz' |grep -v ^#|awk '$3~/exon/{print$0}'|cut -f5|sort > output_exon5.txt

$ gunzip -c /Homo_sapiens.GRCh38.89.gtf.gz' |grep -v ^#|awk '$3~/CDS/{print$0}'|cut -f5|sort > output_CDS5.txt

$ cat output_exon5.txt output_CDS5.txt|sort|uniq -d|wc -l

259588

ADD REPLY
0
Entering edit mode

grep -v is not necessary.

ADD REPLY
0
Entering edit mode

Good catch - I'm guessing this is because I am only "cutting" column 4 (or 5) thereby making the header irrelevant?

ADD REPLY
0
Entering edit mode

Yes. Further even, awk is not necessary as it is extracting all rows with word "exon" in it and zgrep can do it.. A shorter code is as below compared to code posted above: ( I am not sure if it is any faster):

Code pasted from above (except for input):

 $ gunzip -c Homo_sapiens.GRCh38.86.gtf.gz |grep -v ^#|awk '$3~/exon/{print$0}'|cut -f5|sort | head -2

Output:

100000739
100000996

shorter code:

$ zgrep -i exon Homo_sapiens.GRCh38.86.gtf.gz | cut -f5 | sort | head -2

output:

100000739
100000996
ADD REPLY
0
Entering edit mode

I don't get it why you only cut the col 4 or 5? I thought you wanted to do:

I'm also attempting to locate 'exon’ and ‘CDS’ features with exactly the same coordinates.

In summary, I want to retrieve from field 3 ($3) the exon and CDS feature-types with the same coordinates

If you only extract the coordinates without any other information (i.e. gene name or id), then how would you know which 'exon' or which 'CDS' are they?

ADD REPLY
0
Entering edit mode

I might have this wrong but in my thinking the $3~/exon/{print$0} and $3~/CDS/{print$0} coupled with the cut -f4 (and separately cut -f5) would provide me lists of star/stop cords for exons and CDS. The I used the...

cat output_exon.txt output_CDS.txt|sort|uniq -d|wc -l

...to find coordinate matches. It seems like this was an unwise choice?

ADD REPLY
0
Entering edit mode

Right, so instead of knowing what are those features (like you stated before), that is the id or name of those exons/CDSs, you just wanted to know how many of them? If it's so, you're good then.

However, I still think that probably there are some exons that have the same start (or end) coordinates but not the end (or start) coordinates, like in the case of alternative splicing. In that case, I don't know if you can just divide your number by two (like you assume that there are only one exon and one CDS that are matched).

ADD REPLY

Login before adding your answer.

Traffic: 1333 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6