5.5 years ago
waqaskhokhar999
I want to count the multi-exonic genes in a StringTie-assembled GTF file of the Arabidopsis genome. For example, the transcript with transcript_id "MSTRG.1.2" of the gene with gene_id "MSTRG.1" contains 6 exons (exon_number "1" through exon_number "6"), while the transcript with transcript_id "MSTRG.2.1" of the gene with gene_id "MSTRG.2" contains only 1 exon (exon_number "1"). The output should look like this:
gene_id t_name num_exons
MSTRG.1 MSTRG.1.2 6
MSTRG.1 MSTRG.1.3 5
MSTRG.2 MSTRG.2.1 1
I have checked this link, but the GTF file format used there is different.
Sample input:
1 StringTie transcript 3651 5899 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.2";
1 StringTie exon 3651 3913 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "1";
1 StringTie exon 3996 4276 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "2";
1 StringTie exon 4506 4605 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "3";
1 StringTie exon 4706 5095 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "4";
1 StringTie exon 5174 5326 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "5";
1 StringTie exon 5439 5899 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "6";
1 StringTie transcript 3657 5899 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.3";
1 StringTie exon 3657 3913 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "1";
1 StringTie exon 3996 4276 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "2";
1 StringTie exon 4486 5095 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "3";
1 StringTie exon 5174 5326 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "4";
1 StringTie exon 5439 5899 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "5";
1 StringTie transcript 15498 15756 1000 . . gene_id "MSTRG.2"; transcript_id "MSTRG.2.1";
1 StringTie exon 15498 15756 1000 . . gene_id "MSTRG.2"; transcript_id "MSTRG.2.1"; exon_number "1";
1 StringTie transcript 6788 11170 1000 - . gene_id "MSTRG.3"; transcript_id "MSTRG.3.1";
Your question is not clear. FYI, most genes (in human) are multi-exonic. Could you clarify your question, please?
What you have to do is isolate the gene_id and transcript_id part of each exon line, e.g. using awk, and then count the occurrences, e.g. using uniq -c. I strongly suggest you try to solve this yourself using Google, as this really improves an essential skill in bioinformatics: data sanitation.
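The same isolate-and-count idea can also be sketched in Python instead of awk and uniq -c. This is only a minimal sketch under some assumptions: the function names count_exons and multi_exonic_genes are made up for illustration, the first eight GTF columns are assumed to contain no whitespace, and a gene is treated as multi-exonic if at least one of its transcripts has more than one exon.

```python
import re
from collections import Counter

# Patterns for the two attributes we need from the 9th GTF column.
GENE_ID = re.compile(r'gene_id "([^"]+)"')
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def count_exons(gtf_lines):
    """Return [(gene_id, transcript_id, num_exons), ...] in order of first appearance."""
    counts = Counter()
    for line in gtf_lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        # Split on any whitespace, at most 8 times, so the attribute
        # column stays in one piece (assumes columns 1-8 have no spaces).
        fields = line.rstrip("\n").split(None, 8)
        # Only "exon" features are counted; "transcript" lines are ignored.
        if len(fields) < 9 or fields[2] != "exon":
            continue
        gene = GENE_ID.search(fields[8])
        transcript = TRANSCRIPT_ID.search(fields[8])
        if gene and transcript:
            counts[(gene.group(1), transcript.group(1))] += 1
    return [(g, t, n) for (g, t), n in counts.items()]


def multi_exonic_genes(rows):
    """Return the set of gene_ids that have at least one transcript with >1 exon."""
    return {g for g, _t, n in rows if n > 1}


# Example usage (the file name is hypothetical):
# with open("stringtie.gtf") as handle:
#     rows = count_exons(handle)
# print("gene_id", "t_name", "num_exons")
# for gene_id, t_name, num_exons in rows:
#     print(gene_id, t_name, num_exons)
# print("multi-exonic genes:", len(multi_exonic_genes(rows)))
```

Counting per transcript first and collapsing to genes afterwards keeps both the requested table and the final gene count available from a single pass over the file.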