number of exons
2.8 years ago
anba • 0

Hi,

I need to obtain a list with four columns:

gene_name genomic_length max_number_of_exons max_number_of_introns

I can't find this information in aggregated form. I need the maximum number of exons and introns for each gene.

I tried Ensembl and UCSC, but they only report the number of transcripts, whereas I need the genomic length and the number of exons/introns. Where can I obtain such a list?


Each gene has multiple transcripts that vary in length and/or number of exons, so the table you describe either does not exist or is inherently incorrect.


It seems my question was unclear. I need the genomic length of the gene (not of its transcripts) and the maximum number of exons/introns for each gene. I've corrected my question.

$ awk '$3=="exon" && $2 == "BestRefSeq" {print}'  GRCh38_latest_genomic.gtf| gffread -F --keep-exon-attrs --table "gene_id","transcript_id",@numexons | sort -k1,1 | head -20

A1BG    NM_130786.4 8
A1BG-AS1    NR_015380.2 4
A1CF    NM_001198818.2  14
A1CF    NM_001198819.2  15
A1CF    NM_001198820.2  14
A1CF    NM_001370130.1  12
A1CF    NM_001370131.1  12
A1CF    NM_014576.4 13
A1CF    NM_138932.3 13
A1CF    NM_138933.3 13
A2M NM_000014.6 36
A2M NM_001347423.2  37
A2M NM_001347424.2  36
A2M NM_001347425.2  35
A2M-AS1 NR_026971.1 3
A2M-AS1 NR_137424.1 2
A2M-AS1 NR_137425.1 2
A2ML1   NM_001282424.3  25
A2ML1   NM_144670.6 36
A2MP1   NR_040112.1 9

You can then take the per-gene maximum of the third column:

$ awk '$3=="exon" && $2 == "BestRefSeq" {print}'  GRCh38_latest_genomic.gtf| gffread -F --keep-exon-attrs --table "gene_id","transcript_id",@numexons | sort -k1,1 | datamash -sg1 max 3 | head

A1BG    8
A1BG-AS1    4
A1CF    15
A2M 37
A2M-AS1 3
A2ML1   36
A2MP1   9
A3GALT2 5
A4GALT  3
A4GNT   3
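
The commands above give the exon counts but not the genomic length. A minimal sketch for that part, assuming the GTF contains rows whose third column is "gene", with 1-based inclusive start/end in columns 4 and 5 and a gene_id attribute in column 9. The function name gene_lengths_from_gtf is made up for illustration, and the source column is deliberately not filtered because it may differ on gene rows:

```shell
# Hypothetical helper: print gene_id and genomic length for every "gene" row of a GTF.
# Assumes tab-separated GTF with 1-based inclusive coordinates in columns 4 and 5.
gene_lengths_from_gtf() {
    awk -F '\t' -v OFS='\t' '
    $0 !~ /^#/ && $3 == "gene" {
        # Pull the value out of: gene_id "VALUE";
        if (match($9, /gene_id "[^"]+"/)) {
            id = substr($9, RSTART + 9, RLENGTH - 10)
            print id, $5 - $4 + 1
        }
    }' "$@" | sort -k1,1
}
```

It could be run as `gene_lengths_from_gtf GRCh38_latest_genomic.gtf`. A gene annotated on more than one sequence (for example on alternate loci) would appear once per sequence, so the result may need a further per-gene reduction before joining it with the exon counts.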
2.8 years ago

Maximum number of exons per gene (in this UCSC table, column 13 is the gene symbol and column 9 the exon count):

$ wget -q  -O - "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/wgEncodeGencodeBasicV19.txt.gz" | gunzip -c |\
    awk '{printf("%s\t%s\n",$13,$9);}' |\
    sort -t $'\t' -k1,1 -k2,2nr |\
    sort -t $'\t' -k1,1 -u --stable 

5S_rRNA 1
7SK 3
A1BG    8
A1BG-AS1    4
A1CF    15
A2M 36
A2M-AS1 3
A2ML1   36
A2ML1-AS1   2
A2ML1-AS2   2
A2MP1   9
A3GALT2 5
A4GALT  2
A4GNT   3
AAAS    16
AACS    18
AACSP1  11
AADAC   5
AADACL2 5
AADACL3 4
AADACL4 4
AADAT   14
AAED1   6
AAGAB   10
AAK1    22
AAMDC   7
AAMP    11
AANAT   7
AAR2    4
AARD    2
$ datamash -sg13 max 9 < wgEncodeGencodeBasicV19.txt | head -20

5S_rRNA 1
7SK 3
A1BG    8
A1BG-AS1    4
A1CF    15
A2M 36
A2M-AS1 3
A2ML1   36
A2ML1-AS1   2
A2ML1-AS2   2
A2MP1   9
A3GALT2 5
A4GALT  2
A4GNT   3
AAAS    16
AACS    18
AACSP1  11
AADAC   5
AADACL2 5
AADACL3 4
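
To get all four columns asked for in the question from the same table, something along these lines might work. It is a sketch under assumptions: the column layout is taken to be the UCSC genePred one with a leading bin column (txStart in column 5, txEnd in column 6, exon count in column 9, gene symbol in column 13), coordinates are treated as 0-based half-open so the length is simply end minus start, and the function name gene_summary is made up for illustration:

```shell
# Hypothetical helper: gene_name, genomic_length, max_exons, max_introns
# from a tab-separated genePred-style table (file argument or stdin).
gene_summary() {
    awk -F '\t' -v OFS='\t' '
    {
        gene = $13
        s = $5 + 0; e = $6 + 0; n = $9 + 0   # force numeric comparison
        if (!(gene in start) || s < start[gene]) start[gene] = s
        if (!(gene in end)   || e > end[gene])   end[gene]   = e
        if (!(gene in exons) || n > exons[gene]) exons[gene] = n
    }
    END {
        # A transcript with n exons has n - 1 introns,
        # so the maximum intron count is the maximum exon count minus 1.
        for (gene in start)
            print gene, end[gene] - start[gene], exons[gene], exons[gene] - 1
    }' "$@" | sort -k1,1
}
```

Usage would be `gene_summary wgEncodeGencodeBasicV19.txt`, or piping the decompressed download into `gene_summary`. The genomic length here is the span from the leftmost transcript start to the rightmost transcript end of each gene symbol. Symbols annotated on more than one chromosome (for example in the X/Y pseudoautosomal regions) would get a meaningless span, so the chromosome in column 3 would have to be added to the key in that case.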