Dear All, How would get a list with number exons for all non-coding genes? Any suggestion would be really great Thank you
Dear All, How would get a list with number exons for all non-coding genes? Any suggestion would be really great Thank you
This I made it in very rough way. But it works. The below command will give you number of exons per transcript of non-coding regions from Ensembl GTF file. http://ftp.ensemblorg.ebi.ac.uk/pub/release-74/gtf/homo_sapiens/Homo_sapiens.GRCh37.74.gtf.gz
Create Exon numbers file with transcript ID:
zcat Homo_sapiens.GRCh37.74.gtf.gz | grep -v "protein_coding" | cut -f9 | sed 's/gene_id\ \"//' | sed 's/transcript_id\ \"//' | sed 's/exon_number\ \"//' | sed 's/gene_name\ \"//' | sed 's/gene_biotype\ \"//' | sed 's/\"\;/\t/g' | awk '{print $2"\t"$1"\t"$4"\t"$5;}' | awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 "," } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' | awk -v FS="" 'BEGIN{print"Transcript\tExons"}{cnt=0;for (i=1;i<=NF;i++) if ($i==",") cnt++; print $0"\t"cnt}' | cut -f1,3 > transcript_with_numberOfExons.out
Create annotation file with transcript ID:
zcat Homo_sapiens.GRCh37.74.gtf.gz | cut -f9 | sed 's/gene_id\ \"//' | sed 's/transcript_id\ \"//' | sed 's/exon_number\ \"//' | sed 's/gene_name\ \"//' | sed 's/gene_biotype\ \"//' | sed 's/\"\;/\t/g' | awk '{print $2"\t"$1"\t"$4"\t"$5;}' | awk '!x[$0]++' > Ensembl_ENST_annotation.table
Merge the files to have complete transcript information:
join -j1 <(sort Ensembl_ENST_annotation.table) <(sort transcript_with_numberOfExons.out) -t $'\t' > Complete_annotation_exonNumbers.out
Final Output: (Last column is exon numbers for transcripts from first column)
ENST00000021776 ENSG00000020219 CCT8L1P pseudogene 1
ENST00000162863 ENSG00000186704 DTX2P1 pseudogene 9
ENST00000216407 ENSG00000256075 RP11-81H14.3 pseudogene 1
ENST00000219758 ENSG00000103472 RRN3P2 pseudogene 11
ENST00000221169 ENSG00000104725 NEFL processed_transcript 4
ENST00000222301 ENSG00000105694 TCEB1P28 pseudogene 2
ENST00000222396 ENSG00000226472 RP11-551L14.4 lincRNA 4
ENST00000223076 ENSG00000184414 RP11-44M6.3 pseudogene 1
ENST00000225587 ENSG00000108452 ZNF29P pseudogene 1
ENST00000225973 ENSG00000267267 CTD-3199J23.4 lincRNA 1
JFI: Single gene (ENSG) can have multiple transcripts/isoforms (ENSTs)
using mysql UCSC/knownGene
$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg19 -e 'select chrom,txStart,txEnd,name,strand,exonCount from knownGene where cdsStart=cdsEnd '
+-------+---------+-------+------------+--------+-----------+
| chrom | txStart | txEnd | name | strand | exonCount |
+-------+---------+-------+------------+--------+-----------+
| chr1 | 11873 | 14409 | uc001aaa.3 | + | 3 |
| chr1 | 11873 | 14409 | uc010nxr.1 | + | 3 |
| chr1 | 14361 | 16765 | uc009vis.3 | - | 4 |
| chr1 | 14361 | 19759 | uc009vit.3 | - | 9 |
| chr1 | 14361 | 19759 | uc009viu.3 | - | 10 |
| chr1 | 14361 | 19759 | uc001aae.4 | - | 10 |
| chr1 | 14361 | 29370 | uc001aah.4 | - | 11 |
| chr1 | 14361 | 29370 | uc009vir.3 | - | 10 |
| chr1 | 14361 | 29370 | uc009viq.3 | - | 7 |
| chr1 | 14361 | 29370 | uc001aac.4 | - | 11 |
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
It's unclear what you want to obtain, please elaborate. Ideal questions also have examples of input and output.
Non-coding genes? Do you mean pseudogenes? or non-coding transcripts?