In my GFF file, how do I select and generate a list of genes with more than one exon in the 3'UTR region?
Using a GTF file and bioalcidaejdk (http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html):
java -jar dist/bioalcidaejdk.jar -F GTF -e 'stream().flatMap(G->G.getTranscripts().stream()).filter(T->T.getExonCount()>1 && T.getTranscriptUTR3().isPresent()).map(T->T.getTranscriptUTR3().get()).filter(UTR->UTR.getIntervals().size()>1).forEach(U->println(U.getTranscript().getGene().getId()+" "+U.getTranscript().getGene().getGeneName()+" "+U.getTranscript().getId()+" "+U.getIntervals().size()));' chr22.gtf.gz | sort -t ' ' -k4,4n
(...)
ENSG00000093009.11 CDC45 ENST00000438587.6 17
ENSG00000100412.17 ACO2 ENST00000676714.1 17
ENSG00000100412.17 ACO2 ENST00000678819.1 17
ENSG00000100429.18 HDAC10 ENST00000626012.2 17
ENSG00000184381.20 PLA2G6 ENST00000668499.1 17
ENSG00000286070.2 AP000356.5 ENST00000652248.1 17
ENSG00000099949.21 LZTR1 ENST00000642151.1 19
ENSG00000100023.20 PPIL2 ENST00000680434.1 19
ENSG00000100106.22 TRIOBP ENST00000344404.10 19
ENSG00000100325.15 ASCC2 ENST00000458594.5 19
ENSG00000100412.17 ACO2 ENST00000677698.1 19
ENSG00000100023.20 PPIL2 ENST00000417788.5 20
ENSG00000100150.20 DEPDC5 ENST00000642771.1 20
ENSG00000100150.20 DEPDC5 ENST00000645494.1 20
ENSG00000242259.9 C22orf39 ENST00000509549.5 21
ENSG00000254413.8 CHKB-CPT1B ENST00000453634.5 21
ENSG00000100150.20 DEPDC5 ENST00000644162.1 24
ENSG00000100150.20 DEPDC5 ENST00000645755.1 28
ENSG00000284431.1 AL022238.3 ENST00000639722.1 29
ENSG00000133454.16 MYO18B ENST00000539302.5 32
ENSG00000100150.20 DEPDC5 ENST00000642684.1 37
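The output above has one row per transcript, while the question asks for a list of genes. A minimal sketch of collapsing such rows to one line per gene follows. It assumes whitespace-separated rows of gene ID, gene name, transcript ID and count, as printed above; the function name `genes_from_rows` is made up for illustration and is not part of jvarkit.

```python
def genes_from_rows(rows):
    """Collapse 'gene_id gene_name transcript_id count' rows to one entry per gene.

    Keeps, for each gene, the largest count seen across its transcripts.
    Rows that do not have four fields with a numeric last field
    (headers, "(...)" elisions) are skipped.
    """
    best = {}
    for row in rows:
        parts = row.split()
        if len(parts) != 4 or not parts[3].isdigit():
            continue
        gene_id, gene_name, _transcript_id, count = parts
        key = (gene_id, gene_name)
        best[key] = max(best.get(key, 0), int(count))
    return sorted(best.items())


# A few rows copied from the output above.
rows = [
    "ENSG00000100150.20 DEPDC5 ENST00000644162.1 24",
    "ENSG00000100150.20 DEPDC5 ENST00000645755.1 28",
    "ENSG00000133454.16 MYO18B ENST00000539302.5 32",
    "ENSG00000100150.20 DEPDC5 ENST00000642684.1 37",
]
for (gene_id, gene_name), count in genes_from_rows(rows):
    print(gene_id, gene_name, count)
```

To use it on the real output, pass `sys.stdin` (or an open file) instead of the `rows` list.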
Using gffutils and datamash, on the GTF from https://raw.githubusercontent.com/csoneson/jcc/master/inst/extdata/Homo_sapiens.GRCh38.90.chr22.gtf.gz (after decompressing it):
$ gtf_extract -f three_prime_utr --fields gene_id,gene_name,transcript_id,feature Homo_sapiens.GRCh38.90.chr22.gtf | datamash -f -s -g2,3 count 3 | awk -v OFS="\t" 'NR ==1 {print "Gene_ID", "Gene_symbol","transcript", "3\047_UTR_Count"}; $5 >1 {print $1,$2,$3,$5}'
Gene_ID Gene_symbol transcript 3'_UTR_Count
ENSG00000283809 AC007326.4 ENST00000638240 2
ENSG00000099889 ARVCF ENST00000263207 2
ENSG00000099968 BCL2L13 ENST00000399777 2
ENSG00000099968 BCL2L13 ENST00000498133 3
ENSG00000015475 BID ENST00000342111 3
ENSG00000242259 C22orf39 ENST00000509549 21
ENSG00000093009 CDC45 ENST00000263201 2
ENSG00000093009 CDC45 ENST00000404724 2
ENSG00000093009 CDC45 ENST00000407835 2
ENSG00000093009 CDC45 ENST00000428937 5
ENSG00000093009 CDC45 ENST00000437685 2
ENSG00000099954 CECR2 ENST00000355219 3
ENSG00000070371 CLTCL1 ENST00000427926 2
ENSG00000070371 CLTCL1 ENST00000458188 4
ENSG00000070371 CLTCL1 ENST00000538828 2
ENSG00000070371 CLTCL1 ENST00000617103 8
ENSG00000070371 CLTCL1 ENST00000617926 2
ENSG00000070371 CLTCL1 ENST00000621271 2
ENSG00000070371 CLTCL1 ENST00000622493 2
ENSG00000093010 COMT ENST00000207636 2
ENSG00000183628 DGCR6 ENST00000427407 3
ENSG00000183628 DGCR6 ENST00000480608 4
ENSG00000183628 DGCR6 ENST00000483718 3
ENSG00000100056 ESS2 ENST00000434568 4
ENSG00000277870 FAM230A ENST00000624459 5
ENSG00000215568 GAB4 ENST00000465611 6
ENSG00000100084 HIRA ENST00000452818 4
ENSG00000243156 MICAL3 ENST00000495076 13
ENSG00000215193 PEX26 ENST00000474897 4
ENSG00000198062 POTEH ENST00000343518 2
ENSG00000198062 POTEH ENST00000452800 9
ENSG00000198062 POTEH ENST00000621704 2
ENSG00000184702 SEPT5 ENST00000406172 2
ENSG00000184702 SEPT5 ENST00000406395 2
ENSG00000184702 SEPT5 ENST00000431044 2
ENSG00000184702 SEPT5 ENST00000438754 2
ENSG00000184702 SEPT5 ENST00000455843 2
ENSG00000184058 TBX1 ENST00000359500 2
ENSG00000184470 TXNRD2 ENST00000400518 2
ENSG00000184470 TXNRD2 ENST00000400521 2
ENSG00000184470 TXNRD2 ENST00000400525 2
ENSG00000184470 TXNRD2 ENST00000462330 2
ENSG00000184470 TXNRD2 ENST00000474308 2
ENSG00000184470 TXNRD2 ENST00000485358 2
ENSG00000184470 TXNRD2 ENST00000542719 2
datamash is available in the package repositories of most Linux distributions.
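If installing extra tools is a problem, the same counting can be sketched with the Python standard library alone. This is only an illustration of the idea (count the `three_prime_utr` lines per transcript and keep those with more than one), not what gtf_extract or datamash do internally. The function names are made up, and the feature name is an assumption that holds for Ensembl-style GTF files like the one above; other annotation sources may label UTRs differently, so check column 3 of your file.

```python
import gzip
import re
from collections import Counter


def open_text(path):
    """Open a plain or gzip-compressed GTF file as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def count_utr3_segments(lines, feature="three_prime_utr"):
    """Count lines of the given feature type per (gene_id, gene_name, transcript_id)."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature:
            continue
        # GTF attributes look like: key "value"; key "value"; ...
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', cols[8]))
        key = (attrs.get("gene_id"), attrs.get("gene_name", ""), attrs.get("transcript_id"))
        counts[key] += 1
    return counts


def multi_segment(counts):
    """Keep only transcripts whose 3'UTR is split into more than one segment."""
    return {key: n for key, n in counts.items() if n > 1}


def gtf_line(feature, start, end, gene, name, transcript):
    """Build a toy GTF line (invented data, for the demo below only)."""
    attrs = f'gene_id "{gene}"; transcript_id "{transcript}"; gene_name "{name}";'
    return "\t".join(["22", "demo", feature, str(start), str(end), ".", "+", ".", attrs]) + "\n"


# Toy input: transcript T1 has two 3'UTR segments, T2 has one.
sample = [
    "#!toy example\n",
    gtf_line("exon", 100, 500, "G1", "A", "T1"),
    gtf_line("three_prime_utr", 400, 450, "G1", "A", "T1"),
    gtf_line("three_prime_utr", 480, 500, "G1", "A", "T1"),
    gtf_line("three_prime_utr", 900, 950, "G2", "B", "T2"),
]
for (gene_id, gene_name, transcript_id), n in sorted(multi_segment(count_utr3_segments(sample)).items()):
    print(gene_id, gene_name, transcript_id, n, sep="\t")
```

For a real file, replace `sample` with an open handle, for example `with open_text("Homo_sapiens.GRCh38.90.chr22.gtf.gz") as fh: counts = count_utr3_segments(fh)`.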
Thank you Juke, it worked well!
Thank you for sharing this interesting tool, or rather set of tools. However, it seems to have heavy requirements: on Ubuntu, for example, the dependencies need around 894 MB of disk space. Maybe the developers could trim it down to more manageable requirements. Also, mamba fails to install this tool.
Did you try in a fresh environment? You can skip R if you wish to save some space; it is only used to produce some plots. Installing R-base with conda takes 550 MB on OSX, and AGAT, which has R as a dependency, takes 955 MB in total on OSX.