Question

How to select genes with multi-exon 3'UTR

0

4.1 years ago

tianshenbio ▴ 180

In my GFF file, how do I select genes with more than one exon in the 3'UTR and generate a list of them?

rna-seq genome gff gene • 2.3k views

updated 4.1 years ago by Juke34 9.2k • written 4.1 years ago by tianshenbio ▴ 180

score 3 · Answer 1 · 2021-04-05

3

4.1 years ago

Juke34 9.2k

from AGAT:
agat_sp_manage_UTRs.pl --gff input.gff --three -n 2 --out result_folder

It will create 2 GFF files: one with all genes whose 3'UTR is made of >= 2 exons, and another with those whose 3'UTR is made of < 2 exons.
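
For readers who cannot install AGAT, the selection logic can be approximated in plain Python. This is only a minimal sketch, not AGAT's implementation. It assumes a GFF3 file in which every exonic piece of a 3'UTR is its own `three_prime_UTR` row, and in which each transcript points to its gene through `Parent`. The function names and the toy records are made up for illustration; a file without explicit UTR rows would need them derived from exon and CDS coordinates first, which this sketch does not do.

```python
from collections import defaultdict


def parse_gff3_attributes(column9):
    """Turn a GFF3 attribute string (key=value;key=value) into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def genes_with_multi_exon_utr3(gff_lines, min_segments=2):
    """Return {gene_id: largest number of 3'UTR rows found on one transcript}.

    Only genes with at least `min_segments` rows on some transcript are kept.
    """
    parent_of = {}               # feature ID -> ID of its first parent
    segments = defaultdict(int)  # transcript ID -> number of 3'UTR rows
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = parse_gff3_attributes(cols[8])
        parents = [p for p in attrs.get("Parent", "").split(",") if p]
        if cols[2].lower() == "three_prime_utr":
            for transcript_id in parents:
                segments[transcript_id] += 1
        elif "ID" in attrs and parents:
            parent_of[attrs["ID"]] = parents[0]

    best = {}
    for transcript_id, n in segments.items():
        # Fall back to the transcript ID when no gene parent was recorded.
        gene_id = parent_of.get(transcript_id, transcript_id)
        best[gene_id] = max(best.get(gene_id, 0), n)
    return {gene: n for gene, n in best.items() if n >= min_segments}


# Toy records (hypothetical IDs), tab-separated as in a real GFF3 file.
toy = [
    "##gff-version 3",
    "chr1\tdemo\tgene\t1\t1000\t.\t+\t.\tID=gene1",
    "chr1\tdemo\tmRNA\t1\t1000\t.\t+\t.\tID=tx1;Parent=gene1",
    "chr1\tdemo\tthree_prime_UTR\t700\t800\t.\t+\t.\tID=utr1;Parent=tx1",
    "chr1\tdemo\tthree_prime_UTR\t900\t1000\t.\t+\t.\tID=utr2;Parent=tx1",
    "chr1\tdemo\tgene\t2000\t3000\t.\t+\t.\tID=gene2",
    "chr1\tdemo\tmRNA\t2000\t3000\t.\t+\t.\tID=tx2;Parent=gene2",
    "chr1\tdemo\tthree_prime_UTR\t2900\t3000\t.\t+\t.\tID=utr3;Parent=tx2",
]
for gene_id, n in sorted(genes_with_multi_exon_utr3(toy).items()):
    print(gene_id, n, sep="\t")
```

To run it on a real file, pass an open file handle instead of the toy list, e.g. `genes_with_multi_exon_utr3(open("input.gff"))`.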

1

Thank you Juke, it worked well!

4.1 years ago by tianshenbio ▴ 180

0

Thank you for sharing this interesting tool, or rather set of tools. However, it seems to have huge requirements. For example, on Ubuntu the dependencies need around 894 MB of disk space. Maybe the developers could trim it down to more manageable requirements. Mamba fails to install this tool.

4.1 years ago by cpad0112 21k

0

Did you try in a fresh environment? You can skip R if you wish to save some space; it is only used to produce some plots. Installing R-base with conda takes 550 MB on OSX. AGAT has R as a dependency and takes 955 MB in total on OSX.

4.1 years ago by Juke34 9.2k

score 2 · Answer 2 · 2021-04-03

Using a GTF and bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

 java -jar dist/bioalcidaejdk.jar -F GTF -e 'stream().flatMap(G->G.getTranscripts().stream()).filter(T->T.getExonCount()>1 && T.getTranscriptUTR3().isPresent()).map(T->T.getTranscriptUTR3().get()).filter(UTR->UTR.getIntervals().size()>1).forEach(U->println(U.getTranscript().getGene().getId()+" "+U.getTranscript().getGene().getGeneName()+" "+U.getTranscript().getId()+" "+U.getIntervals().size()));'  chr22.gtf.gz | sort -t ' ' -k4,4n

(...)
ENSG00000093009.11 CDC45 ENST00000438587.6 17
ENSG00000100412.17 ACO2 ENST00000676714.1 17
ENSG00000100412.17 ACO2 ENST00000678819.1 17
ENSG00000100429.18 HDAC10 ENST00000626012.2 17
ENSG00000184381.20 PLA2G6 ENST00000668499.1 17
ENSG00000286070.2 AP000356.5 ENST00000652248.1 17
ENSG00000099949.21 LZTR1 ENST00000642151.1 19
ENSG00000100023.20 PPIL2 ENST00000680434.1 19
ENSG00000100106.22 TRIOBP ENST00000344404.10 19
ENSG00000100325.15 ASCC2 ENST00000458594.5 19
ENSG00000100412.17 ACO2 ENST00000677698.1 19
ENSG00000100023.20 PPIL2 ENST00000417788.5 20
ENSG00000100150.20 DEPDC5 ENST00000642771.1 20
ENSG00000100150.20 DEPDC5 ENST00000645494.1 20
ENSG00000242259.9 C22orf39 ENST00000509549.5 21
ENSG00000254413.8 CHKB-CPT1B ENST00000453634.5 21
ENSG00000100150.20 DEPDC5 ENST00000644162.1 24
ENSG00000100150.20 DEPDC5 ENST00000645755.1 28
ENSG00000284431.1 AL022238.3 ENST00000639722.1 29
ENSG00000133454.16 MYO18B ENST00000539302.5 32
ENSG00000100150.20 DEPDC5 ENST00000642684.1 37

score 2 · Answer 3 · 2021-04-03

Using gffutils and datamash, with the GTF from: https://raw.githubusercontent.com/csoneson/jcc/master/inst/extdata/Homo_sapiens.GRCh38.90.chr22.gtf.gz

$ gtf_extract -f three_prime_utr --fields gene_id,gene_name,transcript_id,feature Homo_sapiens.GRCh38.90.chr22.gtf  | datamash -f -s -g2,3 count 3  | awk -v OFS="\t" 'NR ==1 {print "Gene_ID", "Gene_symbol","transcript", "3\047_UTR_Count"}; $5 >1 {print $1,$2,$3,$5}' 

Gene_ID Gene_symbol transcript  3'_UTR_Count
ENSG00000283809 AC007326.4  ENST00000638240 2
ENSG00000099889 ARVCF   ENST00000263207 2
ENSG00000099968 BCL2L13 ENST00000399777 2
ENSG00000099968 BCL2L13 ENST00000498133 3
ENSG00000015475 BID ENST00000342111 3
ENSG00000242259 C22orf39    ENST00000509549 21
ENSG00000093009 CDC45   ENST00000263201 2
ENSG00000093009 CDC45   ENST00000404724 2
ENSG00000093009 CDC45   ENST00000407835 2
ENSG00000093009 CDC45   ENST00000428937 5
ENSG00000093009 CDC45   ENST00000437685 2
ENSG00000099954 CECR2   ENST00000355219 3
ENSG00000070371 CLTCL1  ENST00000427926 2
ENSG00000070371 CLTCL1  ENST00000458188 4
ENSG00000070371 CLTCL1  ENST00000538828 2
ENSG00000070371 CLTCL1  ENST00000617103 8
ENSG00000070371 CLTCL1  ENST00000617926 2
ENSG00000070371 CLTCL1  ENST00000621271 2
ENSG00000070371 CLTCL1  ENST00000622493 2
ENSG00000093010 COMT    ENST00000207636 2
ENSG00000183628 DGCR6   ENST00000427407 3
ENSG00000183628 DGCR6   ENST00000480608 4
ENSG00000183628 DGCR6   ENST00000483718 3
ENSG00000100056 ESS2    ENST00000434568 4
ENSG00000277870 FAM230A ENST00000624459 5
ENSG00000215568 GAB4    ENST00000465611 6
ENSG00000100084 HIRA    ENST00000452818 4
ENSG00000243156 MICAL3  ENST00000495076 13
ENSG00000215193 PEX26   ENST00000474897 4
ENSG00000198062 POTEH   ENST00000343518 2
ENSG00000198062 POTEH   ENST00000452800 9
ENSG00000198062 POTEH   ENST00000621704 2
ENSG00000184702 SEPT5   ENST00000406172 2
ENSG00000184702 SEPT5   ENST00000406395 2
ENSG00000184702 SEPT5   ENST00000431044 2
ENSG00000184702 SEPT5   ENST00000438754 2
ENSG00000184702 SEPT5   ENST00000455843 2
ENSG00000184058 TBX1    ENST00000359500 2
ENSG00000184470 TXNRD2  ENST00000400518 2
ENSG00000184470 TXNRD2  ENST00000400521 2
ENSG00000184470 TXNRD2  ENST00000400525 2
ENSG00000184470 TXNRD2  ENST00000462330 2
ENSG00000184470 TXNRD2  ENST00000474308 2
ENSG00000184470 TXNRD2  ENST00000485358 2
ENSG00000184470 TXNRD2  ENST00000542719 2

datamash is available in the package repositories of most Linux distributions.
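
If neither tool is at hand, the same per-transcript count can be sketched with the Python standard library. This is a rough equivalent written for illustration, not what `gtf_extract` or datamash do internally. It assumes an Ensembl-style GTF, where 3'UTR rows have the feature type `three_prime_utr`; GENCODE GTFs label both UTRs simply as `UTR`, so they would need extra logic. The function names and the toy records are hypothetical.

```python
import re
from collections import Counter

# GTF attributes look like: gene_id "X"; gene_name "Y"; transcript_id "Z";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def count_utr3_rows(gtf_lines):
    """Count three_prime_utr rows per (gene_id, gene_name, transcript_id)."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "three_prime_utr":
            continue
        attrs = dict(ATTRIBUTE.findall(cols[8]))
        key = (
            attrs.get("gene_id", ""),
            attrs.get("gene_name", ""),
            attrs.get("transcript_id", ""),
        )
        counts[key] += 1
    return counts


def multi_segment_transcripts(gtf_lines, min_rows=2):
    """Return sorted (gene_id, gene_name, transcript_id, count) tuples."""
    counts = count_utr3_rows(gtf_lines)
    return sorted(key + (n,) for key, n in counts.items() if n >= min_rows)


# Toy records (hypothetical IDs), tab-separated as in a real GTF file.
toy = [
    'chr22\tdemo\tthree_prime_utr\t100\t200\t.\t+\t.\t'
    'gene_id "ENSG_A"; transcript_id "ENST_A1"; gene_name "GENEA";',
    'chr22\tdemo\tthree_prime_utr\t300\t400\t.\t+\t.\t'
    'gene_id "ENSG_A"; transcript_id "ENST_A1"; gene_name "GENEA";',
    'chr22\tdemo\tthree_prime_utr\t900\t950\t.\t+\t.\t'
    'gene_id "ENSG_B"; transcript_id "ENST_B1"; gene_name "GENEB";',
]
print("Gene_ID", "Gene_symbol", "transcript", "3'_UTR_Count", sep="\t")
for row in multi_segment_transcripts(toy):
    print(*row, sep="\t")
```

For a gzipped file, the lines can be read with `gzip.open(path, "rt")` and passed to `multi_segment_transcripts` directly.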