Download Genes according to the number of Introns
8.2 years ago
fusion.slope ▴ 250

Hello,

I was wondering if there is any way to download genes according to the number of introns they have. For example:

download all the genes with 2 introns, all the genes with 3 introns, etc., in mouse.

Any ideas are much appreciated!

Cheers

RNA-Seq Introns • 1.7k views
8.2 years ago

download all the genes with 2 introns

Using UCSC data (in refGene.txt, column 9 is the number of exons and column 13 is the gene name; a transcript with 2 introns has 3 exons, hence the $9==3 filter):

curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/refGene.txt.gz" | gunzip -c | awk -F '\t' '($9==3)' | cut -f 13 | sort | uniq 

(...)
Zg16
Zic1
Zic2
Zic3
Zkscan14
Zkscan4
Znrd1as
Znrf1
Zscan2
Zscan22
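
The question also asks for 3, 4, ... n introns. A minimal sketch that extends the one-liner to every intron count at once, assuming the same refGene.txt column layout (column 9 = exon count, column 13 = gene name). The function name introns_per_transcript and the output file name are made up for illustration:

```shell
# Hypothetical helper: reads refGene-format rows (tab-separated) on stdin and
# prints "intron_count<TAB>gene_name", one line per distinct pair.
# Introns are computed per transcript as exon count minus one.
introns_per_transcript() {
    awk -F '\t' 'BEGIN { OFS = "\t" } { print $9 - 1, $13 }' \
        | LC_ALL=C sort -t "$(printf '\t')" -k1,1n -k2,2 -u
}

# Example use (needs network access, so it is left commented out):
# curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/refGene.txt.gz" \
#   | gunzip -c | introns_per_transcript > introns_by_gene.tsv
# awk -F '\t' '$1 == 2 { print $2 }' introns_by_gene.tsv   # genes with a 2-intron transcript
```

A gene appears once for every distinct intron count found among its transcripts, so the same name can show up under several counts.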

It should be pointed out that this counts introns per transcript, not introns per gene; the latter is rather poorly defined.
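
If a single number per gene is needed anyway, some convention has to be picked. A rough sketch of one possible choice, taking the transcript with the most exons for each gene; this is an assumed convention, not a standard definition, and the function name max_introns_per_gene is hypothetical. Input is refGene-format rows as in the command above:

```shell
# Hypothetical helper: collapses refGene-format rows on stdin to one line per
# gene, "gene_name<TAB>max_introns", using the transcript with the most exons.
# Taking the maximum is an assumption made for this sketch.
max_introns_per_gene() {
    awk -F '\t' '
        { n = $9 + 0 }                               # exon count of this transcript
        !($13 in max) || n > max[$13] { max[$13] = n }
        END { for (g in max) printf "%s\t%d\n", g, max[g] - 1 }
    ' | LC_ALL=C sort
}

# Example use on a local copy of the table:
# gunzip -c refGene.txt.gz | max_introns_per_gene > max_introns.tsv
```

Other conventions (the minimum, or a designated canonical transcript) would give different numbers for genes with several isoforms.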

8.2 years ago

Probably the easiest way would be to download the entire annotation database and process it to get what you need.

What do you mean by 'download all the genes'? Do you need just the names, the sequences, the locations, ...?


Something like this:

For all mouse genes, how many have 2 introns? Then get the names of those genes, and do the same for 3, 4, 5, ... n introns.

I was thinking of writing a script that works from the annotation file, but I was wondering whether there is already a platform, tool, R package, etc. that does this.


Additionally, what are you going to do with alternative transcripts? A single gene may have 2, 3, 5, 7 or 34 exons depending on which transcript is used.
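
To see how much this matters for a given annotation, here is a small sketch that lists the distinct exon counts found among the transcripts of each gene. It assumes the UCSC refGene.txt layout (column 9 = exon count, column 13 = gene name); the function name exon_counts_per_gene is hypothetical:

```shell
# Hypothetical helper: reads refGene-format rows on stdin and prints
# "gene_name<TAB>comma-separated distinct exon counts" for each gene.
exon_counts_per_gene() {
    awk -F '\t' 'BEGIN { OFS = "\t" } { print $13, $9 }' \
        | LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n -u \
        | awk -F '\t' '
            # A new gene starts: flush the previous one, then start a fresh list
            $1 != gene { if (NR > 1) print gene "\t" counts; gene = $1; counts = $2; next }
            { counts = counts "," $2 }
            END { if (NR > 0) print gene "\t" counts }
        '
}

# Example use on a local copy of the table:
# gunzip -c refGene.txt.gz | exon_counts_per_gene | grep ','   # genes whose transcripts disagree
```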


I do not need the alternative transcripts; I just need this information for what I have to do. By the way, thanks!

8.2 years ago
anp375 ▴ 190

I'd really appreciate it if someone proof-checks this:

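Download the mouse GTF. For protein-coding genes only:

# Keep only lines annotated as protein_coding
grep "protein_coding" mousefile.gtf > protein_coding.gtf
# Keep exon records and split the attribute column into tab-separated tokens
grep -e $'\texon\t' protein_coding.gtf | tr ' ' '\t' | tr -d '"' | tr -d ';' > exons.gtf
# Count distinct exons per gene, pooled across transcripts; column 16 is assumed to hold the gene name (depends on the attribute order in your GTF)
cut -f 1-5,16 exons.gtf | sort | uniq | cut -f 6 | sort | uniq -c > Maximum_number_of_exons_for_each_gene.txt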

This doesn't really give the number of introns (it counts exons), but maybe it'll help. If you use it, run each step separately to check that nothing weird happens.
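
The pipeline above depends on the gene name sitting in a fixed column once the attributes are split, which varies between GTF releases. As a possible alternative, here is a sketch that looks up attributes by key and counts exons per transcript, reporting introns as exons minus one. It assumes a GTF whose exon records carry gene_id and transcript_id attributes; the function name introns_from_gtf is made up for illustration:

```shell
# Hypothetical helper: reads GTF on stdin and prints
# "gene_id<TAB>transcript_id<TAB>intron_count" for each transcript.
introns_from_gtf() {
    awk -F '\t' '
        # Return the quoted value of attribute "key" from column 9, or "" if absent
        function attr(key,    re, s) {
            re = key " \"[^\"]*\""
            if (!match($9, re)) return ""
            s = substr($9, RSTART, RLENGTH)
            sub(/^[^"]*"/, "", s)    # strip the key and the opening quote
            sub(/"$/, "", s)         # strip the closing quote
            return s
        }
        $3 == "exon" {
            t = attr("transcript_id")
            gene[t] = attr("gene_id")
            exons[t]++
        }
        END { for (t in exons) printf "%s\t%s\t%d\n", gene[t], t, exons[t] - 1 }
    ' | LC_ALL=C sort
}

# Example use: transcripts with exactly 2 introns
# grep "protein_coding" mousefile.gtf | introns_from_gtf | awk -F '\t' '$3 == 2'
```

Since this counts per transcript, the earlier caveat about alternative transcripts still applies when collapsing to one value per gene.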
