Download all intronless human genes.
6.9 years ago
caggtaagtat ★ 1.9k

Hi,

I would like to download the FASTA files of all genes without an intron.

R sequence gene genome • 2.0k views
6.9 years ago

In R you could do something like this to get all transcripts with only one exon (and thus no intron):

library(biomaRt)
library(dplyr)

ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# returns one row per exon of each transcript
gene <- getBM(attributes = c('ensembl_gene_id', 'ensembl_transcript_id', 'hgnc_symbol', 'ensembl_exon_id'), mart = ensembl)

# keep the transcripts that have exactly one exon
gene.oneExon <-
  gene %>%
  group_by(ensembl_transcript_id) %>%
  summarise(n = n()) %>%
  filter(n == 1)

# fetch the exon sequence of each single-exon transcript
gene.oneExon.seq <- getSequence(id = gene.oneExon$ensembl_transcript_id, type = "ensembl_transcript_id", seqType = "gene_exon", mart = ensembl)
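
Note that a single-exon transcript does not by itself make the gene intronless, since the same gene can also have spliced transcripts. If that matters, a stricter filter could look like the base-R sketch below, which keeps only genes whose every transcript has exactly one exon. The toy table and its IDs are invented for illustration; with real data, the gene table from the getBM() call above would take its place.

```r
# Toy stand-in for the getBM() result: one row per exon (IDs are invented).
gene <- data.frame(
  ensembl_gene_id       = c("G1", "G2", "G2", "G2", "G3", "G3"),
  ensembl_transcript_id = c("T1", "T2", "T3", "T3", "T4", "T5"),
  ensembl_exon_id       = c("E1", "E2", "E3", "E4", "E5", "E6"),
  stringsAsFactors = FALSE
)

# Count exons per (gene, transcript) pair.
exon.count <- aggregate(
  ensembl_exon_id ~ ensembl_gene_id + ensembl_transcript_id,
  data = gene, FUN = length
)
names(exon.count)[3] <- "n_exons"

# For each gene, take the largest exon count over its transcripts.
max.exons <- tapply(exon.count$n_exons, exon.count$ensembl_gene_id, max)

# Keep genes where even the most exon-rich transcript has a single exon.
intronless.genes <- names(max.exons)[as.vector(max.exons == 1)]
print(intronless.genes)
```

In this toy table G2 is dropped because its transcript T3 has two exons, while G1 and G3 are kept.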

Thank you! From R I can't connect to servers outside the university clinic, I think because of restrictions in my building. Instead, I will download all human exons manually from BioMart and select only those that have only one transcript!


OK, so you have to download the following attributes: "Transcript stable ID" and "Exon stable ID". Export them to a file and then, in R:

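library(readr)
library(dplyr)

# "file.txt" is the BioMart export; skip = 1 assumes it kept its header row
gene <- read_tsv("file.txt", skip = 1, col_names = c("ensembl_transcript_id", "exon_id"))
gene.oneExon <-
  gene %>%
  group_by(ensembl_transcript_id) %>%
  summarise(n = n()) %>%
  filter(n == 1)
write.table(gene.oneExon$ensembl_transcript_id, "tid_oneExon.txt", sep = "\t", col.names = FALSE, row.names = FALSE, quote = FALSE)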

Then, in the Filters tab of BioMart, you can filter on Transcript stable ID by giving the tid_oneExon.txt file as input under "Input external references ID list". In the Attributes tab, you can then choose cDNA sequences.
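
If you want to try the offline steps before touching the real export, here is a small base-R sketch on a made-up file. It assumes the export is tab-separated with a header row; the file contents and IDs are invented, and temporary files stand in for the real input and output.

```r
# Write a toy export to a temporary file (header + one row per exon).
export <- tempfile(fileext = ".txt")
writeLines(c(
  "Transcript stable ID\tExon stable ID",
  "T1\tE1",
  "T2\tE2",
  "T2\tE3",
  "T3\tE4"
), export)

# Read it, replacing the header with short column names.
gene <- read.delim(export, header = TRUE,
                   col.names = c("ensembl_transcript_id", "exon_id"),
                   colClasses = "character")

# Count rows (exons) per transcript and keep the single-exon ones.
n.exons <- table(gene$ensembl_transcript_id)
tid.oneExon <- names(n.exons)[as.vector(n.exons == 1)]

# One ID per line, ready to upload as an ID list.
out <- tempfile(fileext = ".txt")
writeLines(tid.oneExon, out)
print(readLines(out))
```

T2 has two exon rows and is filtered out; T1 and T3 end up in the ID list.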


Hi, thank you for your fast reply!

I just went to the BioMart website and downloaded all coding sequences after setting the filters so that there is only one transcript and the transcript is protein coding. So I think I now have what I wanted.

However, I didn't quite understand the difference between downloading the coding sequence and downloading the cDNA sequence. Does it make a difference for intronless genes? Thanks again!


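cDNA is the spliced transcript sequence, including the 5' and 3' untranslated regions (UTRs). The coding sequence (CDS) is only the part that is translated into protein, from the start codon to the stop codon. So yes, the two still differ for intronless genes: the cDNA contains the UTRs and the CDS does not.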
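
To make the difference concrete, here is a toy base-R illustration. The sequences are invented and far shorter than real ones; the point is only that the cDNA carries untranslated regions (UTRs) on both sides of the coding sequence, so the two downloads differ even when there is no intron.

```r
# Invented pieces of a single-exon transcript.
utr5 <- "GGAGCC"          # 5' UTR
cds  <- "ATGGCTTGGTAA"    # start codon ... stop codon
utr3 <- "CTTTAATAAA"      # 3' UTR

# cDNA = full transcript sequence; CDS = translated part only.
cdna <- paste0(utr5, cds, utr3)

# The CDS sits inside the cDNA, offset by the length of the 5' UTR.
cds.start <- nchar(utr5) + 1
cds.end   <- nchar(utr5) + nchar(cds)
stopifnot(substr(cdna, cds.start, cds.end) == cds)

cat("cDNA length:", nchar(cdna), "\n")
cat("CDS length: ", nchar(cds), "\n")
```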
