Question

Download all intronless human genes.

1

6.9 years ago

caggtaagtat ★ 1.9k

Hi,

I would like to download the FASTA files of all genes without an intron.

R sequence gene genome • 1.9k views

ADD COMMENT • link updated 6.9 years ago by cpad0112 21k • written 6.9 years ago by caggtaagtat ★ 1.9k

score 5 · Accepted Answer · 2017-12-15

5

6.9 years ago

Nicolas Rosewick 11k

In R you could do something like this to get all transcripts with only one exon (and therefore no introns):

library(biomaRt)
library(dplyr)

ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

# one row per exon of each transcript
gene <- getBM(
  attributes = c("ensembl_gene_id", "ensembl_transcript_id", "hgnc_symbol", "ensembl_exon_id"),
  mart = ensembl
)

# keep transcripts that occur in exactly one row, i.e. that have a single exon
gene.oneExon <-
  gene %>%
  group_by(ensembl_transcript_id) %>%
  summarise(n = n()) %>%
  filter(n == 1)

# fetch the exon sequence of each single-exon transcript
gene.oneExon.seq <- getSequence(id = gene.oneExon$ensembl_transcript_id, type = "ensembl_transcript_id", seqType = "gene_exon", mart = ensembl)
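
Since the goal is a FASTA file, the data frame returned by getSequence() still has to be written out. Below is a minimal base-R sketch on a toy data frame. It assumes the result has one sequence column and one ID column, named here after the seqType and type arguments. The IDs, the sequences and the helper name write_fasta are made up for illustration. biomaRt also provides exportFASTA() for this step.

```r
# Toy stand-in for a getSequence() result (hypothetical IDs and sequences).
seqs <- data.frame(
  gene_exon = c("ATGGCGTAA", "ATGTTTCCCTGA"),
  ensembl_transcript_id = c("ENST_TOY_1", "ENST_TOY_2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: writes one ">id" header line followed by the sequence.
write_fasta <- function(df, id_col, seq_col, path) {
  records <- paste0(">", df[[id_col]], "\n", df[[seq_col]])
  writeLines(records, path)
  invisible(path)
}

fasta_path <- tempfile(fileext = ".fa")
write_fasta(seqs, "ensembl_transcript_id", "gene_exon", fasta_path)

# Read the file back to inspect what was written.
fasta_lines <- readLines(fasta_path)
print(fasta_lines)
```

For the real data, the same helper would be called with gene.oneExon.seq and a file name of your choice.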

ADD COMMENT • link 6.9 years ago by Nicolas Rosewick 11k

0

Thank you! I can't connect to servers outside the university clinic from R, I think because of network restrictions in my building. However, I will download all human exons manually from BioMart and select only those which have only one transcript!

ADD REPLY • link 6.9 years ago by caggtaagtat ★ 1.9k

0

OK, so you have to download the following attributes: "Transcript stable ID" and "Exon stable ID". Download the file and then, in R:

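library(readr)
library(dplyr)

# "file.txt" is the BioMart TSV export; skip = 1 assumes it starts with a header row
gene <- read_tsv("file.txt", skip = 1, col_names = c("ensembl_transcript_id", "exon_id"))
# keep transcripts with exactly one exon
gene.oneExon <-
  gene %>%
  group_by(ensembl_transcript_id) %>%
  summarise(n = n()) %>%
  filter(n == 1)
# write one transcript ID per line
write.table(gene.oneExon$ensembl_transcript_id, "tid_oneExon.txt", sep = "\t", col.names = FALSE, row.names = FALSE, quote = FALSE)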

Then, in the Filters tab in BioMart, you can filter on Transcript stable ID by giving the tid_oneExon.txt file as input under "Input external references ID list". In the Attributes tab, you can then choose cDNA sequences.
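
The one-exon filter can be checked on a small table before running it on the full export. Here is a sketch in base R (so no dplyr is needed), with made-up transcript and exon IDs:

```r
# Toy transcript-to-exon table (hypothetical IDs), one row per exon.
exons <- data.frame(
  ensembl_transcript_id = c("T1", "T1", "T2", "T3", "T3", "T3", "T4"),
  exon_id = c("E1", "E2", "E3", "E4", "E5", "E6", "E7"),
  stringsAsFactors = FALSE
)

# Count the rows (exons) per transcript.
exon_counts <- table(exons$ensembl_transcript_id)

# Keep the transcripts that appear exactly once.
one_exon_ids <- names(exon_counts)[exon_counts == 1]
print(one_exon_ids)
```

Here T2 and T4 are the only transcripts with a single row, so they are the ones kept.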

ADD REPLY • link 6.9 years ago by Nicolas Rosewick 11k

1

Hi, thank you for your fast reply!

I just went to the BioMart website and downloaded all coding sequences, after setting the filters so that there is only one transcript and the transcript is protein coding. So I think I now have what I wanted.

However, I didn't quite understand the difference between downloading the coding sequence and downloading the cDNA sequence. Does it make a difference for intronless genes? Thanks again!

ADD REPLY • link 6.9 years ago by caggtaagtat ★ 1.9k

1

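cDNA is the actual transcribed sequence, i.e. the whole spliced transcript including the 5' and 3' UTRs. The coding sequence (CDS) is only the part that will be translated into protein, from the start codon to the stop codon. So yes, it makes a difference even for intronless genes whenever the transcript has UTRs.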
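
As a toy illustration for a single-exon transcript: the cDNA is the full transcript, and the CDS is the stretch inside it that runs from the start codon to the stop codon. The sequence and the CDS coordinates below are made up; lower case marks untranslated sequence, upper case marks coding sequence.

```r
# Hypothetical single-exon transcript: 5' UTR + CDS + 3' UTR.
cdna <- paste0("ggcacc", "ATGGCTTGA", "ttaaa")

# Hypothetical 1-based CDS coordinates within the cDNA.
cds_start <- 7
cds_end <- 15

# The CDS is the substring between those coordinates.
cds <- substr(cdna, cds_start, cds_end)

print(nchar(cdna))
print(cds)
```

The cDNA download would give the whole string, while the coding sequence download would give only the upper-case part.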

ADD REPLY • link 6.9 years ago by Nicolas Rosewick 11k