6.9 years ago
caggtaagtat
Hi,
I would like to download the FASTA files of all genes without an intron.
In R you could do something like this to get all transcripts with only one exon (and thus no introns):
library(biomaRt)
library(dplyr)
# connect to the Ensembl human gene dataset
ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# will give you one line per exon
gene <- getBM(attributes = c('ensembl_gene_id', 'ensembl_transcript_id', 'hgnc_symbol', 'ensembl_exon_id'), mart = ensembl)
# extract all transcripts with only one exon
# (count distinct exon IDs so that duplicated rows do not inflate the count)
gene.oneExon <-
  gene %>%
  group_by(ensembl_transcript_id) %>%
  summarise(n = n_distinct(ensembl_exon_id)) %>%
  filter(n == 1)
# get the sequences of the associated transcripts
gene.oneExon.seq <- getSequence(id = gene.oneExon$ensembl_transcript_id, type = "ensembl_transcript_id", seqType = "gene_exon", mart = ensembl)
Thank you! I can't connect to the server from outside the university clinic via R, I think because of restrictions in my building. However, I will download all human exons manually from BioMart and select only those which have only one transcript!
OK, so you have to download the following attributes: "Transcript stable ID" and "Exon stable ID". Download them, then in R keep the transcripts that have only one exon and write their IDs to a file (tid_onExon.txt).
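A minimal sketch of that R step, in case it helps. It assumes the export is a tab-separated file with the two column headers named above; the helper name single_exon_transcripts, the input file name mart_export.txt and the toy IDs are only placeholders for illustration, not something BioMart guarantees.

```r
# Hypothetical helper: return the IDs of transcripts that have exactly one exon.
# `exons` is a data frame with one row per (transcript, exon) pair.
single_exon_transcripts <- function(exons,
                                    tx_col = "Transcript stable ID",
                                    exon_col = "Exon stable ID") {
  # drop duplicated transcript/exon pairs before counting
  pairs <- unique(exons[, c(tx_col, exon_col)])
  # number of distinct exons per transcript
  n_exons <- table(pairs[[tx_col]])
  names(n_exons)[as.vector(n_exons) == 1]
}

# Toy input standing in for the BioMart export (IDs are made up)
toy <- data.frame(
  "Transcript stable ID" = c("ENST_A", "ENST_A", "ENST_B", "ENST_C", "ENST_C", "ENST_C"),
  "Exon stable ID"       = c("ENSE_1", "ENSE_2", "ENSE_3", "ENSE_4", "ENSE_5", "ENSE_6"),
  check.names = FALSE
)
tid <- single_exon_transcripts(toy)
print(tid)

# With the real export (file name is an assumption) it would look like:
# exons <- read.delim("mart_export.txt", check.names = FALSE)
# writeLines(single_exon_transcripts(exons), "tid_onExon.txt")
```

Only base R is used here, so nothing needs to be installed and no connection to the Ensembl server is required.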
Then in BioMart, in the Filter tab, you can filter on Transcript stable ID by giving the tid_onExon.txt file as input under "Input external references ID list". In the Attributes tab, you can then choose cDNA sequences.
Hi, thank you for your fast reply!
I just went to the BioMart website and downloaded all coding sequences, after setting the filters so that there is only one transcript and that transcript is protein coding. So I think I now have what I wanted.
However, I didn't quite understand what the difference is between downloading the coding sequence and downloading the cDNA sequence. Does it make a difference in intronless genes? Thanks again!
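cDNA is the actual transcribed sequence, i.e. the mature transcript including the 5' and 3' untranslated regions (UTRs). The coding sequence is only the part that will be translated into protein, from the start codon to the stop codon. So yes, it still makes a difference in intronless genes: there is no splicing, but the cDNA can be longer than the coding sequence because of the UTRs.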
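To make the difference concrete, here is a toy illustration in R. The sequences are invented and only meant to show how the two downloads relate for a single-exon transcript: the coding sequence sits inside the cDNA, flanked by the UTRs.

```r
# Toy sequences, invented for illustration (not real gene sequences)
utr5 <- "GGCACGAGG"         # 5' UTR: transcribed but not translated
cds  <- "ATGGCTTGTAAATAA"   # coding sequence: starts with ATG, ends with the stop codon TAA
utr3 <- "CTTGCAATAAAGC"     # 3' UTR: transcribed but not translated

# the cDNA of an intronless transcript is simply 5' UTR + CDS + 3' UTR
cdna <- paste0(utr5, cds, utr3)

# locate the coding sequence inside the cDNA and cut it back out
cds_start <- as.integer(regexpr(cds, cdna, fixed = TRUE))
extracted <- substr(cdna, cds_start, cds_start + nchar(cds) - 1)

cat("cDNA length:", nchar(cdna), "\n")
cat("CDS length: ", nchar(cds), "\n")
cat("CDS recovered from cDNA:", identical(extracted, cds), "\n")
```

If you want to translate the sequences or predict the protein, the coding sequence is the one to download; if you are interested in the whole transcript (for example the UTRs), take the cDNA.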