Question

How to extract the DNA sequence 1000 bp around the Transcription start site (TSS) of a gene symbol in NCBI gene with R?

0

Entering edit mode

20 months ago

sunyeping ▴ 110

I am trying to extract the DNA sequence 1000 bp around the Transcription start site (TSS) of a gene, and I get a code as follows:

   # Load the biomaRt package
    library(biomaRt)

    # Connect to the Ensembl database through the biomaRt interface
    ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

    # Enter the gene symbol of interest
    gene_symbol <- "ITGAE"

    # Get the Ensembl gene ID for the gene symbol of interest
    gene_id <- getBM(filters = "external_gene_name", 
                     values = gene_symbol, 
                     attributes = "ensembl_gene_id", 
                     mart = ensembl)

    # Get the TSS position for the gene of interest
    tss <- getBM(filters = "ensembl_gene_id", 
                 values = gene_id, 
                 attributes = c("chromosome_name", "start_position", "strand"), 
                 mart = ensembl)

    # Extract the DNA sequence 1000 bp around the TSS
    seq <- getSequence(id = tss$chromosome_name, 
                       start = tss$start_position - 1000, 
                       end = tss$start_position + 1000,
                       type = 'ensembl_transcript_id',
                       seqType = 'transcript_flank',
                       mart = ensembl)

    # Print the DNA sequence
    cat(seq[[1]])

This can get the correct gene id, but cannot get the correct sequence. Do you have any idea how to correct this code? Or is there a better way to achieve the goal?

Any help will be deeply appreciated.

Transcription-start-site R biomart • 1.4k views

ADD COMMENT • link updated 20 months ago by Darked89 4.7k • written 20 months ago by sunyeping ▴ 110

0

Entering edit mode

It looks like the strand is not getting specified when running getSequence (and that there isn't an option for specifying strand) so are you getting the complimentary sequence?

ADD REPLY • link 20 months ago by jv ★ 1.8k

0

Entering edit mode

Additional info from getSequence

In MySQL mode the getSequence function is more limited and the sequence that is returned is the 5' to 3'+ strand of the genomic sequence, given a chromosome, as start and an end position. So if the sequence of interest is the minus strand, one has to compute the reverse complement of the retrieved sequence, which can be done using functions provided in the matchprobes package. The biomaRt vignette contains more examples on how to use this function.

ADD REPLY • link 20 months ago by jv ★ 1.8k

score 1 · Answer 1 · 2023-04-05

There is a bit of a problem with either getting one TSS per gene simply taking gene start and with one TSS per transcript. The first can give you TSS of a very rare transcript, the second a bunch of TSSs at the same position or off by few/10+ of bps.

Using ensembl canonical transcripts (one per gene) fixes "super rare transcript first exon gene start" but it may not be what you want.

Well, you can download GFF from ENSEMBL, select aset of genes, use https://github.com/junjunlab/GetTss to get BED with TSSs, then bedtools to extend it and extract fasta