Question

How to download all premRNA sequences (exons + introns) for human GRCh38 with Ensembl IDs?

0

6.1 years ago

O.rka ▴ 740

I'm not seeing anything on the FTP website that has this info: ftp://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/

My main goal is to get the exons in uppercase and the introns in lowercase. I tried using this, but it doesn't work for many of the identifiers and takes a long time:

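# Download Ensembl Sequences
import sys

import requests
from tqdm import tqdm

def download_ensembl(ids, path_fasta="./premrna.fa", mode="premrna", file_log=sys.stderr, drop_N=True):
    """
    Download transcript sequences to `path_fasta` via the Ensembl REST API.
    `mode` can be either `premrna` (introns masked to lowercase) or `cds`.
    `path_fasta` can be a path or an already open, writable file handle.
    """
    server = "https://rest.ensembl.org"
    # Build the query string once; FASTA is requested through the header in
    # both modes so that the response always starts with a `>` header line
    if mode == "premrna":
        query = "?mask_feature=1"
    elif mode == "cds":
        query = "?type=cds"
    else:
        raise ValueError("`mode` must be `premrna` or `cds`, got `%s`" % mode)
    # Open a file only when given a path; a caller-supplied handle stays open
    opened_here = isinstance(path_fasta, str)
    handle = open(path_fasta, "w") if opened_here else path_fasta
    try:
        for id_query in tqdm(ids):
            url = server + "/sequence/id/" + id_query + query
            try:
                r = requests.get(url, headers={"Content-Type": "text/x-fasta"})
                r.raise_for_status()
            except requests.HTTPError:
                print("\nHTTPError: Invalid identifier `%s`" % id_query, file=file_log)
                continue
            # First line is the FASTA header, the rest is the wrapped sequence
            seq_record = r.text.strip().split("\n")
            header = seq_record[0][1:]
            seq = "".join(seq_record[1:])
            if drop_N:
                seq = seq.replace("n", "").replace("N", "")
            print(">%s\n%s" % (header, seq), file=handle)
    finally:
        if opened_here:
            handle.close()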

RNA-Seq database ensembl human sequences • 2.3k views

updated 6.1 years ago by Biostar 20 • written 6.1 years ago by O.rka ▴ 740

1

Hello O.rka,

To work with multiple identifiers, you should use the POST endpoint. With it you send a single request and get the whole result back, so there is no need to start a new request for every identifier.

I tried using this but it doesn't work for many of the identifiers

Can you give examples please?

fin swimmer

6.1 years ago by finswimmer 16k
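
The POST suggestion above could look roughly like the sketch below. It is a minimal illustration, not a tested replacement for the script in the question. It assumes the `/sequence/id` POST endpoint accepts a JSON body of the form `{"ids": [...]}`, takes `mask_feature` as a query parameter, and returns a JSON list of records with `id` and `seq` keys. The batch size of 50 is also an assumption, so check the REST documentation for the current per-request limit. The helper names (`chunked`, `to_fasta`, `fetch_batch`, `download_batched`) are made up for this example, and the standard-library `urllib` is used instead of `requests` only to keep the sketch dependency-free.

```python
import json
import urllib.parse
import urllib.request

SERVER = "https://rest.ensembl.org"
BATCH_SIZE = 50  # assumed per-request ID limit; verify against the REST docs


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def to_fasta(records):
    """Format records (dicts with `id` and `seq` keys) as FASTA text."""
    return "".join(">%s\n%s\n" % (rec["id"], rec["seq"]) for rec in records)


def fetch_batch(ids, mask_feature=True):
    """POST one batch of IDs and return the decoded JSON response."""
    query = urllib.parse.urlencode({"mask_feature": int(mask_feature)})
    request = urllib.request.Request(
        SERVER + "/sequence/id?" + query,
        data=json.dumps({"ids": list(ids)}).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def download_batched(ids, path_fasta="premrna.fa"):
    """Write all sequences to `path_fasta`, one POST request per batch."""
    with open(path_fasta, "w") as handle:
        for batch in chunked(ids, BATCH_SIZE):
            handle.write(to_fasta(fetch_batch(batch)))
```

With roughly 100,000 transcript IDs this would reduce the number of HTTP requests by about the batch size, although identifiers the server rejects would still need to be tracked down separately.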

0

Definitely not going to try and help until we get those examples Fin asked for.

6.1 years ago by Emily 24k

0

This can be tricky, since exons can overlap. However, to download all exonic sequences you could use BioMart from Ensembl. You could then also download all unspliced transcripts and change the exonic sequences within the respective transcripts to uppercase.

6.1 years ago by caggtaagtat ★ 1.9k
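
The recasing step described above can be sketched in a few lines. This assumes you already have each unspliced transcript sequence and its exon coordinates converted to 0-based, half-open offsets relative to the start of that transcript (on the transcript's own strand). That conversion from genomic coordinates is not shown here, and `mark_exons` is a hypothetical helper name.

```python
def mark_exons(sequence, exons):
    """Return `sequence` with exons in uppercase and everything else lowercase.

    `exons` is an iterable of (start, end) pairs, 0-based and half-open,
    relative to the start of the unspliced transcript.
    """
    # Start from an all-lowercase copy, then uppercase each exon interval
    chars = list(sequence.lower())
    for start, end in exons:
        chars[start:end] = sequence[start:end].upper()
    return "".join(chars)


# Toy transcript: two 3-base exons separated by a 4-base intron
print(mark_exons("ACGTACGTAC", [(0, 3), (7, 10)]))
```

Because each transcript is handled on its own, overlapping exons from different transcripts of the same gene do not interfere with one another.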

0

Is there any way to make the script I have above quicker? It's going to take forever for hg19: 1%| | 1226/104763 [54:47<192:16:23, 6.69s/it]

6.1 years ago by O.rka ▴ 740

score 0 · Answer 1 · 2018-12-20

0

6.1 years ago

popayekid55 ▴ 110

Download all the unspliced transcripts from Ensembl BioMart, then mask them using bedtools maskfasta.

6.1 years ago by popayekid55 ▴ 110
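
One way to prepare the input for the masking step in this answer is to write the intron intervals of each transcript as BED lines. The sketch below assumes exon coordinates are already 0-based, half-open offsets within the unspliced transcript, and that the first BED column matches the FASTA header of the downloaded sequence exactly. The function name `intron_bed_lines` and the file names in the comment are hypothetical.

```python
def intron_bed_lines(transcript_id, length, exons):
    """Return BED lines covering every non-exonic stretch of a transcript.

    `exons` holds (start, end) pairs, 0-based and half-open, relative to
    the unspliced transcript of total length `length`.
    """
    lines = []
    cursor = 0  # end of the last exon seen so far
    for start, end in sorted(exons):
        if start > cursor:
            lines.append("%s\t%d\t%d" % (transcript_id, cursor, start))
        cursor = max(cursor, end)
    if cursor < length:
        lines.append("%s\t%d\t%d" % (transcript_id, cursor, length))
    return lines


# The resulting file could then be passed to bedtools with soft-masking, e.g.:
#   bedtools maskfasta -fi unspliced.fa -bed introns.bed -fo premrna.fa -soft
print("\n".join(intron_bed_lines("T1", 10, [(0, 3), (7, 10)])))
```

The `-soft` flag makes bedtools lowercase the masked intervals instead of replacing them with N, so this only gives uppercase exons if the downloaded sequence is uppercase to begin with.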