Question

Querying the UCSC Genome Browser for DNA sequences (large collection of data)

10.5 years ago

mshumph2 • 0

I have a spreadsheet with 500 different positions on different chromosomes, and I'd like to pull out the DNA sequences between those positions. The spreadsheet is already set up in a way that could easily be related to the UCSC Genome Browser database if only I had a way to either upload my spreadsheet to the database or download the necessary tables. It seems like there must be a table that relates the position on the chromosome to a specific nucleotide, so I feel like if I found that table I could do this. So my question is, does anyone know of a way to do this? Is there an easier way to do this?

I tried connecting remotely to UCSC's MySQL server so that I could access the tables through MS Access, but I couldn't connect to it. I'm also somewhat familiar with Biopython if there's an easier way to do this using another database like NCBI's Nucleotide database.

Thanks

sequence • 3.2k views

If the sequences are all from the same genome, I would recommend downloading the 2bit file for that genome and extracting the regions with a command-line tool such as twoBitToFa.

For hg19, download this file (778 MB) and access it with this linux software.
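As a rough sketch of how one region could be pulled out this way, the Python snippet below builds a twoBitToFa command line. It assumes the spreadsheet holds 1-based, inclusive positions; twoBitToFa itself takes a 0-based start and a non-inclusive end through its `-seq`, `-start` and `-end` options. The file names `hg19.2bit` and `region.fa` and the helper `twobit_command` are placeholders invented for illustration.

```python
def twobit_command(twobit_path, chrom, start_1based, end_1based, out_path):
    """Build a twoBitToFa argument list for one region.

    Assumes the input positions are 1-based and inclusive. twoBitToFa
    expects a 0-based start and a non-inclusive end, so only the start
    is shifted down by one.
    """
    return [
        "twoBitToFa",
        f"-seq={chrom}",
        f"-start={start_1based - 1}",
        f"-end={end_1based}",
        twobit_path,   # hypothetical local copy of the genome 2bit file
        out_path,      # hypothetical output FASTA file
    ]

cmd = twobit_command("hg19.2bit", "chr1", 1000001, 1000100, "region.fa")
print(" ".join(cmd))

# To actually run it (needs twoBitToFa on PATH and the 2bit file on disk):
# import subprocess
# subprocess.run(cmd, check=True)
```

For 500 regions it is probably simpler to write all of them to a BED file and pass it once with twoBitToFa's `-bed=` option rather than calling the tool once per row.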

If you'd prefer to do it in R, check out the BSgenome and Biostrings packages from Bioconductor.

-Micah

written 10.5 years ago by micahgearhart ▴ 40

Ram · Answer 1 · 2014-09-07

This is from the UCSC FAQ:

Download the appropriate fasta files from our ftp server and extract sequence data using your own tools or the tools from our source tree. This is the recommended method when you have very large sequence datasets or will be extracting data frequently.

This is effectively what micahgearhart suggests. To get sequences from coordinates, you could use getfasta in bedtools.
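To make the "your own tools" route concrete, here is a minimal Python sketch that slices a region out of FASTA text using BED-style coordinates (0-based start, end not included). The helper names `read_fasta` and `extract` and the toy sequences are made up for illustration. Reading a whole genome into memory like this is only reasonable for small files; for real data, `bedtools getfasta -fi genome.fa -bed regions.bed -fo out.fa` does the same job on an indexed FASTA (the three file names there are placeholders).

```python
def read_fasta(text):
    """Parse FASTA text into a dict of {sequence name: sequence string}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def extract(records, chrom, start, end):
    """Return the bases for a BED-style interval (0-based, end excluded)."""
    return records[chrom][start:end]

# Toy data standing in for a downloaded genome FASTA file.
toy = ">chrA\nACGTACGT\nTTGGCCAA\n>chrB\nGGGGCCCC\n"
genome = read_fasta(toy)
print(extract(genome, "chrA", 2, 10))  # prints GTACGTTT
```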

Ram · Answer 2 · 2014-09-07

10.4 years ago

mickael.leclercq ▴ 30

Have you tried to use Galaxy?

Step 1: Upload your coordinates in a supported format (BED, GFF, ...) using "Get Data" and then "Upload File"

Step 2: Use the tool "Extract Genomic DNA" in the "Fetch sequences" category
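If the spreadsheet is not already in BED format, a small conversion script may help before uploading. The sketch below assumes the spreadsheet was exported as CSV with columns named `chrom`, `start` and `end` (hypothetical names) and that the positions are 1-based and inclusive. BED uses a 0-based start and an exclusive end, so only the start is shifted.

```python
import csv
import io

def csv_to_bed(csv_text):
    """Convert CSV rows of chrom,start,end (1-based, inclusive) to BED text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        start0 = int(row["start"]) - 1  # BED starts are 0-based
        end = int(row["end"])           # BED ends are exclusive, so unchanged
        lines.append(f"{row['chrom']}\t{start0}\t{end}")
    return "\n".join(lines) + "\n"

# Two made-up rows standing in for the exported spreadsheet.
example = "chrom,start,end\nchr1,1000001,1000100\nchrX,500,750\n"
bed_text = csv_to_bed(example)
print(bed_text, end="")
```

The resulting text can be saved with a `.bed` extension and uploaded in Step 1.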