Beginning Bioinformatics Student In Need Of Advice And Clarification.
4
6
11.3 years ago
Caitlin ▴ 100

Hi all.

I am a beginning bioinformatics student enrolled in the one bioinformatics course my community college offers. The pace of the course is relaxed. It covers Perl (a language I am very experienced with) at an extremely fundamental level, in the form of very simple programming assignments which even a programming neophyte could probably complete within 15-20 minutes; basic searching and retrieval from GenBank; multiple sequence alignment with various software tools, e.g., Clustal Omega and MUSCLE; and an introduction to BLAST.

Since I am very interested in the field of bioinformatics, I felt compelled to ask for clarification regarding several fundamental topics that are, unfortunately, not addressed in the course syllabus. Apologies if my questions are overly simplistic:

1.) If I were to download a complete human genome sequence, what format would it be in? FASTA? Would it be a monolithic FASTA file or 23 files (one per chromosome) in FASTA format?

2.) I'm interested in using either Perl or Java to examine various genes. Would locating specific genes be feasible?

3.) I have tried in vain to locate public data consisting of a "normal" gene and the same gene from an individual afflicted with cancer, for example a healthy BRCA1 and a copy of BRCA1 with mutations that lead to the development of a neoplasm. I would like to compare them and identify the locations of the mutations, etc. GenBank does not seem to store "mutated" sequence info. Rather, I have only been able to locate BRCA1 and BRCA2 sequence data for various organisms, with no indication of whether the Homo sapiens donor was afflicted with a form of cancer.

If anyone could provide some helpful feedback, I would be very appreciative. Having such a strong interest in the field and no mentor to consult is, as you may imagine, frustrating.

Thanks all.

~Caitlin

perl java cancer • 5.2k views
1

"very simple programming assignments which even a programming neophyte could probably complete within 15-20 minutes" I like this statement ;) and I wished that was true for everyone taking such courses, but I have been seeing people posting course assignments of this difficulty (EDIT: not meaning to say your specific course/class) here, trying to get immediate solutions out of biostar.

0

Thanks Michael!

;)

7
11.3 years ago
Emily 24k

Hi Caitlin

  1. We keep all the Ensembl FASTA files here: ftp://ftp.ensembl.org/pub/current_fasta. If you go into our DNA files for human (ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/), you'll see that we have not just 25 files (1-22 + X + Y + mitochondria): each of those comes unmasked (dna.chromosome), soft repeat-masked (dna_sm.chromosome) and hard repeat-masked (dna_rm.chromosome), and there is also a complete genome file plus loads of patches and haplotypes (see http://www.ensembl.org/Help/Faq?id=291 for more info on patches and haplotypes).

  2. Have a play with the Ensembl Perl API. Here's the tutorial (http://www.ensembl.org/info/docs/api/core/core_tutorial.html), the documentation (http://www.ensembl.org/info/docs/Doxygen/index.html) and the installation instructions (http://www.ensembl.org/info/docs/api/api_installation.html). We also have a REST API that you could try from Java or any other language you like: http://beta.rest.ensembl.org/documentation.

  3. We do have the option in Ensembl to search by a disease state. From the Ensembl homepage (http://www.ensembl.org/index.html) you can search for a disease, for example breast cancer. You can then get a list of all variants in the genome associated with breast cancer (http://www.ensembl.org/Homo_sapiens/Search/Details?db=core;end=906;idx=Variation;q=breast%20cancer;species=Homo_sapiens), click through to find the associated allele and the genes affected, plus a bunch of other stuff.
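On point 1, the three flavours of file differ only in how repeats are written. As far as I know the convention is that soft-masked files show repeats in lowercase, and hard-masked files replace them with N. Here is a minimal sketch in Python that tallies the three classes for a sequence string. The function name and the category labels are my own, not anything from Ensembl, and N can also mean an assembly gap, so the last category is deliberately named to cover both.

```python
def masking_summary(seq: str) -> dict:
    """Count bases by masking class in a DNA string.

    Assumed convention: uppercase = unmasked, lowercase = soft-masked repeat,
    N/n = hard-masked repeat or assembly gap (the two cannot be told apart
    from the sequence alone).
    """
    counts = {"unmasked": 0, "soft_masked": 0, "hard_or_gap": 0}
    for base in seq:
        if base in "Nn":
            counts["hard_or_gap"] += 1
        elif base.islower():
            counts["soft_masked"] += 1
        else:
            counts["unmasked"] += 1
    return counts


if __name__ == "__main__":
    # Toy sequence, not real genome data.
    print(masking_summary("ACGTacgtNNNN"))
```

Running the same function over a chromosome from the dna, dna_sm and dna_rm files should make the difference between them obvious.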

You may have guessed - I work for Ensembl, but just because I'm biased, it doesn't mean our database/website isn't awesome.

Emily

0

Hi Emily.

Thanks for the help. I have no doubt the info you provided will prove beneficial. I have heard of REST but I don't know anything about it (currently). Thankfully, my course project isn't due until early December of this year, so I should have ample time to familiarize myself with the API and the Ensembl resources you provided links to!

Caitlin

0

Hi Caitlin

Our REST service lets you programme in another language but still access our API. We do this using simple URLs which generate data in an easily readable (by a computer) format. For example, try this URL: http://beta.rest.ensembl.org/feature/id/ENSG00000157764?feature=gene;content-type=application/json

You can see that it gives you a bunch of data in text format (it does this by accessing the Perl API). You can write code in any language you like to first generate that URL then read that data string, extract the bits of data you're interested in and display them in the form you like.
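The "generate the URL, read the string, pull out what you want" steps might look roughly like this in Python; the same shape carries over to Perl or Java. The base URL and endpoint are copied from the example URL above and may well change while the service is in beta. The field names (`ID`, `start`, `end`) and the sample response are hypothetical, so check what the service actually returns before relying on them.

```python
import json
import urllib.request

# Taken from the example URL in this thread; treat as an assumption.
BASE = "http://beta.rest.ensembl.org"


def feature_url(stable_id: str, feature: str = "gene") -> str:
    """Build the feature-lookup URL for one stable ID."""
    return f"{BASE}/feature/id/{stable_id}?feature={feature};content-type=application/json"


def fetch(url: str) -> str:
    """Download the response body as text (needs network access)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def summarise(json_text: str, keys=("ID", "start", "end")) -> list:
    """Keep only the chosen keys from each record in a JSON list.

    The default key names are hypothetical; missing keys come back as None.
    """
    records = json.loads(json_text)
    return [{key: record.get(key) for key in keys} for record in records]


# Hypothetical response body with placeholder values, for offline testing only.
SAMPLE = '[{"ID": "GENE_A", "start": 100, "end": 900, "strand": -1}]'

if __name__ == "__main__":
    print(feature_url("ENSG00000157764"))
    print(summarise(SAMPLE))
    # With network access: print(summarise(fetch(feature_url("ENSG00000157764"))))
```

Keeping the download separate from the parsing means you can test the parsing on a saved response without hitting the server every time.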

It's still in beta and there are a limited number of endpoints (unlike the Perl API which will allow you to extract every bit of data in our database), but it's still pretty cool.

Emily

4
11.3 years ago

1) both

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/

2) yes for whatever-your-language-is. Search biostars.org for 'biomart' or 'ucsc mysql'

3) Start from PubMed and find the related sequences, e.g. http://www.ncbi.nlm.nih.gov/nuccore?LinkName=pubmed_nuccore&from_uid=8533757 but I'm afraid there is no way to tell whether the patient was affected or not.
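On 1), whether you download one big file or one file per chromosome makes little difference to your code, because a FASTA file is just a series of ">" header lines each followed by sequence lines. A minimal reader sketch in Python (the function name and toy data are made up for illustration) that handles both cases and keeps only one record in memory at a time:

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from an open FASTA file.

    The name is the first word after '>'; sequence lines are concatenated.
    """
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            words = line[1:].split()
            name = words[0] if words else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Toy two-record file, not real data.
DEMO = ">chr1 toy record\nACGT\nAC\n>chrM\nGG\n"

if __name__ == "__main__":
    for record_name, sequence in read_fasta(io.StringIO(DEMO)):
        print(record_name, len(sequence))
```

For a real file you would pass `open(path)` instead of the `io.StringIO` wrapper. The downloads are usually gzipped, so `gzip.open(path, "rt")` is the likely replacement there.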

1

You can also use the COSMIC database (http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/), which catalogs somatic variants in cancer. This may be of use for your cancer-related question. For familial cancers you will find at least some of the possible germline mutations in OMIM.

0

Thank you very much for the help, Dr. Lindenbaum!

1

Just a quick comment: you can also get human genes from the UCSC Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables). If you click "get output" with the default settings, you get human genes. Or, if you want a fun Perl exercise, you can parse the genes out of this file ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/ ... look up the GTF format to get a handle on it.
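For anyone trying that exercise: GTF is nine tab-separated columns (sequence name, source, feature, start, end, score, strand, frame, attributes), and the last column holds `key "value";` pairs. A rough Python sketch of the parsing follows; the logic ports directly to Perl. It assumes the attributes include `gene_id` and `gene_name`. It also derives each gene's span from all of its features rather than relying on dedicated gene lines, since I have not checked whether that release has them. Splitting attributes on ";" is a simplification that would break on values containing a semicolon. The demo lines are invented.

```python
def parse_attributes(field: str) -> dict:
    """Turn 'gene_id "G1"; gene_name "X";' into {'gene_id': 'G1', 'gene_name': 'X'}."""
    attrs = {}
    for part in field.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def genes_from_gtf(lines) -> dict:
    """Collect one record per gene_id, spanning all of that gene's features."""
    genes = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip malformed lines
        chrom, _source, _feature, start, end, _score, strand, _frame, attr_field = cols
        attrs = parse_attributes(attr_field)
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        start, end = int(start), int(end)
        record = genes.get(gene_id)
        if record is None:
            genes[gene_id] = {"name": attrs.get("gene_name"), "chrom": chrom,
                              "strand": strand, "start": start, "end": end}
        else:
            record["start"] = min(record["start"], start)
            record["end"] = max(record["end"], end)
    return genes


# Invented demo lines in GTF layout, not taken from the real file.
DEMO = [
    '1\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "DEMO1";\n',
    '1\tprotein_coding\texon\t300\t450\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "DEMO1";\n',
]

if __name__ == "__main__":
    print(genes_from_gtf(DEMO))
```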

2
11.3 years ago
Michael 55k

Hi, welcome to BioStar.

With respect to 3): the mutations in the BRCA genes were, as far as I remember, part of a partially invalidated Myriad patent; see "The Myriad ruling - What do gene patents now mean to bioinformatics?". I haven't checked, but the sequences of the variants should be in the patent application. You could apply detection of the variants described in their genetic test as long as you do not generate cDNA (covered by the patent) but search in genomic DNA sequences. Myriad's opponent in that case had developed and offered such a genetic test, so the data for the causal variants should exist, and searching the patent archives might reveal them.

0

Thanks Michael.

I didn't know there was a patent issue, but I will certainly check that link out.

2
11.3 years ago

You can find all sorts of mutations from cancer on the TCGA Data Portal. They won't come as FASTA files, but the portal will provide the exact coordinates of the base change(s). If you needed, for some reason, to introduce them into your sequence, it would be easy enough to do with a little script (and maybe a good exercise for someone brand-new to bioinformatics). Best of luck.
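A minimal sketch of such a "little script" in Python, under some loud assumptions. Positions are 1-based and relative to the start of the sequence you pass in, so genomic coordinates from a portal would first need the gene's offset subtracted and the strand checked. Only single-base substitutions are handled, since insertions and deletions would shift later positions. The function name and the toy data are invented for illustration.

```python
def apply_snvs(reference: str, snvs) -> str:
    """Return a copy of `reference` with single-base substitutions applied.

    `snvs` is an iterable of (position, ref_base, alt_base) with 1-based
    positions. The stated ref_base is checked against the sequence, so a
    wrong coordinate system or assembly fails loudly instead of silently.
    """
    seq = list(reference)
    for pos, ref, alt in snvs:
        index = pos - 1
        if not 0 <= index < len(seq):
            raise ValueError(f"position {pos} is outside the sequence")
        if seq[index].upper() != ref.upper():
            raise ValueError(
                f"expected {ref} at position {pos}, found {seq[index]}"
            )
        seq[index] = alt
    return "".join(seq)


if __name__ == "__main__":
    # Toy data, not real variants.
    print(apply_snvs("ACGTACGT", [(2, "C", "T"), (8, "T", "A")]))
```

The reference-base check is the part worth keeping even in a throwaway script, since off-by-one and wrong-assembly mistakes are the usual way this goes wrong.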

0

Thanks Chris. I will definitely check that site out.
