Question

Genome Annotation

5

13.6 years ago

Charsonic_Wu ▴ 30

Hi all,

I am completely new to sequencing. I am a computer science student but I am working on a bioinformatics project on whole genome functional annotation.

My data is in csfasta format. How do I change this to fasta format? I am also very confused: what is the difference between the F3.csfasta file and the F5.csfasta file?

Additionally, I have been told that the data is in clc format. What does this mean?

How do I go about doing a whole genome annotation? Does anyone know of any good tools to do whole genome functional annotations?

I am extremely desperate and very very confused. Any information would be very much appreciated.

Thank you.

function genome • 5.6k views

ADD COMMENT • link updated 13.6 years ago by Barry ▴ 40 • written 13.6 years ago by Charsonic_Wu ▴ 30

Ram · Answer 1 · 2011-09-22

3

13.6 years ago

Carson ▴ 30

I can't help with the csfasta conversion, but I can with the annotation portion. There are basically two types of annotation that you might be referring to: de novo annotation or variant annotation. I'll try to describe both.

If this is a newly sequenced organism and you are doing de novo annotation (i.e., no existing reference genome), you can use MAKER for structural annotation, and MAKER together with InterProScan for functional annotation. Also look at gmod.org for other annotation tools from the Generic Model Organism Database project.

If this is a human genome (or an organism with an existing reference genome), and you want to annotate functional variants, use BWA to align to the reference, then GATK or samtools to identify variants (SNPs and indels). Then use VAAST or ANNOVAR to classify and prioritize the variants.

ADD COMMENT • link 13.6 years ago by Carson ▴ 30

1

FYI: There are two workshops on MAKER in the next month or so:

Sept 28-30, Genome Annotation course at UC Davis http://gmod.org/wiki/News/UC_Davis_Courses_this_September

Oct 14 at OICR in Toronto: http://gmod.org/wiki/October_2011_GMOD_Meeting#Scheduled_Satellite_Meetings

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 13.6 years ago by Dave Clements ▴ 610

0

+1 for MAKER - makes life easy!

ADD REPLY • link 13.6 years ago by Yannick Wurm ★ 2.5k

Ram · Answer 2 · 2011-09-23

Also, to follow up on Carson's reply: if this is ABI data for a novel genome and you're hoping to annotate the genome, you'll need to assemble it somehow first. There are plenty of tools out there for this sort of task, and which one you choose will depend on a number of factors. Google will lead you to plenty of discussion - I'd have a look at ABySS (http://www.bcgsc.ca/platform/bioinfo/software/abyss) and then read a few threads like this (http://seqanswers.com/forums/archive/index.php/t-1424.html) to get a flavor for some of the issues involved. Coming from CS you'll feel right at home with all the technical details of the de Bruijn graphs and Eulerian paths involved in these tools - it's fun stuff!
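To give a rough idea of the graph structure these assemblers are built on, here is a minimal sketch of building a de Bruijn graph from reads. It is a toy illustration only, not how ABySS or any real assembler is implemented: the function name `de_bruijn_edges` and the example reads are made up, and it ignores sequencing errors, reverse complements and colour space entirely.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read.

    Every k-mer in a read contributes one edge: prefix (k-1)-mer -> suffix (k-1)-mer.
    Repeated k-mers produce repeated edges, which is a crude measure of coverage.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping toy reads (hypothetical data).
graph = de_bruijn_edges(["ACGTC", "CGTCA"], k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

An assembler then looks for a path through this graph that uses the edges (an Eulerian path in the idealised case) and spells out the contig along the way.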

score 2 · Answer 3 · 2011-09-22

2

13.6 years ago

Mdeng ▴ 530

Hey,

if you have data where the filename ends in "_F3.csfasta", there should be a corresponding "_F3.qual" file. Both files together are your reads as they come off the sequencer. Now, depending on which sequencing platform has been used, you have to create or apply your pipeline. If you are working on a whole genome project, the data should be whole-genome sequence. The F3 infix marks the forward read tag: a run with only F3 files is single-end (fragment) data, while a second tag means paired reads, R3 for mate-pair libraries and F5 for paired-end libraries.
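As a small illustration of how the two files belong together, here is a sketch that walks a .csfasta and a .qual stream in parallel and pairs each read with its quality values. It assumes the usual layout (optional "#" comment lines, a ">" header, then one line of colours or space-separated integer qualities per read, in the same order in both files); the function names and the example record are made up for illustration.

```python
import io


def read_records(handle):
    """Yield (name, body) pairs from a FASTA-like handle, skipping '#' comment lines.

    Assumes each record is a '>' header followed by a single body line.
    """
    name = None
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            name = line[1:]
        elif name is not None:
            yield name, line
            name = None


def pair_reads(csfasta, qual):
    """Yield (name, colours, qualities), checking that both files list the same reads."""
    for (name, colours), (qname, quals) in zip(read_records(csfasta), read_records(qual)):
        if name != qname:
            raise ValueError(f"read order differs: {name} vs {qname}")
        yield name, colours, [int(q) for q in quals.split()]


# Hypothetical one-read example; real files would be opened with open(...).
csfasta = io.StringIO("# example header comment\n>1_10_20_F3\nT0123\n")
qual = io.StringIO("# example header comment\n>1_10_20_F3\n20 18 25 7\n")
records = list(pair_reads(csfasta, qual))
print(records)
```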

First of all you should search for a pipeline that matches your read type (single-end or paired), your sequencing platform and whole-genome sequencing. You will find some ;)

So the steps would be:

1. Map your data to a reference (search for "hg18" or "hg19", the human genome; 19 is newer) using, for example, BWA.
2. Call your SNPs with GATK or samtools.
3. Annotate your SNPs. Like the mapping, this is a science in itself. At the moment I am using NGS-SNP.

These are the real basic steps.
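The mapping and SNP-calling steps can be sketched as a short list of shell commands, built here in Python so they can be inspected before anything is run. This is only an outline under several assumptions: it uses present-day bwa, samtools and bcftools invocations rather than the 2011 ones, it assumes the reads are already base-space FASTQ (recent BWA releases do not handle colour-space reads), and the file names and the helper `pipeline_commands` are hypothetical. The annotation step is left out because it depends on the tool you pick.

```python
def pipeline_commands(reference, reads, prefix):
    """Return shell command strings for a basic map-and-call workflow.

    Nothing is executed here; the strings are meant to be reviewed and then
    run by hand or through a workflow tool.
    """
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    vcf = f"{prefix}.vcf"
    return [
        f"bwa index {reference}",                     # build the reference index
        f"bwa mem {reference} {reads} > {sam}",       # map reads to the reference
        f"samtools sort -o {bam} {sam}",              # sort alignments by coordinate
        f"samtools index {bam}",                      # index the sorted BAM
        f"bcftools mpileup -f {reference} {bam} | bcftools call -mv -o {vcf}",  # call variants
    ]


# Hypothetical file names.
commands = pipeline_commands("hg19.fa", "reads.fastq", "sample")
for command in commands:
    print(command)
```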

ADD COMMENT • link 13.6 years ago by Mdeng ▴ 530

1

The official names of the human reference genome assemblies are NCBI36 and GRCh37, respectively (NCBI36 = hg18, GRCh37 = hg19).

ADD REPLY • link 13.6 years ago by Bert Overduin ★ 3.7k

0

Thank you very very much. That makes things clearer^^

ADD REPLY • link 13.6 years ago by Charsonic_Wu ▴ 30

score 1 · Answer 4 · 2011-09-22

1

13.6 years ago

Rob Syme ▴ 540

I've not had to mess around with colour space data before, but I'm pretty sure that the instrument manufacturer, ABI, shares software to do that sort of conversion. The software is Corona-lite which can be downloaded from here.
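For a feel of what such a conversion involves, here is a minimal sketch of decoding one colour-space read into bases. It is not the Corona-lite implementation, just the standard two-base encoding: each colour (0-3) describes the transition from the previous base to the next, and the leading letter is the known primer base, which is not part of the read itself. The function name `decode_colourspace` is made up for illustration.

```python
BASES = "ACGT"  # with A=0, C=1, G=2, T=3, a colour is the XOR of two neighbouring base codes


def decode_colourspace(read):
    """Translate a csfasta read such as 'T0123' into base space.

    The first character is the primer base and is dropped from the output.
    A '.' (missing colour call) cannot be decoded and raises ValueError.
    """
    previous = read[0]
    decoded = []
    for colour in read[1:]:
        if colour not in "0123":
            raise ValueError(f"cannot decode colour {colour!r}")
        previous = BASES[BASES.index(previous) ^ int(colour)]
        decoded.append(previous)
    return "".join(decoded)


print(decode_colourspace("T0123"))  # prints TGAT
```

Note that a single miscalled colour changes every base after it when decoding this naively, which is why colour-space reads are usually aligned in colour space and only converted to bases afterwards.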