I am completely new to sequencing.
I am a computer science student but I am working on a bioinformatics project on whole genome functional annotation.
My data is in csfasta format.
How do I change this to fasta format?
I am also very confused: what is the difference between the F3.csfasta file and the F5.csfasta file?
Additionally, I have been told that the data is in clc format. What does this mean?
How do I go about doing a whole genome annotation?
Does anyone know of any good tools to do whole genome functional annotations?
I am extremely desperate and very very confused. Any information would be very much appreciated.
I can't help with the csfasta conversion, but I can with the annotation portion. There are basically two types of annotation that you might be referring to: de novo annotation or variant annotation. I'll try to describe both.
If this is a newly sequenced organism and you are doing de novo annotation (i.e., no existing reference genome), you can use MAKER for structural annotation, and MAKER together with InterProScan for functional annotation. Also look at gmod.org for other annotation tools from the Generic Model Organism Database project.
If this is a human genome (or an organism with an existing reference genome), and you want to annotate functional variants, use BWA to align to the reference, then GATK or samtools to identify variants (SNPs and indels). Then use VAAST or ANNOVAR to classify and prioritize the variants.
Also, to follow up on Carson's reply: if this is ABI data for a novel genome and you're hoping to annotate the genome, you'll need to assemble it somehow first. There are plenty of tools out there for this sort of task, and which one you choose will depend on a number of factors. Google will lead you to plenty of discussion - I'd have a look at ABySS (http://www.bcgsc.ca/platform/bioinfo/software/abyss) and then read a few threads like this (http://seqanswers.com/forums/archive/index.php/t-1424.html) to get a flavor for some of the issues involved. Coming from CS you'll feel right at home with all the technical details of the de Bruijn and Euler graphs involved in these tools - it's fun stuff!
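To give a flavor of the de Bruijn graph idea mentioned above, here is a minimal Python sketch. It is only an illustration, not how ABySS or any other assembler is actually implemented: the function name `build_de_bruijn` and the toy reads are made up, and it ignores sequencing errors, reverse complements and colour-space encoding, all of which real tools have to deal with.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return a de Bruijn graph as {(k-1)-mer: [successor (k-1)-mers]}."""
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of length k over the read; each k-mer is one edge
        # from its (k-1)-base prefix to its (k-1)-base suffix.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Toy reads, invented for illustration only.
reads = ["ACGTC", "CGTCA"]
graph = build_de_bruijn(reads, k=3)
for node in sorted(graph):
    print(node, "->", graph[node])
```

An assembler would then look for paths through this graph that use every edge (an Eulerian path), which is where the graph theory comes in.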
updated 5.2 years ago by Ram • written 13.2 years ago by Barry
If you have data where the filename looks like "_F3.csfasta", there should be a corresponding "_F3.qual" file. Both files together are your reads, as they come out of the sequencer. Now, depending on which sequencing platform has been used, you have to create or apply your pipeline.
If you are working on a whole genome project, the data should be whole genome sequencing data.
The F3 infix marks the forward read of each fragment. If F3 is all you have, these are single-end reads; paired data comes with a second file, tagged R3 for mate-pair libraries or F5 for paired-end libraries.
First of all, you should search for a pipeline that matches your read type, your sequencing platform and whole genome sequencing. You will find some ;)
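As a rough sketch of how a csfasta file and its qual file line up, the Python below reads both and joins the records by read ID. It assumes the usual layout (comment lines starting with "#", a ">" header line, then one line of data per read); the function names and the example read IDs and values are made up for illustration. It also keeps all quality records in memory, which is fine for a quick look but not for a full run.

```python
def read_records(lines):
    """Yield (read_id, payload) pairs from csfasta- or qual-style lines."""
    read_id = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        if line.startswith(">"):
            read_id = line[1:]
        elif read_id is not None:
            yield read_id, line
            read_id = None


def pair_reads(csfasta_lines, qual_lines):
    """Return a list of (read_id, colour_string, quality_values) tuples."""
    quals = {rid: [int(q) for q in payload.split()]
             for rid, payload in read_records(qual_lines)}
    paired = []
    for rid, colours in read_records(csfasta_lines):
        if rid not in quals:
            raise KeyError(f"no quality record for read {rid}")
        paired.append((rid, colours, quals[rid]))
    return paired


# Invented example data; real files would be opened with open(path).
csfasta = ["# example header", ">1_5_30_F3", "T0120", ">1_5_41_F3", "T3312"]
qual = ["# example header", ">1_5_30_F3", "20 18 25 7", ">1_5_41_F3", "30 -1 12 9"]

paired = pair_reads(csfasta, qual)
for rid, colours, q in paired:
    print(rid, colours, q)
```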
So the steps would be:
Map your data to a reference (search for "hg18" or "hg19", the human genome builds; 19 is newer), for example using BWA
Call your SNPs with GATK or samtools
Annotate your SNPs. Like the mapping, this is a science in itself. ATM I am using NGS-SNP.
I've not had to mess around with colour space data before, but I'm pretty sure that the instrument manufacturer, ABI, shares software to do that sort of conversion. The software is Corona-lite, which can be downloaded from here.
You'll need to register with ABI, but I think it's free.
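For a feel of what such a conversion involves, here is a minimal Python sketch of decoding one colour-space read into bases. This is not Corona-lite and the names `TRANSITIONS` and `decode` are made up; it only applies the standard two-base colour code, where each digit says how to get from the previous base to the next one, starting from the leading primer base. Be aware that a single wrong colour call corrupts every base after it, which is why colour-space reads are usually aligned in colour space and translated afterwards rather than converted up front.

```python
# Colour code: 0 keeps the base, 1 swaps A<->C and G<->T,
# 2 swaps A<->G and C<->T, 3 swaps A<->T and C<->G.
# TRANSITIONS[base][colour] gives the next base.
TRANSITIONS = {
    "A": "ACGT",
    "C": "CATG",
    "G": "GTAC",
    "T": "TGCA",
}


def decode(read):
    """Translate a csfasta read such as 'T0120' into bases.

    The first character is the primer base; it seeds the decoding
    but is not part of the output.
    """
    base = read[0]
    bases = []
    for colour in read[1:]:
        if colour not in "0123":
            # Anything else (for example a missing call) cannot be decoded.
            raise ValueError(f"cannot decode colour {colour!r}")
        base = TRANSITIONS[base][int(colour)]
        bases.append(base)
    return "".join(bases)


# Invented example read.
print(">example_read")
print(decode("T0120"))
```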
FYI: There are two workshops on MAKER in the next month or so:
Sept 28-30, Genome Annotation course at UC Davis http://gmod.org/wiki/News/UC_Davis_Courses_this_September
Oct 14 at OICR in Toronto: http://gmod.org/wiki/October_2011_GMOD_Meeting#Scheduled_Satellite_Meetings
+1 for MAKER - makes life easy!