I have done WES for various vertebrate species and have obtained contigs after running velveth and velvetg with various hash lengths, getting good N50 values. My question now is: if I want to try assembling the contigs into a genome without using a reference, what tool is available for that for non-mammalian vertebrates? I want to be able to annotate all the assembled genomes afterwards to extract all protein-coding genes. Can that step be done without doing anything further with the contigs (i.e. can I feed the contigs.fa files produced by Velvet straight into annotation)?
I would also like to try mapping the contigs to a reference genome with bwa mem, but I want to implement both approaches to see how the results vary.
I did WES. I wanted to use ALLPATHS-LG, but none of my servers are Linux-based and it was not installing properly. Velvet was working fine, so I don't see why it's a problem even if it's old?
You can't assemble WES data. Have a look in a genome browser at what WES data actually looks like: small covered regions around exons. Exons are not overlapping or close enough together to contain spatially relevant information.
And by the way, Linux is really non-optional for bioinformatics... time to reformat a server, or at least get a virtual machine, or apply for a cluster allocation elsewhere.
But exome data has to be assembled or aligned in some way to find variants (I'm not doing that here, but it is the most common workflow for exome data). Even if there are larger gaps, it should be possible to assemble and obtain exons, which could later be concatenated into coding sequences. That said, most tools that "assemble" exome sequence data use BAM files simply to view the results against a reference sequence and locate variants. My goal is to skip that last step and just obtain the sequences (with the variants included) so I can do other analyses gene by gene. The IGV consensus feature is not workable for me because I need to obtain all genes, not just one specific region of interest.
Mapping the sequencing data and calling SNPs and indels in relation to a reference genome is probably faster than assembling the exome, and is almost certainly more precise and less error-prone.
Assembling the exome will not give you coding genes; it will not even give you exons, because the parts of the introns at the exon boundaries will also be captured. Such an assembly will be a big mess of hundreds of thousands of contigs, with no way of knowing which contigs belong to the same gene except by mapping them to an annotated reference genome.
Okay, so now I am more uncertain of what to do because I have heard both sides. I have exome data. I can map the reads with bwa to a reference genome, but then I have to extract coding sequences. This works when my reference is just a coding sequence: the reads are mapped, and after mpileup I can pull out my sequence with no problem. It becomes a problem when I use the entire reference genome, because my scripts then fail to pull out individual sequences (they try to pull out one large sequence). That is why I also thought to try de novo methods to obtain contigs, then map those contigs in some way, or at least BLAST them to see where they belong on the reference genome.
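In case it helps, this is the kind of coordinate-based extraction the failing scripts would need against a whole-genome reference: a minimal sketch in plain Python, where the FASTA content, contig name, and coordinates are all made up for illustration.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict of {sequence_name: sequence}."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs

def extract_region(seqs, contig, start, end):
    """Return the subsequence for 1-based, inclusive coordinates."""
    return seqs[contig][start - 1:end]

# Toy example (made-up sequence)
genome = read_fasta([">chr1 example", "ACGTACGTAC", "GTACGT"])
print(extract_region(genome, "chr1", 3, 8))  # GTACGT
```

The same slicing would apply to a consensus FASTA produced from mpileup output, so each gene or exon is pulled out individually rather than as one large sequence.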
If I map my reads to the reference genome and get to the point of having mpileup output, perhaps there is a way to use the .gff file to pull out exon boundaries? I have not worked with that yet, but I am not sure how else to proceed, since unfortunately I cannot find tutorials for my specific task. I can view my BAM files in IGV and I can see the SNPs, but I need sequences (at least per exon, if not per coding sequence, because of the intergenic regions).
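The GFF idea is workable. A minimal sketch, assuming a consensus FASTA built from the mpileup output and a simple GFF3 file; the `Parent` attribute key is standard GFF3, but the transcript names, coordinates, and sequences here are invented toy data.

```python
import collections

def parse_gff_exons(lines):
    """Collect (contig, start, end, strand) per parent transcript from GFF3 exon rows."""
    exons = collections.defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "exon":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        parent = attrs.get("Parent", attrs.get("ID", "unknown"))
        exons[parent].append((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return exons

def spliced_sequence(exons, seqs):
    """Concatenate exon subsequences (1-based inclusive GFF coordinates)."""
    out = {}
    comp = str.maketrans("ACGTacgt", "TGCAtgca")
    for tx, parts in exons.items():
        parts.sort(key=lambda p: p[1])
        seq = "".join(seqs[c][s - 1:e] for c, s, e, _ in parts)
        if parts[0][3] == "-":  # reverse-complement minus-strand transcripts
            seq = seq.translate(comp)[::-1]
        out[tx] = seq
    return out

# Toy data (all names, coordinates, and sequences made up)
seqs = {"chr1": "AAACGTTTTGGGCCC"}
gff = [
    "chr1\tsrc\texon\t1\t3\t.\t+\t.\tParent=tx1",
    "chr1\tsrc\texon\t10\t12\t.\t+\t.\tParent=tx1",
]
print(spliced_sequence(parse_gff_exons(gff), seqs))  # {'tx1': 'AAAGGG'}
```

Run against the real annotation, this would give one spliced sequence per transcript, with the variants already baked into the consensus FASTA, which is exactly the per-gene output described above.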