Hi there,
I've been tasked with annotating genetic sequences for a work project. I have never done this before so I have been looking into different methods, and I'm feeling a bit overwhelmed! I'm hoping someone here can help point me in the right direction.
In other posts about annotation, I've seen recommendations for Galaxy, BLAST, Ensembl, AUGUSTUS, and more. I've tried all of these programs but have not yet been successful.
I have around 1,000 regions whose sequences belong to L1 subfamilies. The goal of this research is to see whether some sub-sequences/motifs are repeated more often than others. I started with a BED file of regions, retrieved the FASTA sequences with twoBitToFa, and ran multiple sequence alignments with Clustal Omega (clustalo). I now have a visualization that I am happy with, except that it is missing annotation information.
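As a rough illustration of that goal, the sketch below counts how many sequences each k-mer occurs in. It is not the alignment-based approach described above, only a simple alignment-free way to see which short sub-sequences are shared most widely. The function names, the minimal FASTA parser, and the choice of k are assumptions made for illustration.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Upper-case so soft-masked (lower-case) bases are counted too.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def kmer_sequence_counts(records, k=8):
    """For each k-mer, count the number of sequences that contain it at least once."""
    counts = Counter()
    for _, seq in records:
        # A set, so a k-mer repeated within one sequence is counted once.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmer for kmer in seen if "N" not in kmer)
    return counts


if __name__ == "__main__":
    # Toy input; with real data, pass an open FASTA file handle instead.
    demo = [">a", "ACGTACGT", ">b", "ttACGTAC", ">c", "GGGGGGGG"]
    for kmer, n in kmer_sequence_counts(read_fasta(demo), k=4).most_common(3):
        print(kmer, n)
```

With a real file, `kmer_sequence_counts(read_fasta(open("regions.fa")), k=8)` would give the same kind of table, where `regions.fa` is a placeholder file name.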
My visualization looks like this:
and I'd like to have bars across the top to represent areas that are protein coding, known motifs, etc. I'd like to end up with something similar to this figure, with annotated information above the sequence information:
I get errors when I use Ensembl's BioMart. I have the regions in a text file formatted like:
chr12:70406846:70412860:1,chr3:177106559:177112539:1,chr1:174812365:174818381:1
and I subset the data to fit within their suggested maximum of 500 sequences. BioMart works when I submit one sequence at a time, but I have far too many sequences for that to be practical.
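A minimal sketch of how region strings in the format shown above could be generated from a BED file and split into batches of at most 500. The function names and parameters are hypothetical. Two assumptions are worth checking against your own data: BED starts are 0-based while Ensembl coordinates are 1-based, so the sketch adds 1 to the start by default, and Ensembl usually names chromosomes without the "chr" prefix, so there is an option to strip it.

```python
def bed_to_regions(bed_lines, strip_chr=False, one_based=True):
    """Convert BED lines into 'chrom:start:end:strand' region strings."""
    regions = []
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if strip_chr and chrom.startswith("chr"):
            chrom = chrom[3:]  # assumption: target database uses "12", not "chr12"
        if one_based:
            start += 1  # BED is 0-based, half-open; the end stays unchanged
        # Column 6 of a BED file is the strand, when present.
        strand = -1 if len(fields) > 5 and fields[5] == "-" else 1
        regions.append(f"{chrom}:{start}:{end}:{strand}")
    return regions


def batches(items, size=500):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    bed = ["chr12\t70406845\t70412860", "chr3\t177106558\t177112539"]
    for batch in batches(bed_to_regions(bed), size=500):
        print(",".join(batch))  # one comma-separated line per batch
```

Each printed line is one batch that stays within the 500-region limit.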
Mostly, I am wondering which of these tools I should focus on learning in order to accomplish my goal. How do I find out which regions of my sequences are coding and which are non-coding?
Thank you!