Hello everyone,
Let's say I have a data set of 1000 genes in a FASTA file. Is there a way to BLAST them against de novo assembled, unannotated contigs and extract only those that are found complete (i.e. full length) and in 2 copies?
The problem with BLAST is that I only get hits, which often do not cover the full-length query gene because of its more variable parts. As a result, each subject gene is split across several hits. Since I want to see which of my 1000 genes are present in two copies in my contigs, I cannot check each of the 1000 BLAST tables by eye, sum the query cover for each subject gene ID and guess the number of copies. Besides, I want to extract those complete genes, and BLAST only returns a table of hits.
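To make the "sum the query cover per subject" step concrete, here is a minimal sketch of what I would otherwise do by eye. It assumes the BLAST results are in the default 12-column tabular format (-outfmt 6), that query lengths are supplied separately, and that each copy sits on its own contig (two copies on the same contig would be counted as one). The function names and the 0.95 coverage threshold are hypothetical choices for illustration.

```python
from collections import defaultdict


def merged_length(intervals):
    """Number of query positions covered by 1-based inclusive intervals,
    counting overlapping regions only once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def query_coverage(blast_lines, query_lengths):
    """Return {query: {contig: fraction of the query covered by all hits}}.

    blast_lines: tab-separated lines with the default 12 tabular columns
    (qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore). query_lengths: {qseqid: length} (assumed input).
    """
    spans = defaultdict(lambda: defaultdict(list))
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        qseqid, sseqid = fields[0], fields[1]
        qstart, qend = sorted((int(fields[6]), int(fields[7])))
        spans[qseqid][sseqid].append((qstart, qend))
    return {
        q: {s: merged_length(iv) / query_lengths[q] for s, iv in by_contig.items()}
        for q, by_contig in spans.items()
    }


def genes_with_n_copies(coverage, n_copies=2, min_cov=0.95):
    """Keep queries covered at >= min_cov on exactly n_copies contigs."""
    kept = {}
    for q, by_contig in coverage.items():
        full = sorted(s for s, cov in by_contig.items() if cov >= min_cov)
        if len(full) == n_copies:
            kept[q] = full
    return kept
```

Usage would be something like `genes_with_n_copies(query_coverage(open("hits.tsv"), lengths))`, where `hits.tsv` and `lengths` are placeholders for the BLAST table and a dictionary of query lengths.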
Thanks for your suggestions!
Right, I'm working on a diploid plant. I assume there are two copies because it is a first-generation interspecific hybrid, so reads from the two parental chromosomes should not merge, as the hybrid's parents are slightly different. I did not perform the assembly myself, but I think Velvet was used to build the contigs.
I also thought I could annotate all the contigs and then look for my 1000 genes, but annotation jobs (e.g. EuGene, Augustus, ...) take a lot of memory on the bioinfo cluster. Since I'm not interested in the expected 30,000 genes that make up the genome, I would prefer a tool that only looks for my 1000-gene set and extracts the 2 copies. Besides, the assembly may not be of good quality, so I don't think annotating it would be a good idea.
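For the extraction part, the kind of thing I have in mind is sketched below, without any annotation step. It assumes the subject coordinates (sstart, send) of all hits for one gene on one contig are already grouped together, and it simply cuts the contig from the leftmost to the rightmost hit position, so any sequence lying between hits (introns, variable parts) is included. Hits with sstart > send are treated as minus-strand hits. The function names are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA lines into {sequence_id: sequence}; the id is the first
    word after '>'."""
    parts, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


_COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_copy(contig_seq, subject_hits):
    """Cut the region spanned by all hits of one gene on one contig.

    subject_hits: list of (sstart, send) pairs, 1-based inclusive.
    The region is reverse complemented when most hits are on the minus
    strand (sstart > send), so it reads in the query orientation.
    """
    coords = [c for pair in subject_hits for c in pair]
    low, high = min(coords), max(coords)
    region = contig_seq[low - 1:high]
    minus_hits = sum(1 for sstart, send in subject_hits if sstart > send)
    if minus_hits > len(subject_hits) / 2:
        region = reverse_complement(region)
    return region
```

Each gene found in two copies would then give two extracted regions, one per contig, which could be written back out as FASTA records.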