Dear all, I ran BLAST with ~200 gene sequences from a reference organism against a draft genome dataset from a related species (12 million short reads (sr), each 88 bp long). The following exemplifies the BLAST result.
----------length, strand +/-, e-value omitted; some numbers are wrong---------------
sr-ID    GeneID  Iden.  Start(sr)  End(sr)  Start(gene)  End(gene)
Sxxx001  Gene1   100    1          36       2302         2334
Sxxx002  Gene1   98     1          75       313          348
Sxxx004  Gene1   100    3          43       481          519
Sxxx001  Gene2   100    8          78       2140         2172
Sxxx006  Gene2   97     2          88       280          312
Sxxx007  Gene3   100    1          56       862          897
Sxxx008  Gene3   100    6          78       2020         2055
Sxxx009  Gene3   100    5          77       3934         3972
I can get each short read sequence ranging from Start(sr) to End(sr). However, the next step is troublesome: I need to assemble the short reads into longer sequences according to the corresponding gene hit, e.g., assemble Sxxx001, Sxxx002 and Sxxx004 according to Gene1, assemble Sxxx001 and Sxxx006 according to Gene2, etc.
This analysis aims to identify the homologous genes in the draft genome data. Could anyone describe ways to assemble sequences according to the BLAST result?
THANK YOU in advance!!
If I get you right, your hits do not overlap, i.e. for Gene1
this means it looks like:
so there is no overlap between your hits, am I wrong?
Hi Phil, sorry for my unclear description. It's not my real result, just an example, so you need not pay attention to the start/end positions. These short reads can overlap. What I need to do is assemble these reads according to the corresponding genes. The following is my revised table (x indicates a number)
In my original post, I was wrong to talk about extracting the reads from start to end. I just need to assemble these reads according to the gene hits. Maybe use a de novo approach? I would much appreciate your suggestions. THANKS.
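For what it's worth, the grouping step itself (collecting the reads that hit each gene) can be sketched in a few lines of Python. This is only a rough sketch under assumptions: it expects whitespace-separated rows with the read ID in the first column and the gene ID in the second, as in the example table above, and the function names `group_reads_by_gene` and `fasta_for_gene` are made up for illustration. It does not do any assembly.

```python
from collections import defaultdict


def group_reads_by_gene(blast_lines):
    """Map each gene ID to the read IDs that hit it (input order, no duplicates)."""
    reads_by_gene = defaultdict(list)
    for line in blast_lines:
        fields = line.split()
        # Skip blank lines and the header row of the example table.
        if len(fields) < 2 or fields[0] == "sr-ID":
            continue
        read_id, gene_id = fields[0], fields[1]
        if read_id not in reads_by_gene[gene_id]:
            reads_by_gene[gene_id].append(read_id)
    return dict(reads_by_gene)


def fasta_for_gene(read_ids, sequences):
    """Build FASTA text for one gene from a {read_id: full_read_sequence} dict.

    Reads missing from `sequences` are silently skipped (an assumption of this sketch).
    """
    return "".join(
        f">{rid}\n{sequences[rid]}\n" for rid in read_ids if rid in sequences
    )


# The example rows from the post (only the first two columns are used).
example = """sr-ID GeneID Iden. Start(sr) End(sr) Start(gene) End(gene)
Sxxx001 Gene1 100 1 36 2302 2334
Sxxx002 Gene1 98 1 75 313 348
Sxxx004 Gene1 100 3 43 481 519
Sxxx001 Gene2 100 8 78 2140 2172
Sxxx006 Gene2 97 2 88 280 312
Sxxx007 Gene3 100 1 56 862 897
Sxxx008 Gene3 100 6 78 2020 2055
Sxxx009 Gene3 100 5 77 3934 3972
""".splitlines()

groups = group_reads_by_gene(example)
for gene, reads in groups.items():
    print(gene, reads)
```

Each gene's read list could then be written out as its own FASTA file (using the full 88 bp reads rather than only the aligned part) and handed to a de novo assembler of your choice, one gene at a time. Note that a read such as Sxxx001 can legitimately end up in more than one group.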