Hi,
I did a genome-guided assembly using StringTie, and it generated multiple isoforms per locus. Can you please suggest how I can map these transcripts back to the genome, so that I keep only a single, accurately assembled transcript for each locus?
Thanks
From your comments, I think what you want is redundancy removal. This can be done in two ways:
1) Without a reference, using Vmatch, CD-HIT or UCLUST: combine the de novo and genome-guided assembly transcripts and keep only the longest sequence from the superset, removing sequences that are completely contained in it.
2) With the reference: map the transcripts with a split-aware (spliced) mapper and, where transcripts in a region are overlapping subsets of one another, keep only the longest one.
1) "Take only the longest from the superset, removing complete overlapping regions": please suggest whether any tool is available for this. Do you think CD-HIT is useful here?
2) Use the reference to map the transcripts with a split-aware mapper and keep only the longest one in the region when they are overlapping subsets.
What about the second step? How can I do it?
1) Vmatch, CD-HIT and UCLUST are all tools that can cluster the sequences and keep the longest one. The commands for Vmatch are as follows (these are 2 years old, so I am not sure whether anything has changed) -
2) You can use GMAP or bwa-mem to map the sequences to the genome at high identity. Then use bedtools (cluster) or the Kent utilities (bedRemoveOverlap) to remove transcripts that are subsets of, or completely overlapped by, a longer one.
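To make the "keep only the longest one per region" idea in step 2 concrete, here is a minimal Python sketch. It is not what bedtools or bedRemoveOverlap do internally; it only illustrates the logic, assuming the mapped transcripts have already been converted to BED-style records (chrom, start, end, name, strand; 0-based, half-open). The function name longest_per_locus and the example records are made up for illustration, and "longest" here means the longest genomic span, which may not match the longest spliced length.

```python
from itertools import groupby


def longest_per_locus(records):
    """Keep one record per cluster of overlapping intervals.

    records: iterable of (chrom, start, end, name, strand) tuples,
    BED-style coordinates (0-based start, exclusive end).
    Intervals on the same chromosome and strand are chained into a
    cluster whenever they overlap; the record with the longest
    genomic span in each cluster is kept.
    """
    ordered = sorted(records, key=lambda r: (r[0], r[4], r[1], r[2]))
    kept = []
    for _, group in groupby(ordered, key=lambda r: (r[0], r[4])):
        best = None          # longest record seen in the current cluster
        cluster_end = None   # rightmost end coordinate of the current cluster
        for rec in group:
            if best is None or rec[1] >= cluster_end:
                # No overlap with the current cluster: close it, start a new one.
                if best is not None:
                    kept.append(best)
                best = rec
                cluster_end = rec[2]
            else:
                # Overlaps the current cluster: extend it, keep the longer record.
                cluster_end = max(cluster_end, rec[2])
                if rec[2] - rec[1] > best[2] - best[1]:
                    best = rec
        if best is not None:
            kept.append(best)
    return kept


if __name__ == "__main__":
    # Hypothetical example records, not real data.
    example = [
        ("chr1", 100, 500, "t1", "+"),
        ("chr1", 150, 400, "t2", "+"),
        ("chr1", 1000, 1200, "t3", "+"),
    ]
    for rec in longest_per_locus(example):
        print("\t".join(str(field) for field in rec))
```

Note that the clustering is by chaining: if A overlaps B and B overlaps C, all three end up in one locus even if A and C do not overlap, so it is worth checking the clusters before discarding isoforms.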