Question

How to map genome-guided assembly transcripts to the genome and extract the longest one for each locus

8.4 years ago

Bioinfonext ▴ 470

Hi,

I did a genome-guided assembly using StringTie, and it generated multiple isoforms. Can you please suggest how I can map these transcripts back to the genome so that I keep only a single, accurately assembled transcript for each locus?

Thanks

RNA-Seq • 1.9k views

Updated 8.4 years ago by Rohit ★ 1.5k • written 8.4 years ago by Bioinfonext ▴ 470

score 0 · Answer 1 · 2017-04-04

8.4 years ago

Rohit ★ 1.5k

From your comments, I think what you want is redundancy removal. This can be done in two ways:

1) Without the reference, using Vmatch, CD-HIT or UCLUST: combine the de novo and genome-guided assembly transcripts, then keep only the longest sequences from the superset, removing those that are completely contained in a longer one.

2) With the reference: map the transcripts to the genome with a splice-aware mapper and, where transcripts in a region are overlapping subsets of one another, keep only the longest one.
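To make the idea behind option 1 concrete, here is a minimal sketch that keeps the longest sequences and drops any sequence that is an exact substring of a longer one. It is not what Vmatch, CD-HIT or UCLUST do internally: it only handles 100% identical, forward-strand containment, it is quadratic in the number of sequences, and it assumes FASTA headers contain no tab characters. The function name `drop_contained` is hypothetical.

```shell
# Hypothetical helper: reads FASTA on stdin, writes non-redundant FASTA on stdout.
# Assumption: only exact, forward-strand containment counts as redundancy.
drop_contained() {
  awk '
    # Linearise each record to: length <TAB> name <TAB> sequence
    /^>/ { if (name != "") print length(seq) "\t" name "\t" seq
           name = substr($0, 2); seq = ""; next }
    { seq = seq toupper($0) }                     # join wrapped sequence lines
    END  { if (name != "") print length(seq) "\t" name "\t" seq }
  ' |
  LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2 |   # longest sequences first
  awk -F '\t' '
    {
      for (i = 1; i <= n; i++)
        if (index(kept[i], $3) > 0) next          # contained in a kept sequence
      kept[++n] = $3
      print ">" $2 "\n" $3
    }
  '
}
```

It could be run as `drop_contained < all_transcripts.fa > nr_transcripts.fa`, where both file names are placeholders. For real assemblies a dedicated clustering tool is the better choice, since it also handles reverse-complement and near-identical matches.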

8.4 years ago by Rohit ★ 1.5k

1) "Take only the longest from the superset, removing complete overlapping regions": can you please suggest a tool for this? Do you think CD-HIT is useful here?

2) "Use the reference to map the transcripts with a splice-aware mapper and keep only the longest one in the region when they are overlapping subsets."

How can I do this second step?

8.4 years ago by Bioinfonext ▴ 470

1) Vmatch, CD-HIT and UCLUST are all tools that can keep the longest sequences. The Vmatch commands are as follows (they are two years old, so I am not sure whether anything has changed):

mkvtree -allout -pl -db sequences.fasta -dna -indexname dbname 
vmatch -d -p -dbcluster 100 0 -v -nonredundant nr_sequences.fa dbname

2) You can use GMAP or BWA-MEM to map the sequences at high identity. Then use bedtools (cluster) or the Kent utilities (bedRemoveOverlap) to remove transcripts that are subsets of, or completely overlapped by, a longer one.
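As a rough illustration of the second step, the sketch below takes the mapped transcripts as 6-column BED (for example, converted from the alignments with `bedtools bamtobed`), groups intervals that overlap on the same chromosome and strand, and prints the longest interval of each group. This is an assumption-laden stand-in, not the behaviour of `bedtools cluster` or `bedRemoveOverlap`: the function name `longest_per_locus` is hypothetical, "longest" means the genomic span (end minus start) rather than the spliced transcript length, and the input is assumed to be tab-separated with no blank lines.

```shell
# Hypothetical helper: reads BED6 on stdin, prints one (longest) record per locus.
# Assumption: a locus is a run of overlapping intervals on one chromosome and strand.
longest_per_locus() {
  LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k6,6 -k2,2n -k3,3n |
  awk -F '\t' '
    # Start a new cluster when the chromosome or strand changes, or when the
    # interval begins at or after the end of the current cluster.
    NR == 1 || $1 != chrom || $6 != strand || $2 + 0 >= cend {
      if (NR > 1) print best
      chrom = $1; strand = $6; cend = $3 + 0
      best = $0; blen = $3 - $2
      next
    }
    {
      if ($3 + 0 > cend) cend = $3 + 0                  # extend the cluster
      if ($3 - $2 > blen) { best = $0; blen = $3 - $2 } # longer span wins
    }
    END { if (NR > 0) print best }
  '
}
```

It could be run as `longest_per_locus < mapped_transcripts.bed > longest_per_locus.bed`, with both file names being placeholders. Because clustering is by simple overlap, a long transcript that bridges two neighbouring genes will merge them into one locus, so the output is worth checking by eye.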

8.4 years ago by Rohit ★ 1.5k