Question

How Tophat Treats Additional Contigs

10.9 years ago

GR ▴ 400

Dear Group,

I have some additional contigs along with the chromosomes in my reference genome file. These contigs are parts of the chromosomes, and their genomic locations are known. They are kept in the reference file by an algorithm (I won't go into the details). My reference file looks something like this: >Chr1 >Chr2 >Chr3 >Chr1:1000 >Chr1:2000 (where >Chr1:1000 and >Chr1:2000 are the additional contigs, and 1000 and 2000 are the known locations of these contigs).

In my GTF file I have information only for Chr1, Chr2 and Chr3. My question is how TopHat will treat these contigs. When mapping reads onto these contigs, will TopHat take the information from the GTF file and treat each one as Chr1 starting from bp 1, or will the contigs be treated as separate chromosomes, with the mapping done as though no gene models are available for them? Please help.

Thanks, Ritu

tophat • 2.1k views

ADD COMMENT • link updated 10.9 years ago by Devon Ryan 104k • written 10.9 years ago by GR ▴ 400

Can you make it more clear?

ADD REPLY • link 10.9 years ago by Ashutosh Pandey 12k

Hi ashutoshmits, in my file >Chr1:1000 is a contig that is 200 bp long, so it spans positions 1000-1200 on Chr1. In the GTF file I have information only for Chr1, Chr2 and Chr3. My question is how TopHat will treat this contig. When mapping reads onto this contig, will TopHat take the information from the GTF file and treat it as Chr1 starting from bp 1? Does this make my question clear?

Thanks for the help!

ADD REPLY • link 10.9 years ago by GR ▴ 400

Agree with dpryan79: if "...these contigs are parts of the chromosomes and the genomic locations of these additional contigs are known...", why do you have them in the first place?

ADD REPLY • link 10.9 years ago by Pavel Senin ★ 1.9k

score 0 · Answer 1 · 2014-01-13

How tophat treats this depends on the flags you use. Since you're already using -G annotation.gtf, that narrows down the possibilities a bit. That leaves us with two possibilities: aligning to the transcriptome only, or aligning to the transcriptome first and then to the genome. What tophat will try to do is (1) create a reference transcriptome sequence based on your annotation file and align reads to it, then (2) align the remaining reads against the genome, and finally (3) split the reads that are still unmapped into segments, align those, and join them to find novel junctions. When tophat builds the reference transcriptome it treats the "extra" contigs as separate chromosomes, so they won't affect anything at that step. If those contigs don't overlap features in the annotation file, then you'll simply never get unique alignments to the regions from which the "extra" contigs arise (i.e., if Chr1:1000 is the same as the sequence on Chr1 starting at position 1000, then you'll only get multimappers in that region, since reads will align to both the original and the "extra" contig).
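To see which annotated features sit in the duplicated regions, something like the following rough Python sketch might help. It is not part of tophat. The function names are made up for illustration, and it assumes the `Chr1:1000` naming from the question, with the number read as a 1-based start on the parent chromosome and the end taken as start + length - 1 (so a 200 bp contig covers 1000-1199 under this convention, which may differ from how your algorithm counts).

```python
def parse_extra_contig(name, length):
    """Map a name like 'Chr1:1000' to (parent, start, end), 1-based inclusive.

    Returns None for ordinary chromosome names such as 'Chr1'.
    The 'parent:start' naming scheme is an assumption taken from the question.
    """
    parent, sep, offset = name.rpartition(":")
    if not sep or not offset.isdigit():
        return None
    start = int(offset)
    return parent, start, start + length - 1


def overlapping_features(region, gtf_lines):
    """Yield (feature_type, start, end) for GTF features overlapping the region."""
    parent, region_start, region_end = region
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a well-formed GTF record
        chrom, feature = fields[0], fields[2]
        start, end = int(fields[3]), int(fields[4])
        # Two closed intervals overlap when each starts before the other ends.
        if chrom == parent and start <= region_end and end >= region_start:
            yield feature, start, end
```

Any feature this reports lies in a stretch of the chromosome that also exists as an "extra" contig, so reads from there would be expected to align to both copies at the genome step.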

Why do you have extra contigs? It would seem to be easy enough to remove them.
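If you do decide to remove them, a minimal Python sketch along these lines could do it. It assumes, based on the names in the question, that every "extra" contig has a colon in its FASTA header name and that no real chromosome does. The function name and the default rule are hypothetical, so adjust the test to whatever your naming scheme actually is.

```python
def drop_extra_contigs(fasta_lines, is_extra=lambda name: ":" in name):
    """Yield only the FASTA lines that belong to records we want to keep.

    `is_extra` decides from the record name (first word after '>') whether
    a record is dropped; the default colon rule is an assumption.
    """
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            name = header.split()[0] if header else ""
            keep = not is_extra(name)
        if keep:
            yield line


def filter_fasta(in_path, out_path):
    """Write a copy of the FASTA file at in_path without the extra contigs."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(drop_extra_contigs(src))
```

The Bowtie index would then need to be rebuilt from the filtered FASTA file before running tophat again.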