Hi,
I used STAR, cufflinks and then cuffmerge to merge all the assemblies into one merged assembly:
STAR --genomeDir $stargenomeDir --outFilterIntronMotifs RemoveNoncanonicalUnannotated --sjdbOverhang 49 --outFilterMismatchNmax 10 --readFilesIn $r1 $r2 --runThreadN $threads --readFilesCommand zcat
cufflinks -g gene.gtf aligned_reads.bam
cuffmerge -g gene.gtf -s genome.fa -p 20 assemblies.txt > cuffmerge.gtf
So I checked the merged.gtf file in IGV to compare it with gene.gtf (a not-so-good annotation file). Why are all these transcripts not merged into one transcript? They are obviously the same!
Did I forget something?
Thanks in advance,
N.
You have only one sample?
No, I have 13 samples.
You are not alone, I have exactly the same issue. Cufflinks keeps "proposing" transcript models that I would merge without any hesitation. I looked at the alignments a bit, and it seems to decide to split models based on _very few_ splicing events (i.e. just one or two). Unfortunately, I don't know how to handle this either.
I wouldn't say they're obviously the same; there are small differences between most of them. Often, most of the variation between transcripts occurs at the 3' and 5' UTRs. Have you looked at the actual start and stop coordinates in merged.gtf? Also, you can sometimes get a lot of spurious transcripts retained when coverage is patchy, i.e. there isn't enough data to support merging them. It also depends on what was in gene.gtf.
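If it helps with checking the coordinates, here is a minimal Python sketch (my own, not part of Cufflinks) that lists the start, end and exon count of each transcript in a GTF. The function name transcript_spans is just a hypothetical helper, and it assumes standard tab-separated GTF lines with exon records and a transcript_id attribute, which is what I would expect in merged.gtf.

```python
import re
from collections import defaultdict


def transcript_spans(gtf_lines):
    """Return {transcript_id: (chrom, strand, start, end, n_exons)}.

    Hypothetical helper: only 'exon' records are used, and coordinates
    are kept 1-based and inclusive, as in the GTF itself.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip comments, blank lines and anything that is not an exon record.
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        exons[match.group(1)].append(
            (fields[0], fields[6], int(fields[3]), int(fields[4]))
        )

    spans = {}
    for tid, records in exons.items():
        chrom, strand = records[0][0], records[0][1]
        start = min(r[2] for r in records)
        end = max(r[3] for r in records)
        spans[tid] = (chrom, strand, start, end, len(records))
    return spans
```

You would call it on an open file handle, e.g. transcript_spans(open("merged.gtf")), and sort the result by start coordinate to see how much the transcripts of one locus really differ at their ends.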
If you drop the guide GTF, I think it may merge them together.
Another thought: it may struggle to merge transcripts that have exactly the same length and location in the genome but different exon usage, as in your case.
Try Cuffcompare and fiddle around with the parameters relating to merge distance (bp).
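To see whether the unmerged transcripts really differ in exon usage, one option is to group them by intron chain (the list of splice junctions). Below is a rough Python sketch under my own assumptions: group_by_intron_chain is a hypothetical helper, not a Cufflinks feature, and it expects standard GTF exon lines with a transcript_id attribute. Transcripts that end up in the same group differ only in their outer start/end; transcripts in different groups use different junctions.

```python
import re
from collections import defaultdict


def group_by_intron_chain(gtf_lines):
    """Group transcript IDs sharing chromosome, strand and intron chain.

    Hypothetical helper. Returns {(chrom, strand, introns): [transcript_id, ...]}
    where introns is a tuple of (start, end) pairs, 1-based inclusive.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        # Comments and non-exon records are ignored.
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            key = (match.group(1), fields[0], fields[6])
            exons[key].append((int(fields[3]), int(fields[4])))

    groups = defaultdict(list)
    for (tid, chrom, strand), coords in exons.items():
        coords.sort()
        # An intron is the gap between two consecutive exons.
        introns = tuple(
            (prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(coords, coords[1:])
        )
        # Note: single-exon transcripts all get the empty chain ().
        groups[(chrom, strand, introns)].append(tid)
    return dict(groups)
```

If most of your "identical" transcripts fall into separate groups, the differences are in the splice junctions rather than the UTR ends, which would fit the behaviour you are seeing.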