I have a RNA-seq dataset. I used Tophat
and Cufflinks
to align to my reference genome and make an assembly of my transcripts driven by a GTF
of my genome.
I have an interest gene (9 exons), I am looking if it is expressed in my data. When I look at it in the assembly, I found it split in 2 parts. The 3 first exons in a transcript. And the 6 last exons in another transcript. I was thinking that the expression level was too low to make a good assembly of this transcript. But when I look at the sashimiplot from the .bam
file used for the assembly, here is what I have:
I have enough reads supporting the junction, but cufflinks
reports 2 transcripts.
Any idea about why is this happening?
EDIT: Could it be comming from my parameters?
-u
--min-frags-per-transfrag 10
--max-multiread-fraction 0.99
--trim-3-avgcov-thresh 5
--trim-3-dropoff-frac=0.1
--overlap-radius 50
There supposingly are too few reads to support the junction between exon 3 and 4. Try to reduce the threshold (I guess it's
--min-frags-per-transfrag
) to a lower value, see what happens!that's the junction with the most reads in the whole transcript
do you know if these reads are uniquely mapped?
I just filtered out multimapped reads, and I have 67 reads supporting the junction instead of 77. It seems that this is not a multimapping problem :/
did you check how do they overlap the junction? maybe a majority of them only has few bases on one exon...
Nope, they overlap very well (not worse than any other junctions). I changed all parameters one by one, I always get this transcript split in2. The only way to have it full is to use faux-reads from a guide annotation (and I don't want to use a guide annotation).
I'm sorry I don't have any other ideas as I'm not an expert. For transcript quantification cufflinks has long been known for being sub-optimal, maybe some other software that is specifically built for assembly like SPAdes or Trinity would be a better choice?
Sorry, I thought you asked the next questions about N-stretches. I answered about why I use cufflinks and not other programs in the next comment. Thank you for the help!
How is the reference in that region? Is there an N-stretch in between?
Nope, there is not a single N in that region. I could use another aligner, but cufflinks is part of a pipeline I am using (not mine).
I retrieved the command line that runs cufflinks, the bam used, other files, etc...(in the pipeline code). I reran cufflinks separately from the pipeline changing parameters and I get the same result, with this transcript unassembled.
Anyway, I will keep in mind that I have to check when a transcripts is badly assembled if there are reads supporting the junction.
If you have any other ideas, you are welcome! Thanks for the help :)
The reads you're using... how did you filter them? Did you keep only primary alignments (
-F 0x0100
) and proper pairs (-f 0x2
)? Maybe it's not connecting those two exons because not all the reads that map there are not primary/properly paired.I did not filter them. When am I supposed to use
-F 0x0100
and -f 0x2
?samtools view
options. See: http://www.htslib.org/doc/samtools.htmlI just filtered my bam and checked manually, they are still there. I also checked manually on igv, those junction reads are not secondary and are proper pair.
Do they terminate before a splicing start site (f.e. "GT") and begin again after a splicing end site (f.e. "AG")? If not, maybe that's the reason.
It is :
XXXXGA GTXXXXXXXXAG CGXXXXX
Exon | intron | Exon
I think it could be because the two splicing sites have a couple of point mutations with respect to the U2 and U12 known splice sites, and even with that, they seem to look at each other and not to be consecutive:
Bottom line being:
The reads map in split mode across the intron, on the two flanking exons, so clearly the transcript belongs together, but the prediction of the gene model based on splice sites cannot do so, because the sites face each other and that's not a likely thing to observe wihtin a gene.
sorry but are you sure the splice site isn't correct? GU at the start of the intron and AG at its end is the standard thing, isn't it?
Actually, you're right. I thought they were 3 exons with the splice sites shown. If the middle one is an intron, then my answer is not correct!
sorry for being a killjoy ;)
I guess you are right, if you want to post this as an answer I could validate it. Anyway, it means that on other transcripts, this problem might not be always present so it is ok for me. Thanks a lot :)