Good morning Biostars. I realize this might be a very specific question, and I have also tried to post it on the corresponding GitHub repo, but I'm afraid that might take a while.
For my project I have been tasked with assembling a transcriptome for my species of interest, using the genome as a guide (not at chromosome level, 2090 contigs) together with 23 Illumina samples.
After my initial assembly (hisat2 + stringtie2 + stringtie2 --merge pipeline) I noticed that quite a few transcripts in my assembly cover two or more reference genes. After manual inspection, I found that in many of these cases only a few reads support the splice sites joining the genes. This is a known problem with stringtie. To solve this, I decided to increase the stringency (with -c 1.5 and -j 15), which led to some improvements. My supervisor suggested that I instead concatenate the alignments from all 23 samples and feed that single file to stringtie, increasing the -j parameter. I've since read the original stringtie paper and got the idea that transcript expression levels are important for the assembly, but I'm not too sure about this.
Is it correct to use this approach, or should I assemble each sample individually?
I searched the literature and have not found a single example of my approach. My logic is that by using all the reads as evidence I have better control of the -j parameter to reduce trans-splicing sites (splicing sites between different reference genes).
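
For concreteness, here is a minimal sketch of the two strategies being compared, written as Python that only builds the command lines and runs nothing. Several things in it are assumptions rather than details from my actual setup: the file names are hypothetical, the alignments are assumed to be coordinate-sorted BAM files, "concatenating" is assumed to mean `samtools merge`, and the -j value of 30 for the pooled run is an arbitrary placeholder, not a recommendation.

```python
import shlex
from pathlib import Path
from typing import List


def per_sample_commands(bams: List[str], min_cov: float = 1.5,
                        min_junc: int = 15) -> List[List[str]]:
    """Strategy A: assemble each sample separately, then stringtie --merge."""
    cmds, gtfs = [], []
    for bam in bams:
        gtf = str(Path(bam).with_suffix(".gtf"))
        # -c: minimum read coverage, -j: minimum reads spanning a junction
        cmds.append(["stringtie", bam, "-c", str(min_cov),
                     "-j", str(min_junc), "-o", gtf])
        gtfs.append(gtf)
    # Merge the per-sample assemblies into one non-redundant transcript set
    cmds.append(["stringtie", "--merge", "-o", "merged.gtf", *gtfs])
    return cmds


def pooled_commands(bams: List[str], min_junc: int) -> List[List[str]]:
    """Strategy B: pool all alignments first, then assemble once."""
    return [
        # Assumption: pooling is done with samtools merge on sorted BAMs
        ["samtools", "merge", "pooled.bam", *bams],
        ["stringtie", "pooled.bam", "-j", str(min_junc), "-o", "pooled.gtf"],
    ]


if __name__ == "__main__":
    # Hypothetical file names standing in for the 23 samples
    samples = [f"sample{i:02d}.bam" for i in range(1, 24)]
    for cmd in per_sample_commands(samples) + pooled_commands(samples, min_junc=30):
        print(shlex.join(cmd))
```

In strategy A the -j threshold applies within each sample, so a junction needs 15 reads in at least one sample to be kept. In strategy B the same threshold is applied to the reads of all 23 samples combined, which is why it would have to be raised for a comparable stringency.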