Hi all,
I was wondering whether anyone has experience with de novo transcript assembly from RNA-seq data (100 bp PE Illumina reads) for only one gene. We have about 50 RNA-seq libraries from human tissue. At the moment we are interested in only one gene and want to know which transcripts of this gene are expressed. Are there specific packages/programs for this? Or does anyone have tips or ideas?
Thanks for your reply. The reason is that when I look up this gene (and many others) in the FANTOM5 transcription start site database, there are many more potential transcription start sites than are annotated. This suggests that more transcripts may be present than are annotated in e.g. GENCODE or UCSC. For Trinity, would you recommend pooling all samples together for the initial assembly, or assembling each library separately?
Whether or not to pool samples for de novo transcriptome assembly is a question with no obvious answer. By pooling you will get better contiguity, but the chance of mis-assemblies increases. Without pooling, the lowly expressed transcripts will not be missed, but it means more time and multiple runs. If you want a quick and dirty approach, why not pool the samples to start with, provided you have enough computing power? If you hit the memory limit, you might have to normalise the data.
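As a rough sketch of the pooled option: Trinity accepts comma-separated lists of files for --left and --right, so all libraries can go into one run. The helper below only builds the command line and does not run anything. The function name, file names, memory and CPU values are placeholders I made up, not recommendations.

```python
from typing import List, Sequence


def trinity_pooled_command(
    left: Sequence[str],
    right: Sequence[str],
    output_dir: str = "trinity_pooled",  # hypothetical directory name
    max_memory: str = "50G",             # placeholder, adjust to your machine
    cpus: int = 8,                       # placeholder
) -> List[str]:
    """Build a Trinity command that pools several paired-end libraries."""
    if not left or len(left) != len(right):
        raise ValueError("need the same non-zero number of left and right files")
    # Trinity expects the output directory name to contain 'trinity'.
    if "trinity" not in output_dir.lower():
        raise ValueError("output directory name should contain 'trinity'")
    return [
        "Trinity",
        "--seqType", "fq",
        "--left", ",".join(left),    # R1 files, same library order as R2
        "--right", ",".join(right),  # R2 files
        "--max_memory", max_memory,
        "--CPU", str(cpus),
        "--output", output_dir,
    ]


if __name__ == "__main__":
    cmd = trinity_pooled_command(
        ["lib01_1.fq", "lib02_1.fq"], ["lib01_2.fq", "lib02_2.fq"]
    )
    print(" ".join(cmd))
```

For the per-library option, the same helper can be called once per library with single-element lists and a different output directory each time.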
To reduce the computational load, do you think I could also start from the already aligned BAM files, extract the reads mapping to the gene of interest, convert them to .fastq files and continue from there? Or would this be too biased an approach?
You could do that. There should be no bias, since you have aligned to the whole genome. If you suspect that there are additional transcription start sites, are you going to opt for a "region" of interest rather than the gene of interest, or would you just use the coordinates of the longest gene model?
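A minimal sketch of the extraction step, assuming the alignments for the region are available as SAM text lines (for example the output of "samtools view in.bam chr:start-end"). It restores reverse-strand reads to their original orientation and keeps only pairs where both mates fall inside the region, because a region query does not return a mate that mapped elsewhere and an assembler expecting paired input needs both files in sync. The function names are my own, and in practice samtools can do the conversion itself.

```python
import re
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(cigar):
    """Number of reference bases covered by an alignment (M, D, N, =, X)."""
    return sum(int(n) for n, op in _CIGAR_OP.findall(cigar) if op in "MDN=X")


def region_pairs_to_fastq(sam_lines, chrom, start, end):
    """Return (r1_records, r2_records) for complete pairs inside a region.

    Coordinates are 1-based and inclusive, as in SAM.
    """
    mates = defaultdict(dict)
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        f = line.rstrip("\n").split("\t")
        name, flag, rname, pos, cigar = f[0], int(f[1]), f[2], int(f[3]), f[5]
        seq, qual = f[9], f[10]
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800).
        if flag & 0x904 or rname != chrom:
            continue
        aln_end = pos + reference_span(cigar) - 1
        if aln_end < start or pos > end:
            continue
        if flag & 0x10:
            # SAM stores reverse-strand reads reverse-complemented; undo that.
            seq = seq.translate(_COMPLEMENT)[::-1]
            qual = qual[::-1]
        if flag & 0x40:
            mates[name][1] = (seq, qual)
        elif flag & 0x80:
            mates[name][2] = (seq, qual)
    r1, r2 = [], []
    for name in sorted(mates):
        if len(mates[name]) == 2:  # drop reads whose mate is missing
            r1.append("@%s/1\n%s\n+\n%s\n" % ((name,) + mates[name][1]))
            r2.append("@%s/2\n%s\n+\n%s\n" % ((name,) + mates[name][2]))
    return r1, r2
```

Dropping incomplete pairs is a design choice for this sketch; the alternative is to keep them and feed them to the assembler as single-end reads.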
Yes, I was indeed thinking more of a region of interest, defined on the basis of some UCSC tracks and the MiTranscriptome data.
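One way to turn that into coordinates, as a small sketch: take every interval you trust (the annotated gene model, candidate start sites from other tracks), span them all and add a flank on each side. The function name, the default flank and the coordinates in the example are hypothetical.

```python
def padded_region(chrom, intervals, flank=10000):
    """Return a 'chrom:start-end' string spanning all intervals plus a flank.

    intervals: iterable of (start, end) pairs, 1-based inclusive.
    The start is clipped at 1; the end is NOT clipped to the chromosome length.
    """
    intervals = list(intervals)
    if not intervals:
        raise ValueError("need at least one interval")
    start = max(1, min(s for s, _ in intervals) - flank)
    end = max(e for _, e in intervals) + flank
    return "%s:%d-%d" % (chrom, start, end)


if __name__ == "__main__":
    # Made-up coordinates: a gene model plus one upstream candidate start site.
    print(padded_region("chr1", [(150000, 180000), (142000, 142200)], flank=5000))
```

The resulting string has the form that region queries on an indexed BAM file accept.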