Gidday,
I've been running de novo transcriptome assembly on a large set of RNA-seq data (n=48 samples).
I'd like to combine the per-sample assemblies into a meta-assembly (representing potential novel transcripts present in the entire sample set) that I can use for downstream analyses.
Due to the general noisiness of RNAseq and assembly, I would like to filter out transcripts that are present in < 5 samples. From what I've found, Bedtools Compare Multiple Bed Files? suggests a solution using bedtools multiinter that would identify intervals present in N samples. However, the toy example given there shows intervals being split up based on the #samples that have the interval - ideally, my solution would keep the longest interval within which the samples with matches fall (this would correspond to keeping the longest transcript covering the region in question).
My question is: are there other approaches possible? Is there a way to do this using e.g. GRanges in R?
Right now my plan would be to
- Remove transcripts corresponding to the reference transcriptome
- Use bedtools multiinter on my full set of assemblies
- Take the intervals where >=5 samples overlap
- For each interval, extract the relevant transcripts from the samples which have an overlap
- Take the longest transcript from each of these
- Use cuffcompare or cuffmerge with the reference transcriptome and the unannotated transcripts from step 4 -> gives a final, filtered high-confidence transcriptome.
However, if there's something out there which could do the whole workflow automagically, that would be grand.
Any thoughts and suggestions greatly appreciated, and thanks in advance!
Try renaming your headers in each of the assembly. For example, sample1 assembly should have transcripts with name
>sample1_contig_id1
,>sample1_contig_id2
. Similarly for sample 2, with>sample2_contig_id1
,>sample2_contig_id2
, .. so on. (A simplesed
command should do this)Then cluster them using
cd-hit-est
in such a way that the shorter transcripts with certain %identity and with %coverage would collapse into the longest contig. Those contigs which are present in multiple assemblies will form a bigger cluster. You can look into it and pick the clusters which have contigs from at least different samples.Because
cd-hit-est
selects the representative you should be able to use that non-redundant fasta (depends on parameters) for further analysis.Thanks, I will definitely try that approach out also!