Hi,
I'm curious to know if there is a standard tool or method to perform the following function:
Given some set of genomic alignments (e.g. a set of .bam files aligning to hg19), generate alignments to a set of transcripts from this genome represented by e.g. a GTF file.
So, I've seen people talk about doing the opposite before: going from alignments to the transcriptome and projecting them back to genomic coordinates. But I want to go the other way --- a sort of "un-projection". In particular (and this is key), alignments to a single genomic origin that correspond to multiple isoforms of a gene should generate multiple output alignments.
Does anyone know of any software that would allow me to perform such processing?
Do you actually want the transcriptome coordinates or do you just want counts of things? The latter is more common since the former tends to not be useful.
Hi Devon,
I actually would like the transcriptome coordinates. Literally, I want to project the genomic alignments onto all annotated transcripts. I realize this makes the problem more burdensome, which is why I came here to see if anyone has attempted something similar.
I'd be surprised if there's not something prewritten to do this, but I'm not personally aware of it. If you've not found anything then you could always write something up. Using Rsamtools and GenomicFeatures should make this an easy enough thing to code (yes, that will be a bit slow).
What is the output? - a SAM record or otherwise reasonably complete alignment to the transcript?
Yes. The tool I'd imagine would look something like this.
Input: GTF file describing potential target transcripts, BAM/SAM alignment to the genome.
Output: SAM/BAM alignment to the target transcripts identified in the GTF file, where genomic alignments have been "expanded" to all of the transcripts they cover (i.e. a read may be unique in genomic location, but map to potentially many transcripts --- all of these alignments should be output).
Like I said before --- I know of tools for going the other way, but not for going from genome -> transcriptome.
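To make the "expansion" step in the input/output description concrete, here is a minimal sketch in plain Python. It is not an existing tool. The names `is_compatible` and `expand` are hypothetical, and it assumes the exons have already been parsed from the GTF into 0-based, half-open intervals per transcript, and that each read has been split into aligned blocks at the `N` operations of its CIGAR.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open [start, end) on the genome


def is_compatible(blocks: List[Interval], exons: List[Interval]) -> bool:
    """Return True if the read's aligned blocks fit the transcript's exons.

    Each block must lie inside a single exon, and every splice between
    consecutive blocks must run from the end of one exon to the start of
    the next exon of the same transcript.
    """
    exons = sorted(exons)
    hit = []
    for b_start, b_end in blocks:
        idx = next((i for i, (e_start, e_end) in enumerate(exons)
                    if e_start <= b_start and b_end <= e_end), None)
        if idx is None:
            return False  # block is intronic or outside the transcript
        hit.append(idx)
    for k in range(len(blocks) - 1):
        i, j = hit[k], hit[k + 1]
        if j != i + 1 or blocks[k][1] != exons[i][1] or blocks[k + 1][0] != exons[j][0]:
            return False  # splice does not match an annotated intron
    return True


def expand(blocks: List[Interval], transcripts: Dict[str, List[Interval]]) -> List[str]:
    """List every transcript that one genomic alignment should be output against."""
    return [tid for tid, exons in transcripts.items() if is_compatible(blocks, exons)]


# Toy annotation: three isoforms sharing a first exon region.
transcripts = {
    "tx1": [(100, 200), (300, 400)],
    "tx2": [(100, 200), (500, 600)],
    "tx3": [(150, 200), (300, 400)],
}
print(expand([(160, 190)], transcripts))              # unspliced read
print(expand([(180, 200), (300, 330)], transcripts))  # spliced read
```

With this toy annotation, the unspliced read is compatible with all three isoforms, while the spliced read is compatible only with the two isoforms that contain the 200-to-300 intron. A real implementation would also have to decide how to treat reads that only partially overlap an exon.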
Interesting concept, I don't know of a tool that does this but it feels quite useful and possibly not that complicated (though I might not fully understand all the implications).
Wouldn't it just be a matter of shifting coordinates by a translation? For the POS field: alignment POS minus each transcript's leftmost POS gives the new POS. The CIGAR is already relative to the alignment.
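As a rough illustration of the coordinate shift, here is a minimal Python sketch. It goes one step beyond a plain subtraction: for a multi-exon transcript the intron lengths upstream of the read also have to be removed, and `N` (skipped region) operations have to be dropped from the CIGAR. The function names `genome_to_transcript` and `strip_introns` are hypothetical, and coordinates are assumed to be 0-based, half-open intervals.

```python
import re
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open [start, end) on the genome


def genome_to_transcript(pos: int, exons: List[Interval], strand: str = "+") -> int:
    """Map a 0-based genomic position to a 0-based transcript position."""
    exons = sorted(exons)
    offset = 0  # exonic bases passed so far, walking in genomic order
    for start, end in exons:
        if start <= pos < end:
            t_pos = offset + (pos - start)
            if strand == "-":
                # Transcript coordinates run right-to-left on the genome.
                total = sum(e - s for s, e in exons)
                t_pos = total - 1 - t_pos
            return t_pos
        offset += end - start
    raise ValueError(f"position {pos} is not in any exon of this transcript")


def strip_introns(cigar: str) -> str:
    """Drop N operations from a CIGAR string and merge adjacent equal operations."""
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar) if op != "N"]
    merged: List[Tuple[int, str]] = []
    for n, op in ops:
        if merged and merged[-1][1] == op:
            merged[-1] = (merged[-1][0] + n, op)
        else:
            merged.append((n, op))
    return "".join(f"{n}{op}" for n, op in merged)


exons = [(100, 200), (300, 400)]
print(genome_to_transcript(310, exons))   # 110, not 310 - 100 = 210
print(strip_introns("20M100N30M"))        # 50M
```

For a single-exon transcript on the plus strand this reduces to the simple subtraction described above. For minus-strand transcripts the sketch only converts a single position. A complete tool would presumably also need to reverse the CIGAR, reverse-complement the read, flip the strand flag, and take the new POS from the alignment's rightmost genomic base.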
Well, I agree that it's not that complicated, conceptually (though I see it taking a little time to round out all the rough edges). The motivation (mine at least) would be to be able to use existing alignments to a genome with RNA-seq quantification tools like RSEM, eXpress and (my new tool) Salmon, that work based off of alignments relative to a transcriptome.
Aha, now I get the rationale. Not having to realign the sequences would indeed make it a whole lot easier to evaluate another transcript-based methodology, and would head off the criticism of not using the whole genome.
I think just the conversion tool on its own would be quite the helpful tool in our arsenal!
Perhaps not useful, but the STAR aligner can do this if you add the --quantMode TranscriptomeSAM flag. In one command you can generate the splice-aware genomic mappings and also those same alignments projected onto the transcriptome.
I used these transcriptome-projected BAM files as input for salmon quant in alignment mode. I made sure they were left unsorted, though I'm not sure whether that is a good idea. I only did this for a short time. It was useful for me because I wanted the genomic alignments but also wanted to do quantification, so I needed transcriptome alignments.
Before this, I was working with nanopore data and used minimap2 to align to the transcriptome FASTA file, separately from aligning to the genome reference.