I have RNA-seq data from a transgenic mouse in which a single gene's coding sequence has been replaced by the gene sequence from another organism. I need to quantify the expression of this transgene (get the number of aligned reads). It seems to me that the most comprehensive and accurate approach would be the following:
1) generate a custom reference genome (.fasta file), where the respective sequence is replaced with the new transgenic sequence
2) modify the entries for this gene in the gene annotation (.gtf)
3) use the modified reference genome and gene annotation to do the alignment and gene quantification.
Does that sound like the correct approach, or are there some issues that I don't see?
Also, I am having trouble implementing my plan. For #1, I couldn't find a good tool to replace one sequence with another in the .fasta file - could you recommend something? I am not proficient with Python, so I would prefer a ready-made tool.
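For illustration only, here is a minimal Python sketch of what step #1 amounts to, in case no ready-made tool turns up. The function names (`parse_fasta`, `replace_region`, `format_fasta`) are hypothetical helpers written for this example, not part of any existing package, and the coordinates are assumed to be 1-based and inclusive, as in a .gtf file. It loads every sequence into memory, so it is a sketch of the idea rather than a tuned tool for a full mouse genome.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of {record_name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the record name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def replace_region(seq, start, end, new_seq):
    """Replace the 1-based inclusive region [start, end] of seq with new_seq."""
    if not (1 <= start <= end <= len(seq)):
        raise ValueError("region lies outside the sequence")
    return seq[:start - 1] + new_seq + seq[end:]


def format_fasta(records, width=60):
    """Write records back as FASTA text with fixed-width sequence lines."""
    lines = []
    for name, seq in records.items():
        lines.append(">" + name)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Toy example: replace positions 3-6 of a 10 bp "chromosome" with "NN".
genome = parse_fasta(">chr1 toy\nACGTACGT\nAC\n")
genome["chr1"] = replace_region(genome["chr1"], 3, 6, "NN")
edited = format_fasta(genome)
```

In real use the text would be read from and written to .fasta files, and the start and end would be the coordinates of the replaced coding sequence taken from the annotation.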
Another concern that I have is that because the original and new sequences in the fasta file have different lengths, the whole annotation (or at least one chromosome) will be misaligned in relation to the modified reference genome. How can this be resolved?
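To make the concern concrete: if the replaced region spans `old_start` to `old_end` on one chromosome and the new sequence has length `new_len`, every feature downstream of it on that chromosome moves by `new_len - (old_end - old_start + 1)` bases, while upstream features and other chromosomes are unaffected. The sketch below, assuming standard tab-separated .gtf lines with 1-based inclusive start and end in columns 4 and 5, applies that shift. The function name `shift_gtf_line` is hypothetical, and features overlapping the replaced region are returned as `None` because they would need to be rewritten by hand for the transgene.

```python
def shift_gtf_line(line, chrom, old_start, old_end, new_len):
    """Return the GTF line adjusted for a replaced region.

    Lines on other chromosomes, comment lines, and features upstream of the
    region are returned unchanged. Features downstream are shifted by the
    length difference. Features overlapping the region return None.
    """
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return line
    fields = line.split("\t")
    if fields[0] != chrom:
        return line
    start, end = int(fields[3]), int(fields[4])
    delta = new_len - (old_end - old_start + 1)
    if end < old_start:
        return line  # entirely upstream: coordinates unchanged
    if start > old_end:
        fields[3] = str(start + delta)
        fields[4] = str(end + delta)
        return "\t".join(fields)
    return None  # overlaps the replaced region: needs manual editing


# Toy example: a 10 bp region (50-59) on chr1 replaced by a 25 bp sequence.
downstream = 'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1";'
shifted = shift_gtf_line(downstream, "chr1", 50, 59, 25)
```

Only the chromosome carrying the replacement needs this treatment, and only for features located after the replaced gene.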
Could anyone suggest other general approaches? Would it be more straightforward to use just a single gene sequence as a reference, and align the whole dataset to it? If yes, then what tool would you recommend?
Are you not interested in what happens to the rest of the mouse transcriptome (probably not, but good to confirm)? Was the gene you replaced a single-copy gene, with no other genes of similar sequence elsewhere in the genome? Was the replacement confirmed to be a single copy (or is that something you need to check)?
1) We are definitely interested in the whole transcriptome, but for that it seems to me that the usual RNA-seq pipeline with the original wild-type genome is sufficient: only the targeted gene will be affected, and reads for all the other genes should be aligned and quantified normally. Isn't that right?
2) The replacement was not confirmed as a single copy. That would be useful to check; how does it affect the approach?