We are looking at some of the TCGA's level 3 data on RNAseq to examine expression levels for a series of homologs with very high identity between exons. The level of identity, exon junctions, etc is such, that for instance, BLAT of individual exons gives back near perfect matches for several of the homologs and the exon junctions are matches nucleotide for nucleotide. For example, an average of 2bp in 100 will differ between exons. Since we know which individuals have homozygous deletions for the gene and we still see RPKM readings for these exons, we know this is incorrect and probably due to the incorrect alignment of sequence reads to the wrong exons.
What is the best method to use TCGA's level 1 data (BAM files, presumably) to get more exact matches for our gene(s) of interest? I guess we'd want to do some level of trimming to select high quality reads and allow minimum or no mismatching. Do the standard pipelines for analyzing RNAseq data do well in cases like this or is there a better method? To be clear, we know the gene model and transcripts which should arise in the presence of this deletion. If I missed any post that answers this kind of question, please let me know. Any help is appreciated.
I did not fully understand this "we know this is incorrect and probably due to the incorrect alignment" . My first impression would be that is easy to check and should be clarified first before anything else, are these alignments incorrect for most or not?
Or to rephrase that in what way is the alignment incorrect? It is only wrong if there were better places to align or if the gap is not reported.
Or it could be I am misunderstanding the problem.
A region of the genome which is deleted should not be able to have RNA aligned to it. The fact that TCGA does not know it is deleted may explain why this is still is happening. For people without deletions they are presumably correct. But before downloading 800+ RNAseq files, I wanted to know how to tackle this. Does that make sense?
it all depends what the size of the deleted region is and how the read overlaps with it and what the alignment filtering was - that region could still be the best match for the read and if so it will be aligned there
What tweaks could allow the highest quality sequences to be used? Trimming ends to drop low Q scores? Can we specify gene models that are missing those exons for individuals we know are deleted? And can we specify no mismatches permitted to make sure the wrong reads are not aligning?
I still think you are trying to fixing this from the incorrect point of view.
Wrong alignment means that a read would match better somewhere else yet that location was not found. This is what you need to establish first. Forcing the alignment process so that it does not align where you think it shouldn't is not the way to go about it. With that you are just generating custom data.
If a large section should be missing and you still see reads there then the data labeling or the experimental process was incorrect altogether. Your data represents something else than what you think it is.