Hi all, I have come across something I have never seen before. I am working with data from an outside source that appears to be processed RNA-seq files. Like other processed RNA-seq files I have run into, they are tab-delimited files with columns for gene length, expected gene length, TPM, and counts for each probe identifier. Here is where things get weird: for any two samples and the same probe set identifier, the gene lengths are different, and the difference can be quite large! I have never seen this. In my experience the gene lengths have always been the same from sample to sample, while the expected lengths may vary a little. This ultimately affects how the TPM is calculated and makes me wonder what I am missing. Does anybody have a clue why this might be the case?
What are the exact column names in the file? Do you know which tool was used to generate these files? I think I've seen RSEM do this with ENCODE GTF files (different length values for the same gene in different samples), so I am interested in your question as well.
Could it be caused by having different runs with different read lengths?
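One possible explanation, if the files do come from RSEM (not confirmed in this thread): as far as I know, RSEM reports the gene-level length as an average of the isoform lengths weighted by each isoform's estimated abundance, so the value shifts whenever the isoform mix differs between samples. Here is a minimal Python sketch of that idea. The function names, the isoform lengths and abundances, and the fallback for unexpressed genes are all made up for illustration and are not taken from RSEM or from your files.

```python
def gene_length(isoforms):
    """Abundance-weighted mean length over a gene's isoforms.

    isoforms: list of (length_bp, abundance) tuples (hypothetical values).
    """
    total = sum(abundance for _, abundance in isoforms)
    if total == 0:
        # Assumption for this sketch: unexpressed gene -> plain mean of lengths.
        return sum(length for length, _ in isoforms) / len(isoforms)
    return sum(length * abundance for length, abundance in isoforms) / total


def tpm(counts, lengths):
    """Standard TPM: length-normalised rates rescaled to sum to one million."""
    rates = [c / l for c, l in zip(counts, lengths)]
    scale = sum(rates)
    return [1e6 * r / scale for r in rates]


# Same gene, same two isoforms (1000 bp and 3000 bp), different isoform mix.
sample_a = [(1000, 90.0), (3000, 10.0)]  # mostly the short isoform
sample_b = [(1000, 10.0), (3000, 90.0)]  # mostly the long isoform

len_a = gene_length(sample_a)  # 1200.0
len_b = gene_length(sample_b)  # 2800.0

# Identical counts for this gene and for a second, fixed 2000 bp gene:
tpm_a = tpm([100, 100], [len_a, 2000])
tpm_b = tpm([100, 100], [len_b, 2000])
print(len_a, len_b, tpm_a[0], tpm_b[0])
```

With identical counts, the gene gets a lower TPM in the sample where its weighted length is larger, which matches the effect on TPM described in the question. Checking whether the per-sample isoform-level results show a different dominant isoform would be one way to test this.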