Hi, I have human RNA-Seq (Total RNA) data of 24 paired samples. My samples are labelled as S1,S2,S3 and so on. S1-S2 form a pair and S3-S4 for another (odd represents Normal samples and even diseased samples). I have used STAR to align the reads and have obtained an overall alignment of > 90% for most cases. I have then used Stringtie for transcript assembly using the following command:
stringtie -p 3 -e -B -G reference_annotation_file -l S1 --rf -o ..S1.transcripts.gtf S1Aligned.sortedByCoord.out.quality.filtered.reads.bam
This command has been used individually for all the samples.
Then I have used the online script prepDE.py to convert the fpkm values of reads to raw counts. (http://ccb.jhu.edu/software/stringtie/dl/prepDE.py)
Thereafter I have used edgeR to perform paired differential analysis for the data.
I am facing several ambiguities with the data:
The annotation file was downloaded from Gencode (hg19 version). I had used the complete annotation including contigs, scaffolds, patch. Now, I have seen that there are several genes which have names but different gene IDs, let's say gene1_ID1 and gene1_ID2. In case of some of these genes, I find that both gene1_ID1 and gene1_ID2 are differential, but their logFC and FDR values are differing significantly.
My set of 24 samples have been classified into 2 groups, based on an additional factor being present and absent in the patients. Differential analysis was performed separately in the two groups and the results compared. In these I am finding that for some genes, gene_ID1 is differential in the total set (24 samples), while the same gene with different ID (gene_ID2) is found to be uniquely differential in one of the two groups (of 12 samples). How can this be?
In case of a gene with 2 unique IDs, I find that one is upregulated and the other is downregulated, though both are significant. I shall give the example here and provide the links for the gene with 2 IDs.
- One important gene that is known to be upregulated in my disease, as supported by numerous publications, has been upregulated in my sample set. However the logFC of this gene is only 0.5.
These results have left me confused and wondering whether I have made some mistakes in my workflow. I am relatively new to RNA-Seq data. In case I have not been able to clarify my doubts properly, please feel free to probe me further.
Could you clarify if you did any transcript assembly with StringTie? Also how did you take into account the paired nature of your samples in your DE analysis? Could you post a genome browser screenshot of the HOX4 gene problem?
Yes I did transcript assembly with StringTie using the command mentioned above. This generated the gtf files for each sample which were used as inputs for converting the fpkms to raw counts using PrepDE.py. Was I required to do anything else in this step?
The paired nature of the samples was known prior to designing the experiment. So there is no issue regarding that.
As for the Hoxc4 genes, right now I do not have access to my bam files as I am stuck at home due to lockdown. However I can state that the two HOXC4 genes (links given above) had logFC (-1.23 and 2.49) with FDRs (0.0038 and 0.048) respectively.
Is stringtie assembly still recommended for human genomes where you aren't expecting to find novel transcripts?
Did you run StringTie --merge as well? You can read more about the workflow here. And yes if you have a suspicion you migh have novel transcripts I recommend running StringTie.