Dear all:
I'm new to this. Just a quick question to ask for suggestions on how to tackle potential differential expression of two transcripts from the same gene. What I did:
- Mapped the reads and generated .bam files using tophat2.
- Ran "grep genename the.human.gtf.file" to get exon info from the original GTF file used in tophat2; saved this to a gene.bed file (keeping only the lines that are exons).
- Ran "bedtools multicov -bams bamfiles -bed gene.bed > output.bed"; this is looped to get counts for all the BAM files.
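A side note on the grep step above: lines pulled straight out of a GTF are still in GTF layout (1-based, inclusive coordinates) even if the file is named gene.bed. As far as I know bedtools accepts GTF/GFF intervals as well, so this may not matter for multicov, but if a true BED file is wanted, a minimal Python sketch of the conversion could look like the following. The function names are hypothetical, and it assumes the GTF carries `gene_name` and `transcript_id` attributes, which not every annotation does.

```python
def parse_attributes(attr_field):
    """Parse the 9th GTF column (key "value"; pairs) into a dict."""
    attrs = {}
    for item in attr_field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def gtf_exons_to_bed(gtf_lines, gene_name):
    """Return BED6 tuples for the exon features of one gene.

    Assumption: the gene is identified by the gene_name attribute.
    GTF start is 1-based inclusive, BED start is 0-based, so start - 1.
    """
    records = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # keep exon features only
        attrs = parse_attributes(fields[8])
        if attrs.get("gene_name") != gene_name:
            continue
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        name = attrs.get("transcript_id", ".")
        records.append((chrom, start - 1, end, name, "0", strand))
    return records
```

Each tuple can then be written out tab-separated, one per line, to produce gene.bed with the transcript ID in the name column.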
The issues:
The bed file I generated has multiple transcript IDs, and they share most exons. My idea right now is to look at the exons that differ between the transcripts.
For short exons, the counts are not reliable, since many samples have no reads matching them.
If I want to "normalize" by the total mapped reads for the sample and by exon length, will this be enough?
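To make the idea above concrete, here is a rough Python sketch of picking out exons that belong to only one transcript and scaling a count by exon length and library size (an RPKM-style value). The function names and input layout are assumptions for illustration only, and exons are compared by exact coordinates, so partially overlapping exons are treated as different. Note that scaling does not make a short exon with zero or very few reads any more informative.

```python
from collections import defaultdict


def transcript_specific_exons(exons):
    """Map exon coordinates to the single transcript that contains them.

    exons: iterable of (chrom, start, end, transcript_id), BED-style
    coordinates. Exons shared by two or more transcripts are dropped.
    """
    owners = defaultdict(set)
    for chrom, start, end, tx in exons:
        owners[(chrom, start, end)].add(tx)
    return {
        coords: next(iter(txs))
        for coords, txs in owners.items()
        if len(txs) == 1
    }


def rpkm(count, exon_length, total_mapped):
    """Reads per kilobase of exon per million mapped reads."""
    return count * 1e9 / (exon_length * total_mapped)


# Hypothetical example: two transcripts sharing one exon.
exons = [
    ("chr1", 100, 300, "T1"),
    ("chr1", 100, 300, "T2"),
    ("chr1", 500, 700, "T1"),
    ("chr1", 900, 950, "T2"),
]
specific = transcript_specific_exons(exons)
# 50 reads on a 200 bp exon in a sample with 10 million mapped reads
value = rpkm(50, 200, 10_000_000)
```

Here `specific` keeps only the two exons unique to T1 or T2, and the counts from the multicov output for those rows would be the ones to normalize and compare across time points.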
Thanks, I appreciate any suggestions on packages or strategies. This is a time series study, so it would be nice to show how the different transcripts change over time.
So you want to know whether two transcripts of the same gene have different expression levels within the same sample?
Why don't you get transcript-level counts from tools like Cufflinks/StringTie and then compare?
Or, quicker, with salmon/kallisto.
I disagree that read count comparisons between transcripts with different nucleotide compositions should be made, unless the dataset is from Pacific Biosciences single molecule real-time sequencing or has been statistically adjusted for GC biases.
GC bias is minimal these days. Further, tools like Salmon can adjust for that if needed.