Hi everyone,
I am a newbie in the field of Bioinformatics and have very little experience and knowledge of tools and applications used in pipelines. I apologize if my question may come off as inadequate. I will clarify the question or give more information as requested. I know I have omitted some of the steps in my workflow below. I have painstakingly spent a lot of time trying to figure out how to reach the next step in my RNA-seq pipeline. I have ran through most of the "conventional" pipeline workflow.
I am looking for novel transcripts. I have paired-end reads from timepoint samples. I merged the bam files resulting from the alignment to increase signal of counts. I then proceeded to use Cufflinks. I want to utilize the resulting transcript file from Cuffcompare which I have filtered to find novel transcripts, for the counts which only show increased expression in genes across the timepoints (used IGV ). At this point I want to filter for those counts that have expression increase in these unknown(potential novel) transcripts. I was told I could use HTseq from python to do the job by using the BAM and GTF files. But I cannot seem to figure out which method to use. I must mention that I have indeed looked at http://www-huber.embl.de/HTSeq/doc/overview.html, perhaps I am missing something or am I even in the right direction? Can anyone please point me to the right direction?
In a gist my question is: How do I use HTseq to assign counts to a custom transcript file based on increased expression across time points?
I appreciate your time and I hope to learn more from your responses
Thank you in advance!
Oh sorry, I did not clarify too well. Steps 1 to 3 have already been carried out by a lab partner who used HTSeq to get the counts and DESeq2 for the differential expression analysis (I should note: Gfold and Cuffdiff were also utilized but we were not convinced of the results). After the above I returned to the original sorted aligned bam files merged them and ran the above workflow. I used cuffcompare to compare the refseq.gtf with our data transcript file. I obtained the transcripts.gtf.tmap and filtered by unknown intergenic transcripts, FPKM and base-pair length. I then looked at these transcripts on IGV, and loaded with them original bam files. I saw most transcripts had no gene expression for specific time points. I now just want to select those transcripts which have increase or significant gene expression across all timepoints.
Sorry for not being clear
Ah, HTSeq-count and featureCounts are far from ideal for transcript level counting. If you find cuffdiff's transcript-level results funky then I would do the following:
This isn't directly answering your exact question, but I think you'll get better results this way.
Thank you very much! I will give this a try!