Hey guys,
I need your general help.
We sequenced a plant without a reference genome. We got 100bp paired-end reads using HiSeq 2000 (two replicates each condition), put all sequencing reads together and did de novo assembly using Trinity to get a "reference" transcriptome.
I aligned the individual reads back to the transcriptome, using the alignReads.pl script within Trinity package. The command I used was:
$/PATH/TO/SCRIPTS/alignReads.pl --left Sample1_R1.fastq
--right Sample1_R2.fastq --seqType fq --target
REFERENCE_transcriptome.fasta --aligner bowtie
As a result, I got many results' names and sizes listed below, for one sample.
bowtie_out.coordSorted.bam 4.56GB
bowtie_out.coordSorted.bam.bai 12.7MB
bowtie_out.nameSorted.bam 3.86GB
bowtie_out.nameSorted.PropmapPairsForRSEM.bam 3.38GB
others are 0 bytes in size
My questions is, I want to get differential expressed genes (rather than transcripts) using DESeq, I should get count tables first. But I don't know which data listed above should be used and how to generate count tables? Such as using HTSeq or other programs or scripts.
Simon said, "If you must align against the transcriptome, make sure that you count for genes, not transcripts, and remove reads mapping to transcripts from more than one gene." And I just want to count for genes.
If I am wrong or not using softwares properly with statements above, critics and suggestions are greatly appreciated.
Thank you very much!
Yours sincerely, -lzsph
Hi ashuto,
Thanks for your reply. Since we don't have a reference genome available, we put all sequencing reads together to assemble a transcriptome as reference. Count tables generated by "htseq-count" script are easy for those species with a reference genome, which we don't have. I don't have any gtf/gff3 files. And I want to count for genes rather than transcripts. Have you get any idea of it?
Regards,
lzsph