Hi biostars
I used Plyester package to simulate some RNA seq data. Polyester provides simulated RNA reads FASTA files from a reference FASTA as output. For example, lets assume that I want to simulate 100 transcript. So I use the reference FASTA file and choose its first 100 transcripts using
> fatsa[1:100] #fasta is the reference FASTA file imported in R
Then I use HISAT to align my simulated RNA read FASTA files to reference genome (for example hg38) and htseq to obtain the read counts. But the problem is the read counts I'm given are a table of all transcripts in the GTF file. And also not all the transcript I had first chosen have been counted. To clarify what I mean, look at the example below
# transcript I want to simulate:
ENST00000456328.2
ENST00000456330.1
ENST00000457025.1
# And this is the table of read counts I'm provided with:
ENST00000456328.2 100
ENST00000454675.1 50
ENST00000456330.1 0
ENST00000545455.2 70
ENST00000457025.1 0
As you see I have been provided with more transcripts that I wanted to simulate and also the transcripts I wanted to sim have 0 counts but the transcripts I didn't want to simulate have counts of 50 and 70
I think the problem is that I should take a subset of GTF file with the first 100 transcript as the ones I had chosen for simulation and then use this subset of GTF file for obtaining read counts using htseq. Is that correct?
Do I need to have a subset of reference genome (hg38) for alignment of my simulated fasta files to bam files too?
Thanks
If you only want to simulate 100 transcripts then cut your transcript file that number. When
polyester
generates reads, it is going to do that using some model its code uses (you are a statistician so you should be able to dive into that code and check). Otherwise you have to generate a counts table yourself and THEN askpolyester
to generate the reads.Ideally the simulated reads should align only to those 100 transcripts but some may align elsewhere if you use the full genome/transcriptome when aligning.
Thank you. No, actually I want to simulate 10000 transcripts. 100 was just an example to clarify my question. I did cut my reference FASTA file but I didn't know if I should cut my gtf file to that number of transcripts when I'm gonna use htseq to obtain counts.
By transcript file you mean the GTF file?
I didn't understand why I should check Polyester's code. Polyester provides me with simulated RNA reads of the transcripts I choose.
Since you don't need the GTF file for read simulation I meant trim the original transcript fasta to 100 sequences. You could use a small GTF with only those 100 transcripts when you do the counting to speed things up.
If you want to prevent it from giving you 0 counts for some transcripts. You may need to change polyester code to prevent it from giving you zero counts. As @h.mon said below real datasets will have zero counts for genes so that is how polyester is modeling things.
Real RNAseq data has plenty of zero-count transcripts. How much will depend on various factors, such as homogeneity or heterogeneity of starting material (e.g cell line vs tissues), library prep kit (e.g. whole RNA vs poly-A selection), and so on. I don't know how Polyester simulates the reads, but if is trying to perform realistic simulations, it will include a sizeable proportion of zero-count transcritps.
This is how polyester works:
"Polyester is an R package designed to simulate RNA sequencing experiments with differential transcript expression.
Given a set of annotated transcripts, Polyester will simulate the steps of an RNA-seq experiment (fragmentation, reverse-complementing, and sequencing) and produce files containing simulated RNA-seq reads. Simulated reads can be analyzed using your choice of downstream analysis tools.
Polyester has a built-in wrapper function to simulate a case/control experiment with differential transcript expression and biological replicates. Users are able to set the levels of differential expression at transcripts of their choosing. This means they know which transcripts are differentially expressed in the simulated dataset, so accuracy of statistical methods for differential expression detection can be analyzed. Polyester offers several unique features:
Built-in functionality to simulate differential expression at the transcript level
Ability to explicitly set differential expression signal strength
Simulation of small datasets, since large RNA-seq datasets can require lots of time and computing resources to analyze
Generation of raw RNA-seq reads, as opposed to alignments or transcript-level abundance estimates
Transparency/open-source code"
As you see this package isn't good for big simulations. So maybe I simulate less transcript (for example 3000 transcripts)
But my main problem now is that when I align the simulated RNA reads FASTA files using hisat and then obtain read counts using htseq, about 70 percentage of my simulated transcripts are counted zero. Which is very disappointing to me. How can I use this simulated count table for my analysis when 70% of transcripts are zero.
And by the way, I defined the number of reads to be a function of transcripts length so longer transcripts would provide more reads.
So that's why I wonder why most of transcripts are counted zero.
No that is not what the quote above says. It is only saying that larger datasets require significant compute resources (as you have already discovered) so polyester is making it possible to simulate small datasets.
So provide a count matrix you like and then have polyester generate reads from it. I am not entirely sure why you are not taking this route.
See my comment immediately above. Make up a matrix you like.
Thank you so much @genomax. That is what I did. I generated a count table and gave it to polyester to generate reads based on that table. For example, the element i,j=200 of this count table would tell polyester to generate 200 reads for transcript i replicate j. And all the elements of the count table have been reasonably large. So that's why I don't expect to be given 0 counts when I obtain the count table of simualted rna reads using downstream analysis including hisat and htseq. Maybe there is something wrong about my pipeline that I don't know.
Are you simulating reads that are long enough to allow specific mapping? If reads multi-map then counting programs by default ignore those reads and they are not counted. This is done because in real life you can't be sure of the spot in the genome a read originated from, if it is mapping equally well at more than one spot.
This may be the reason why you are seeing 0 counts in your table. Enable the option in htseq-count (I think that is
--nonunique all
) to count multi-mapped reads and see if those 0 counts go away.I simulated the first 100 transcripts of the reference FASTA. htseq has this options: union, intersection-strict, and intersection-nonempty. As you suggested, I tried the least strict option which is intersection-nonempty. The number of aligned reads has increased but the number of transcripts with zero counts are still the same. I think I should trim the GTF file to those transcripts I chose to simulate.
What do you think?
Can you post the exact command you used for htseq-count? I am more familiar with
featureCounts
(http://subread.sourceforge.net/ ) where the equivalent option to count multi-mapping reads is-M
.I used htseq in usegalaxy.org. featureCounts? let me give it a try.
Wow. Thank you. FeatureCounts sounds to align reads only to my chosen transcripts. How? I don't know. But exactly my chosen transcripts have the counts. Other transcripts are counted zero except one. Now only 20% of my transcripts are counted zero and one unsimulated transcript is counted nonzero.
Glad to hear that. Now on to the next step of simulating those 10000 transcripts.
َthanks a lot genomax