Hi,
I tried to process a RNA-Seq experiment myself, but when I upload the file in IGV and look at the coverage track, I see not an "equal distribution" of reads over the whole transcript (so roughly a plateau), but rather mountains .
My questions more precise: Maybe I cut the wrong adapters off? However, I firstly remove rRNA sequences with bowtie, this works. Or is there a parameter I have to change in my STAR command?
I downloaded the raw data from Geo: https://www.ncbi.nlm.nih.gov/sra/?term=SRA160745 More precisely here: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR1630838
(If I am correct, the FASTQC report shows that the read length is between 25 and 40 nts, with apr. 50 million sequences.)
Briefly to how I processed the data:
1) I trimmed the adapters using cutadapt. I obtained the adapter sequences from the paper Gao et al. (2015) "Quantitative profiling of initiating ribosomes in vivo": "Deep sequencing. Single-stranded template was amplified by PCR by using the Phusion high-fidelity (HF) enzyme (NEB) according to the manufacturer’s instructions. The oligonucleotide primers qNTI200 (5 -CAAGCAGAAGACGGCATA- 3 ) and qNTI201 (5 -AATGATACGGCGACCACCG ACAGGTTCAGAGTTCTAC AGTCCGACG- 3 ) were used to create DNA suitable for sequencing, i.e., DNA with Illumina cluster generation sequences on each end and a sequencing primer binding site."
2) I used STAR for the alignment: STAR --genomeDir file/to/STAR_Index_Genome_100/ --alignEndsType EndToEnd --readFilesIn file.fasta --runThreadN 2 --outFilterMismatchNmax 3 --outFilterMultimapNmax 20 --chimScoreSeparation 10 --chimScoreMin 20 --chimSegmentMin 15 --outSAMattributes All --outFilterIntronMotifs RemoveNoncanonicalUnannotated --alignSJoverhangMin 500 --outFileNamePrefix /path/to/star_output/ --outReadsUnmapped Fastx --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --outWigType bedGraph read1_5p --outWigNorm RPM --outSAMstrandField intronMotif --outSAMmultNmax 1 --outMultimapperOrder Random
Thanks for your comment. I double checked with another RNA-Seq experiment (already processed, as bam file), same cell line (HEK293) and the bam file looks differently. However, looking at the Actin B gene I can totally reproduce your answer and the coverage tracks look identical in both experiments. However, there are genes where you can clearly see differences (e.g. ZNF598 and more). I just rechecked and saw that the RNA-Seq experiment from Geo is single-end, while my "comparison dataset" is paired-end. I expect both experiments to be comparable, because I also have ribosome profiling data and here the read counts correlate quite well with each other (spearman cor of 0.9). However, the RNA-Seq experiments correlate not so well (0.25).
Differences in protocols can indeed change the patterns you see. But it can also be the reflect of true biological variations between your samples (some genes are expected to be static while other should display more variations). Moreover, even if same cell lines should be similar they tend to vary from one lab to another (medium of culture, freezing/thawing etc). To reliably compare two RNA-seq you should have samples from the same experiment otherwise too many parameters can interfere.
How did you calculate the spearman correlation? You shouldn't compare the raw counts, because you have two different sequencing technologies and possibly two different sequencing depths.
To obtain a quick overview, I counted the reads per region with SummarizeOverlaps (R Bioconductor). Yes, I then correlated the raw counts against each other (for both Ribosome profiling and RNA-Seq). My first assumption was that different cell counts, read depth... just change the scaling. By correlating the ribosome profiling raw counts of both experiments, I obtained a spearman correlation of 0.9. By correlating the RNA-Seq raw counts of both experiments, I obtained a spearman correlation of 0.25. But I agree here that the sequencing technology (paired-,vs. single end) might be the crucial point.
I want to compare the ribosome coverage per transcript, (so Ribosome profiling reads/RNA-Seq reads per gene/transcript).