Hey, I have some paired-end RNA seq samples with which I want to perform DE analysis (aligning to the human genome). There are multiple samples but for simplicity, I will just take one of them. First, I performed FASTQC on the samples, the results of which are attached below.
Now, there are a total of 26912459 read pairs and all of them have a length of 161 bp (sequence length distribution of R2 not given since it's the same as R1). Since there were adapters (Illumina Universal Adapter) present, I decided to trim them using Trimmomatic.
Initially, I used the following code
trimmomatic PE -threads 16 sample_R1.fastq.gz sample_R2.fastq.gz -baseout sample_trim.fastq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 MINLEN:30
This gave 4 files as output 2 paired files and 2 unpaired files. I ran FASTQC on all 4 files, the results of which are shown below.
Now, trimmomatic worked great and removed the adapters completely. Also, the sequences have an equal length of 161. But, now the number of sequences in the paired files is 13834556, which is ~ 51.5% of the original reads. The remaining reads seem to have gone to the file 'sample_trim_1up' which is the forward unpaired file containing the orphan reads from R1. FASTQC statistics of this file are shown below.
This file contains 13075592 reads which are ~48.6% of the original reads. Also, the sequence length distribution is extremely wide ( base statistics says it contains reads 30-153 bp in length). Also, adapters are completely removed from this file as well. Of note, the 'sample_trim_2UP' file is empty.
Since I was losing almost half of my reads, I decided to use the parameter keepBothReads as TRUE which (according to what I understood) stops the default behavior of removing palindromic R2 reads and hence orphaning the cognate R1 reads. The new code is as follows
trimmomatic PE -threads 16 sample_R1.fastq.gz sample_R2.fastq.gz -baseout sample_trimnew.fastq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:8:TRUE MINLEN:30
I took parameter minAdapterlength as default 8 which the manual says is conservative and can be taken as low as 1. Now, once again I get 4 files 2 paired and 2 unpaired. This time, both unpaired files are empty. The fastQC for sample_trimnew_1P is shown below.
This file contains 26910148 reads which are 99.99% of the original reads (the output said 2311 reads were dropped). Here, again the adapters are removed and I get back almost all my reads but the sequence length distribution is varying ( base statistics say it contains 30-161 bp reads).
Here are my questions:
Which of the paired trimmomatic output files would be suitable for downstream alignment using STAR? I hope the 'trimnew' files using keepBothReads as TRUE can be used since I wont be losing any reads.
Why does trimming change of read length distribution of an initially uniform library?
Will a trimmed library of varying sizes (30-161) be suitable for STAR alignment? In particular, STAR uses a parameter called 'sjdbOverhang' during genomeGenerate ( to make the index) which should be 'read length - 1'. For an uneven library, what should this be?
Sorry for the overly long post, thanks in advance
So I had this question.. when an RNA seq library is made, there is a distribution of fragments centered around the fragment size that we want. Also, adapters have a fixed length. So the input reads to the sequencer have a varying length. Now, how does the sequencer give us uniform reads (say 161 bp in my case?). Suppose the length of fragment+adapter is less than 161. Does the sequencer simply ignore that read?
That is because you are sequencing for a fixed number of cycles e.g. 161 in your case. If you were to sequence longer you will get longer reads until following happens (see below).
No. In that case sequencer will finish sequencing the insert and then start sequencing adapter on 3'-end. If an insert was really short then a random sequence of nucleotides may be generated once adapter ends and you go into the adapter lawn on the flowcell.
Thanks a ton for the answer!! Just one last comment from what I've gathered.
For small fragments in paired-end sequencing, both the forward and reverse read will overlap fully and contain adapter readthrough at the 3' end. Now, after trimming the adapters, what's left are just two sequences which are reverse complements of each other. So, the default behavior of trimmomatic is to just remove R2 from such a pair which orphans R1 and makes it go into the R1_unpaired file. These reads can be handled by an aligner that can handle both paired and single-end reads together. Now, STAR (which I will use) requires paired-end reads. Hence, I should avoid the default behavior and set keepBothReads to TRUE so that the R1 reads are not orphaned and can be used downstream.
Is this correct? Thanks again!!
Most aligners can't use a mix of single-end and paire-end data at the same time so I would suggest that you stick with properly paired reads that survive trimming. Cases where one read survives predominantly (if the problem is of short insert then both reads should be subject to same restriction, unless there are problems with Q scores in other) may have problems with the data that you may want to look more closely separately. Now a days one has millions of reads to work with so if a small fraction needs to be set aside then that is not much of a loss.