Hi, I'm a novice trying to analyse RNA-seq data from rat brain regions that I got from Novogene. I ran Falco and the MultiQC report looks pretty decent:
So I proceeded to the trimming step. From what I gathered, quality trimming is perhaps not even necessary, but it's a good idea to trim adapter sequences, ideally using the adapter sequences provided by the company. Looking at the report, they seem to have done some filtering of their own:
(1) Remove reads containing adapters.
(2) Remove reads containing N > 10% (N represents the base cannot be determined).
(3) Remove reads containing low quality (Qscore<= 5) base which is over 50% of the total base.
And then there are a few pie charts listing the number and percentage of "Clean Reads", "containing Ns", "Low Quality", and "Adapter Related" for each sample. Based on the numbers, and the fact that the `<is_filtered>` field is always `N` in the FASTQ sample files, I conclude that I got filtered data.
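For anyone wanting to repeat the `<is_filtered>` check, here is a minimal sketch of how the flag could be tallied. It assumes the headers follow the Casava 1.8+ layout, where the part after the space reads `<read>:<is_filtered>:<control number>:<index>` and `Y` marks a read that failed the filter. The function names are made up for illustration.

```python
from collections import Counter


def filter_flag(header: str) -> str:
    """Return the is_filtered flag ('Y' or 'N') from one FASTQ header line.

    ASSUMPTION: Casava 1.8+ style header, i.e. two space-separated parts,
    the second being <read>:<is_filtered>:<control number>:<index>.
    """
    comment = header.rstrip().split(" ", 1)[1]
    return comment.split(":")[1]


def tally_filter_flags(lines) -> Counter:
    """Count flags over an iterable of FASTQ lines (4 lines per record)."""
    # Every 4th line, starting with the first, is a header line.
    return Counter(filter_flag(line) for i, line in enumerate(lines) if i % 4 == 0)


# Tiny in-memory example using the header quoted in this post;
# the sequence and quality lines are dummy values.
example = [
    "@LH00409:280:22KV5JLT4:6:1101:46681:1070 1:N:0:TCTTACCACG+TTACGTGAGC",
    "ACGT",
    "+",
    "IIII",
]
print(tally_filter_flags(example))
```

For a real gzipped file, the same function can be fed `gzip.open(path, "rt")` from the standard library instead of the in-memory list.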
However, Falco still detects adapter sequences. The four outliers all come from one sample (Illumina Universal Adapter and PolyG, in both the forward and reverse FASTQs), and I have no idea why:
Then, I tried trimming with Atria. The report file says this about adapter sequences:
P5 adapter:
P5 -> P7' (5' -> 3')
AATGATACGGCGACCACCGAGATCTACAC[i5]ACACTCTTTCCCTACACGACGCTCTTCCGATCT
P7 adapter:
P5 -> P7' (5' -> 3')
GATCGGAAGAGCACACGTCTGAACTCCAGTCAC[i7]ATCTCGTATGCCGTCTTCTGCTTG
Where `i5` and `i7` are sample-specific sequences listed in the FASTQ files themselves, e.g.
@LH00409:280:22KV5JLT4:6:1101:46681:1070 1:N:0:TCTTACCACG+TTACGTGAGC
[i5] [i7]
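To make the `[i5]`/`[i7]` placeholders concrete, here is a rough sketch that pulls the index pair out of a header and drops it into the two template sequences from the report. It takes the order of the two indices exactly as labelled above, which is an assumption I have not verified. Whether an index appears as-is or reverse-complemented inside the real adapter may also depend on the instrument. The function names are hypothetical.

```python
# Template sequences copied from the report, with the index placeholders.
P5_TEMPLATE = "AATGATACGGCGACCACCGAGATCTACAC[i5]ACACTCTTTCCCTACACGACGCTCTTCCGATCT"
P7_TEMPLATE = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC[i7]ATCTCGTATGCCGTCTTCTGCTTG"


def index_pair(header: str) -> tuple[str, str]:
    """Return the two index sequences from the last ':'-separated header field."""
    index_field = header.rstrip().rsplit(":", 1)[1]  # e.g. "TCTTACCACG+TTACGTGAGC"
    first, second = index_field.split("+")
    return first, second


def fill_templates(header: str) -> tuple[str, str]:
    """Substitute the indices into the P5 and P7 templates.

    ASSUMPTION: the first index is i5 and the second is i7, following the
    labelling in this post; the orientation of each index is not checked.
    """
    i5, i7 = index_pair(header)
    return P5_TEMPLATE.replace("[i5]", i5), P7_TEMPLATE.replace("[i7]", i7)


header = "@LH00409:280:22KV5JLT4:6:1101:46681:1070 1:N:0:TCTTACCACG+TTACGTGAGC"
p5_full, p7_full = fill_templates(header)
print(p5_full)
print(p7_full)
```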
From what I understand, just specifying the first part up to the barcode should be enough, so I ran Atria with these parameters:
--adapter1 AATGATACGGCGACCACCGAGATCTACAC
--adapter2 GATCGGAAGAGCACACGTCTGAACTCCAGTCAC
And it _still_ filtered out sequences, albeit a small number, in each sample (e.g. 35998 out of 22092364 in one sample).
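For scale, 35998 out of 22092364 is roughly 0.16% of the reads. One crude way to cross-check a number like this independently of any trimmer is to count the reads that contain the first few bases of an adapter. The sketch below does only exact substring matching on a seed (the 12-base seed length is an arbitrary choice of mine), so it will miss partial adapters at the very end of a read and anything with a mismatch. It is a sanity check, not a replacement for Atria or Cutadapt.

```python
def count_adapter_reads(lines, adapter: str, seed_len: int = 12) -> tuple[int, int]:
    """Return (reads containing the adapter seed, total reads).

    `lines` is any iterable of FASTQ lines (4 lines per record).
    Only an exact match of the first `seed_len` adapter bases is counted.
    """
    seed = adapter[:seed_len]
    hits = total = 0
    for i, line in enumerate(lines):
        if i % 4 == 1:  # the second line of each record is the sequence
            total += 1
            if seed in line:
                hits += 1
    return hits, total


# Two made-up reads: the first carries the start of the --adapter2 sequence.
demo = [
    "@read1", "ACGTACGTGATCGGAAGAGCACAC", "+", "IIIIIIIIIIIIIIIIIIIIIIII",
    "@read2", "ACGTACGTACGT", "+", "IIIIIIIIIIII",
]
hits, total = count_adapter_reads(demo, "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC")
print(f"{hits} of {total} reads contain the seed ({100 * hits / total:.2f}%)")
```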
Now, my questions are:
- Have adapters been trimmed?
- If not, am I trimming them properly, i.e. have I specified the adapters correctly as arguments?
- Are the autodetected adapter sequences false positives?
- With all the forward, reverse, 5', 3' stuff, I'm starting to get everything mixed up. How would I specify these adapters to Cutadapt, which asks me for Read 1, Read 2, and within each, 5', 3', or anywhere?
And additionally, am I correct in understanding that, after the adapters have been taken care of, it's not necessary to quality trim this data due to how RNA STAR works?
Excellent, thank you very much.
By the way, just for my education, in this case, would my way of specifying adapters be correct? And how would they be specified to `cutadapt`?
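On the orientation question, here is a sketch under the usual paired-end assumption that adapter read-through shows up at the 3' end of each read. As far as I understand the layout, Read 1 then runs into the start of the P7-side sequence (the part before `[i7]`), and Read 2 runs into the reverse complement of the part of the P5-side sequence after `[i5]`. In Cutadapt, `-a` gives a 3' adapter for Read 1 and `-A` a 3' adapter for Read 2. The file names below are placeholders, and the command is only printed, not run.

```python
def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA sequence (A<->T, C<->G, N stays N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


# Pieces copied from the two adapter lines in the report.
P7_HEAD = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"  # P7 line, before [i7]
P5_TAIL = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"  # P5 line, after [i5]

# ASSUMPTION: standard paired-end read-through at the 3' ends.
read1_adapter = P7_HEAD
read2_adapter = reverse_complement(P5_TAIL)

# Hypothetical input/output file names; adjust to the real samples.
cmd = [
    "cutadapt",
    "-a", read1_adapter,          # 3' adapter for Read 1
    "-A", read2_adapter,          # 3' adapter for Read 2
    "-o", "trimmed_R1.fastq.gz",  # trimmed Read 1 output
    "-p", "trimmed_R2.fastq.gz",  # trimmed Read 2 output
    "sample_R1.fastq.gz",
    "sample_R2.fastq.gz",
]
print(" ".join(cmd))
```

If this reasoning holds, both 3' adapters start with nearly the same `GATCGGAAGAGC` stretch, which would be the part worth looking for in the reads.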