Question

RNASeq: Confused about adapters with barcodes and seq company's filtering

4 months ago

Davor • 0

Hi, I'm a novice trying to analyse RNASeq data of rat brain regions that I got from Novogene. I ran Falco and the MultiQC report seems pretty decent:

[Image: samples' quality scores]

So I proceeded to the trimming step. From what I gathered, quality trimming is perhaps not even necessary, but it's a good idea to trim adapter sequences, and it's best to do so using the adapter sequences provided by the company. Looking at the report, they seem to have done some filtering of their own:

(1) Remove reads containing adapters.
(2) Remove reads containing N > 10% (N represents the base cannot be determined).
(3) Remove reads containing low quality (Qscore<= 5) base which is over 50% of the total base.

And then there are a few pie charts listing the number and percentage of "Clean Reads", "containing Ns", "Low Quality", and "Adapter Related" for each sample. Based on the numbers, and the fact that the <is_filtered> field is always N in the FASTQ sample files, I conclude that I got filtered data.

However, Falco still detects adapter sequences (the four outliers all come from one sample: Illumina Universal Adapter and PolyG, from forward and reverse FASTQs - no idea why):

[Image: MultiQC - samples' adapter sequences]

Then, I tried trimming with Atria. The report file says this about adapter sequences:

P5 adapter:
P5 -> P7' (5' -> 3')
AATGATACGGCGACCACCGAGATCTACAC[i5]ACACTCTTTCCCTACACGACGCTCTTCCGATCT

P7 adapter:
P5 -> P7' (5' -> 3')
GATCGGAAGAGCACACGTCTGAACTCCAGTCAC[i7]ATCTCGTATGCCGTCTTCTGCTTG

Where i5 and i7 are sample-specific sequences listed in the FASTQ files themselves, e.g.

@LH00409:280:22KV5JLT4:6:1101:46681:1070 1:N:0:TCTTACCACG+TTACGTGAGC
                                                  [i5]       [i7]
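
For reference, a minimal Python sketch of how that header line splits into fields, assuming the standard Illumina (Casava 1.8+) layout of `read:is_filtered:control_number:index` after the space. The class and function names are made up for illustration, and the two index sequences are returned in the order they appear without deciding which one is i5 or i7:

```python
from typing import NamedTuple


class ReadHeader(NamedTuple):
    read_number: str     # "1" or "2"
    is_filtered: str     # "Y" = read failed the instrument's filter, "N" = it passed
    control_number: str
    indices: list[str]   # index sequences in header order, split on "+"


def parse_illumina_header(line: str) -> ReadHeader:
    """Parse the comment part of a Casava 1.8+ style FASTQ header line."""
    # The read ID and the comment are separated by the first whitespace.
    _read_id, comment = line.rstrip("\n").split(maxsplit=1)
    read_number, is_filtered, control_number, index = comment.split(":", 3)
    return ReadHeader(read_number, is_filtered, control_number, index.split("+"))


header = parse_illumina_header(
    "@LH00409:280:22KV5JLT4:6:1101:46681:1070 1:N:0:TCTTACCACG+TTACGTGAGC"
)
print(header.is_filtered, header.indices)
```

In this layout the filter flag describes the sequencer's own pass-filter step, so on its own it may not say much about filtering done afterwards by the sequencing provider.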

From what I understand, just specifying the first part up to the barcode should be enough, so I ran Atria with these parameters:

--adapter1 AATGATACGGCGACCACCGAGATCTACAC
--adapter2 GATCGGAAGAGCACACGTCTGAACTCCAGTCAC

And it _still_ filtered out sequences, albeit a small number, in each sample (e.g. 35998 out of 22092364 in one sample).

Now, my questions are:

Have adapters been trimmed?
If not, am I trimming them properly, i.e. have I correctly specified the adapters as arguments?
Are the autodetected adapter sequences false positives?
With all the forward, reverse, 5', 3' stuff, I'm starting to get everything mixed up. How would I specify these adapters to Cutadapt, which asks me for Read 1, Read 2, and within each, 5', 3', or anywhere?

And additionally, am I correct in understanding that, after the adapters have been taken care of, it's not necessary to quality trim this data due to how RNA STAR works?

atria rnaseq adapter-trimming • 709 views

updated 4 months ago by GenoMax 152k • written 4 months ago by Davor • 0

score 2 · Accepted Answer · 2025-03-18

> Illumina Universal Adapter and PolyG,

Poly-G reads are a common observation with 2-color chemistry (where no signal is called as G). These reads should be discarded by aligners when they align the data. If you want to deal with these spurious reads, you can use a tool called polyfilter.sh from the BBMap suite (see the thread "New Illumina error mode, new BBTools release (39.09) to deal with it"). If you do get BBTools, you could also use bbduk.sh, which is its scanning/trimming program. Look at the in-line help to understand how to use it. There are many threads here that show basic bbduk commands. fastp is another good option.
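
To illustrate what a poly-G artifact looks like in a read, here is a toy Python sketch. It is not how polyfilter.sh, bbduk.sh, or fastp work internally; the function names and the thresholds (`min_run`, `threshold`) are arbitrary assumptions chosen for the example:

```python
def trim_polyg_tail(seq: str, min_run: int = 10) -> str:
    """Remove a trailing run of G's if it is at least `min_run` bases long."""
    stripped = seq.rstrip("G")
    run_length = len(seq) - len(stripped)
    return stripped if run_length >= min_run else seq


def is_mostly_g(seq: str, threshold: float = 0.9) -> bool:
    """Flag reads that consist almost entirely of G (likely no-signal reads)."""
    return bool(seq) and seq.count("G") / len(seq) >= threshold


# Made-up reads: a poly-G tail, a short natural G run, and an all-G read.
reads = ["ACGTTGCA" + "G" * 20, "ACGTTGCAGG", "G" * 50]
for read in reads:
    print(is_mostly_g(read), repr(trim_polyg_tail(read)))
```

For real data one of the dedicated tools mentioned above is the better choice, since they handle quality scores, paired reads, and sequencing errors inside the G run.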

> am I correct in understanding that, after the adapters have been taken care of, it's not necessary to quality trim this data due to how RNA STAR works?

Correct. Strictly speaking, you do not need to trim, for quality or otherwise. Most aligners should soft-clip the parts of reads that do not align to the reference (which should remove all extraneous sequence), so those parts would be "taken care of" at the time of alignment.
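
If you want to see how much was soft-clipped after alignment, the CIGAR column of the SAM/BAM output records it as `S` operations. A minimal Python sketch follows, assuming you already have CIGAR strings in hand; the helper name and the example strings are hypothetical, not taken from real alignments:

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_bases(cigar: str) -> int:
    """Total number of read bases marked as soft-clipped (S) in a CIGAR string."""
    return sum(int(length) for length, op in CIGAR_OP.findall(cigar) if op == "S")


# Fully aligned read, a clipped 3' end, and a spliced read clipped at both ends.
for cigar in ["150M", "120M30S", "5S100M2000N40M5S"]:
    print(cigar, soft_clipped_bases(cigar))
```

A read with a large `S` count at its 3' end is a typical sign of leftover adapter or other non-reference sequence that the aligner ignored.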

If you ever wish to do any de novo work you will want to completely remove extraneous sequences before you do any additional work.
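
As a rough picture of what removing that extraneous sequence means for paired-end reads, here is a toy Python sketch. It assumes the usual read-through situation, where a short insert makes read 1 run into the P7-side adapter and read 2 run into the reverse complement of the P5-side adapter. The adapter strings are the ones quoted in the question; the insert, the function names, and the exact-match logic are made up for illustration and are much cruder than what real trimmers do:

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def trim_3prime_adapter(read: str, adapter: str, min_overlap: int = 8) -> str:
    """Cut the read at the first exact match of the adapter's leading bases."""
    # Exact matching only; real trimmers also allow mismatches and
    # partial adapter matches at the very end of the read.
    pos = read.find(adapter[:min_overlap])
    return read if pos == -1 else read[:pos]


# Parts of the adapter sequences quoted in the question.
P7_SIDE = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"  # follows the insert in read 1
P5_TAIL = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"  # P5 adapter after [i5]
READ2_ADAPTER = reverse_complement(P5_TAIL)    # follows the insert in read 2

insert = "TTGACCAGTCCATGGA"  # made-up insert, shorter than the read length
read1 = insert + P7_SIDE
read2 = reverse_complement(insert) + READ2_ADAPTER

print(READ2_ADAPTER)
print(trim_3prime_adapter(read1, P7_SIDE) == insert)
print(trim_3prime_adapter(read2, READ2_ADAPTER) == reverse_complement(insert))
```

In practice this job belongs to bbduk.sh, fastp, or a similar trimmer; the sketch only shows which sequence ends up at the 3' end of each read.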