Question

Can edgeR be used to find (over/under)represented sequences in fastq files?

1

Entering edit mode

9 months ago

AquaDeath ▴ 80

Hi, everyone!

I'm a young bioinformatician completely new to Biostars (first post!). The title sums up my question: I'd like to find a procedure to find short nucleotide sequences (10-15 bp) overrepresented or underrepresented in fastq files from RNA-seq. Basically, I'm not sure if edgeR could be suitable for this task.

For more context, my concrete case has an experimental design with control RNA samples (control.fastq) and other samples affected by a protein with RNase activity (case.fastq). The goal is to characterize if this RNase is preferent towards some concrete nucleotide sequence. I figured I could program in Python a script to count all sequences of nucleotides with N length in both control and case fastq files, and store them in a counts matrix (similar to the SummarizedExperiment class, but rows would be all possible nucleotide sequences instead of genes). This counts matrix can be loaded into R and apply edgeR GLM to extract sequences differentially represented in the files. However, I feel like maybe this method is overlooking some statistical considerations that edgeR internally uses. Until now I've only worked with edgeR using matrices with genes as rows, but never with concrete nucleotide sequences...

Thanks in advance for any help regarding this matter!

fastq edgeR rnaseq • 1.1k views

ADD COMMENT • link 9 months ago by AquaDeath ▴ 80

score 1 · Answer 1 · 2024-02-15

1

Entering edit mode

9 months ago

Kevin ▴ 100

I'm assuming your data for R1 looks something like [insert][adapter]. If that's the case, just focus only on R1: trim the adapter (using cutadapt or similar), then use fastq-uniq (from fastq-tools package). After that you can just do statistics, play around, generate and test hypotheses. It's a quick and dirty approach, but should be sufficient for you to get a feel for your data.

IMO no use doing differential expression analyses for this problem. I used to work with data like this all the time when I was developing small RNA-seq kits.

ADD COMMENT • link 9 months ago by Kevin ▴ 100

1

Entering edit mode

Thanks for your reply Kevin! My R1 files actually are already quality trimmed, so no adapters in the sequences (I already checked with fastqc). From fastq-tools man I've seen that fastq-uniq generates a fastq file whose headers contain the ocurrences of each individual read. However, the RNase activity probably targets a much shorter RNA region than the whole span of the read. Could you clarify how this file could serve to fetch for preferential target sequences (should be less frequent in case.fastq files respect to control.fastq)?

ADD REPLY • link 9 months ago by AquaDeath ▴ 80

1

Entering edit mode

Could you pls clarify what your library prep workflow looks like? Are you treating with an RNase (which RNase?) and using T4 PNK and a ligation-based prep? Totally guessing here but I was envisioning that the trimmed reads would exactly match your 10-15 nt fragments of interest.

ADD REPLY • link 9 months ago by Kevin ▴ 100

1

Entering edit mode

Sure. The RNA-seq was performed with Illumina paired-end (HiSeq), so in avarage reads are about 140-150 bp long. The samples are not per se treated with an RNase: cells are expressing a protein which collaterally cuts RNA upon activation. In other words, the RNase activity I'm investigating is not part of the library prep, but rather a biological condition which I want to characterize. Sry for not being clearer in that regard earlier.

ADD REPLY • link 9 months ago by AquaDeath ▴ 80

1

Entering edit mode

Do you know the exact name of the library prep kit or workflow?

I ask because only a "small RNA"-style prep is going to detect real 10-15 nt species which were present in the sample. In a standard RNA-seq (i.e. TruSeq) workflow, anything small is going to get removed during library prep cleanups.

If you're instead looking for 10-15nt k-mers across your 150bp reads, then you just want to use a k-mer counting algorithm. But as far as I understand your problem, that's not what you want.

ADD REPLY • link 9 months ago by Kevin ▴ 100

1

Entering edit mode

The RNA-seq workflow is standard, as we wanted to characterize differentially expressed genes with this sequencing run. Only ribosomal RNA was removed, and no consideration was made regarding small RNAs. The last paragraph of your message is exactly what I'm looking for with the files, small k-mers in reads that should have less coverage due to the RNase activity (if there are any preferential sequence, that is).

ADD REPLY • link 9 months ago by AquaDeath ▴ 80

1

Entering edit mode

OK you might want to try something like Jellyfish: https://github.com/gmarcais/Jellyfish

ADD REPLY • link 9 months ago by Kevin ▴ 100

0

Entering edit mode

Nice! Something like this tool is exactly what I was going to program on my own. Thanks and sorry for all the trouble in the other comments.

Now, when I obtain the k-mer count for each fastq file, is edgeR still suitable to obtain k-mers differentially represented in the samples? In other words, do k-mer counts still follow a negative binomial distribution similar to gene counts?

ADD REPLY • link 9 months ago by AquaDeath ▴ 80