Question

Is this featureCounts output normal, and how can I process it for DESeq2 analysis?

0

Entering edit mode

4 days ago

DdogBoss ▴ 20

Question:
Is this featureCounts output normal, and how can I process it for DESeq2 analysis?

Hello,
I am working on RNA-seq data from a mouse genome and have used featureCounts to generate a count matrix for six samples (three controls and three knockdown). My command was:

featureCounts -a mouse.gff -o counts.txt sample1.bam sample2.bam sample3.bam sample4.bam sample5.bam sample6.bam

The output file includes a column for Geneid, as well as columns for Chr, Start, End, Strand, and Length, but the six sample columns (expected to contain counts) seem unusual. Instead of numeric counts for genes, I see repeated entries like NC_000067.7 like so:

 NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;

Where NC_000067.7 is the chromosome. There are repeated entries for other columns as well.

Questions:

Is this normal output for featureCounts? If not, what could have gone wrong?
Assuming this is resolved and I get a proper count matrix, how can I prepare this file for differential expression analysis in DESeq2? Should I modify or filter the file in any way before importing it into R?

The file is extremely large being about 50M with awk '{ print length($0) }' printing this output with only 3 lines: 222 148 51520976

I have tried to chunk the large output file for downstream deseq analysis, but have had trouble with the strange formatting.

I am using the following:

Mouse genome: RefSeq GRCm39
GFF file, retrieved with the mouse genome through the NCBI datasets command line tool.

Any guidance would be greatly appreciated! Let me know if more details about my data or commands are needed.

rna-seq featurecounts DESeq2 • 392 views

ADD COMMENT • link updated 4 days ago by Gordon Smyth ★ 7.7k • written 4 days ago by DdogBoss ▴ 20

score 1 · Accepted Answer · 2024-11-22

1

Entering edit mode

4 days ago

Gordon Smyth ★ 7.7k

Yes, that is normal output.

Have you considered using the simpler R-based interface:

library(Rsubread)
library(edgeR)
files <- c("sample1.bam","sample2.bam","sample3.bam","sample4.bam","sample5.bam","sample6.bam")
fc <- featureCounts(files, annot.ext="mouse.gff", isGTFAnnotationFile=TRUE)
y <- featureCounts2DGEList(fc)

This will give you a complete object in R including annotation as well as a matrix of counts.

ADD COMMENT • link 4 days ago by Gordon Smyth ★ 7.7k

0

Entering edit mode

I will give edgeR a go to see if the resulting feature counts file is easier to handle than in bash.

When I try to import the current feature counts output file into R to at least look at it and figure out how to chop it up to make downstream analysis easier, R hangs because the file is so large.

ADD REPLY • link 4 days ago by DdogBoss ▴ 20

1

Entering edit mode

I can read 50MB files into R in a few seconds on my laptop. Is your R session very short on memory?

Anyway, you can undertake a DE analysis in a few lines, without writing or processing any featureCounts file. The following code follows on immediately from the code in my answer:

Genotype <- factor(c("Control","Control","Control","KD","KD","KD"))
design <- model.matrix(~Genotype)
keep <- filterByExpr(y,design)
y2 <- y[keep,,keep.lib.size=FALSE]
y2 <- normLibSizes(y2)
fit <- glmQLFit(y2,design)
q <- glmQLFTest(fit)
topTags(q)

ADD REPLY • link 4 days ago by Gordon Smyth ★ 7.7k