Question:
Is this featureCounts output normal, and how can I process it for DESeq2 analysis?
Hello,
I am working on RNA-seq data from a mouse genome and have used featureCounts to generate a count matrix for six samples (three controls and three knockdown). My command was:
featureCounts -a mouse.gff -o counts.txt sample1.bam sample2.bam sample3.bam sample4.bam sample5.bam sample6.bam
The output file includes a column for Geneid
, as well as columns for Chr
, Start
, End
, Strand
, and Length
, but the six sample columns (expected to contain counts) seem unusual. Instead of numeric counts for genes, I see repeated entries like NC_000067.7
like so:
NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;NC_000067.7;
Where NC_000067.7 is the chromosome. There are repeated entries for other columns as well.
Questions:
- Is this normal output for featureCounts? If not, what could have gone wrong?
- Assuming this is resolved and I get a proper count matrix, how can I prepare this file for differential expression analysis in DESeq2? Should I modify or filter the file in any way before importing it into R?
The file is extremely large being about 50M with awk '{ print length($0) }'
printing this output with only 3 lines:
222
148
51520976
I have tried to chunk the large output file for downstream deseq analysis, but have had trouble with the strange formatting.
I am using the following:
- Mouse genome: RefSeq GRCm39
- GFF file, retrieved with the mouse genome through the NCBI datasets command line tool.
Any guidance would be greatly appreciated! Let me know if more details about my data or commands are needed.
I will give edgeR a go to see if the resulting feature counts file is easier to handle than in bash.
When I try to import the current feature counts output file into R to at least look at it and figure out how to chop it up to make downstream analysis easier, R hangs because the file is so large.
I can read 50MB files into R in a few seconds on my laptop. Is your R session very short on memory?
Anyway, you can undertake a DE analysis in a few lines, without writing or processing any featureCounts file. The following code follows on immediately from the code in my answer: