I have 8 count matrices (from bacterial metagenomic DNA sequencing) containing fragment counts, i.e. the number of fragments aligned to each gene. I obtained fragment counts, as opposed to read counts, using featureCounts. I have normalized for gene length (longer genes attract more reads) by dividing the fragment count by the gene length. However, because the FASTQ files vary in size, I wonder whether I should normalize for that too. My guess is that the bigger FASTQ files will map more reads to the contigs, inflating their counts relative to the samples with smaller FASTQ files. My final goal is to compare gene abundances BETWEEN the 8 samples, so relative numbers are fine.
All 8 samples were sequenced the same way and come from the same environment, but at different time points. The FASTQ files still vary in size by a couple of hundred MB.
Thanks for your response. I will look into that. Do you suggest I do any other form of normalisation? I read about TPM, RPKM and FPKM. Or do you think normalising for JUST gene length is sufficient in this type of study? In the mentioned techniques, READ length is taken into account, but since the read length is the same for all the samples, I suppose that part is redundant?
The rest of the normalization should follow the recommendations of whichever metagenomics package you use, so the details will depend on the package. Additional normalization for gene length, using TPM for example, still makes sense even if you first downsample to the same number of reads.
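
To make the TPM suggestion concrete, here is a minimal sketch in plain Python. The function name `tpm`, the toy counts and the gene lengths are all made up for illustration; in practice the counts and lengths would come from your featureCounts output. It divides each fragment count by the gene length in kilobases and then rescales each sample so its values sum to one million, which is what removes the effect of different FASTQ sizes.

```python
def tpm(counts, gene_lengths_bp):
    """Return TPM values for one sample.

    counts: fragment counts per gene (hypothetical input)
    gene_lengths_bp: gene lengths in base pairs, same order as counts
    """
    # Step 1: length-normalize, giving fragments per kilobase of gene.
    rates = [c / (length / 1000.0) for c, length in zip(counts, gene_lengths_bp)]
    total = sum(rates)
    if total == 0:
        raise ValueError("sample has no mapped fragments")
    # Step 2: depth-normalize, so every sample sums to one million.
    return [r / total * 1e6 for r in rates]


# Toy data (assumed): sample_b has the same composition as sample_a
# but was sequenced twice as deeply.
gene_lengths = [1000, 2000, 3000]
sample_a = [10, 20, 30]
sample_b = [20, 40, 60]

tpm_a = tpm(sample_a, gene_lengths)
tpm_b = tpm(sample_b, gene_lengths)
print(tpm_a)
print(tpm_b)
```

In this toy case the two samples end up with the same TPM values even though one has twice the raw counts. Dividing by gene length alone would leave that twofold difference in place. Keep in mind that TPM values are relative within each sample, so they describe a gene's proportion of the sample, not its absolute abundance.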