Question

Normalization of read counts

7.0 years ago

dllopezr ▴ 130

Hi everyone

I am working on a method to assign reads to enzymes based on read-versus-gene sequence alignment.

I know the average gene size is about 1000 nucleotides and my read length is 100, so if I get hits at three different positions of a gene, it is not safe to say that this gene is present 3 times, because the reads may all come from the same gene copy fragmented into three pieces.

However, I believe that a normalization could be done using information about taxonomic marker genes. For example: if I found that the 16S gene of Bacillus subtilis is present 35 times and the number of nirK hits for this species is 3500, the ratio 3500/35 could give me a normalized count of the nirK genes present in the sample.
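
As a minimal sketch of the ratio described above (not an established tool), the calculation could look like the following. The function names are hypothetical, and the gene lengths passed to the length-corrected variant (1000 bp for nirK, 1500 bp for 16S) are assumptions for illustration only; they are not values from my data.

```python
def marker_ratio(gene_hits: int, marker_hits: int) -> float:
    """Gene hits per marker-gene hit (e.g. nirK hits per 16S hit)."""
    if marker_hits <= 0:
        raise ValueError("marker_hits must be positive")
    return gene_hits / marker_hits


def length_corrected_marker_ratio(
    gene_hits: int,
    gene_length_bp: int,
    marker_hits: int,
    marker_length_bp: int,
) -> float:
    """Same ratio, but on hits per base pair, so that a longer gene
    does not look more abundant just because it collects more reads."""
    if gene_length_bp <= 0 or marker_length_bp <= 0:
        raise ValueError("gene lengths must be positive")
    if marker_hits <= 0:
        raise ValueError("marker_hits must be positive")
    gene_density = gene_hits / gene_length_bp
    marker_density = marker_hits / marker_length_bp
    return gene_density / marker_density


# Numbers from the example above: 3500 nirK hits, 35 hits on the 16S gene.
print(marker_ratio(3500, 35))

# Assumed lengths (hypothetical): nirK ~1000 bp, 16S ~1500 bp.
print(length_corrected_marker_ratio(3500, 1000, 35, 1500))
```

Note that a marker gene can also be present in several copies per genome, which this sketch does not correct for.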

I have gone through the literature, but I find that taxonomic profiling tools report species as relative abundances, which does not seem useful for me.

Any ideas?

Thank you for your help

read counts • 1.4k views

ADD COMMENT • link 7.0 years ago by dllopezr ▴ 130

I definitely don't get all the details of your question -- could you perhaps clarify a) what the final desired outcome is (i.e., do you want to know the absolute expression strength of a single gene?), b) what data you have at hand (RNA-seq? qPCR? something else?).

ADD REPLY • link 7.0 years ago by Friederike 9.0k

Hi Friederike

I have shotgun metagenomic DNA sequences from a soil microbial community, so the reads could belong to any bacterial, archaeal or fungal genome and to any gene (or non-genic region).

I have a local database with the coding sequences of a pool of genes of interest, mainly enzymes of the carbon, nitrogen and phosphorus biogeochemical cycles. The workflow is to align the reads against the local database and obtain significant hits that reflect an estimate of the abundance of these genes in several samples.

An ideal output would be:

Nitrate reductase of Bacillus subtilis: 35 counts
Nitrate reductase of Bacillus amyloliquefaciens: 55 counts
Acid Phosphatase of Bacillus subtilis: 50 counts
Acid Phosphatase of Bacillus amyloliquefaciens: 25 counts
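
A rough sketch of how such a table could be tallied from the significant hits, assuming each hit has already been reduced to a (read ID, enzyme, species) tuple. The read IDs and the `tally_hits` helper are hypothetical, and the example rows are made up to show the shape of the data, not real results.

```python
from collections import Counter

# Hypothetical filtered hits: (read_id, enzyme, species).
hits = [
    ("read_001", "Nitrate reductase", "Bacillus subtilis"),
    ("read_002", "Nitrate reductase", "Bacillus subtilis"),
    ("read_002", "Nitrate reductase", "Bacillus subtilis"),  # same read reported twice
    ("read_003", "Acid phosphatase", "Bacillus amyloliquefaciens"),
]


def tally_hits(hits):
    """Count distinct reads per (enzyme, species) pair."""
    unique_hits = set(hits)  # a read counts once per enzyme/species pair
    return Counter((enzyme, species) for _read_id, enzyme, species in unique_hits)


counts = tally_hits(hits)
for (enzyme, species), n in sorted(counts.items()):
    print(f"{enzyme} of {species}: {n} counts")
```

These are raw read counts, so they still carry the length and fragmentation issue described in this thread.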

My question is based on my "fear" of reporting an overestimation of gene abundances: if I get significant hits for three reads in different, non-overlapping sections of a gene, I don't know whether these reads belong to one gene copy that has been fragmented three times, in which case the count should be 1, or to three gene copies, in which case the count should be 3.

So I am searching for a method to normalize this information, and taxonomic information on the absolute abundances of organisms would be helpful.
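
To put the 1-versus-3 question in numbers, one option is to convert read hits into average coverage depth over the gene instead of counting copies directly. This is a different calculation from the marker-gene ratio and is only a sketch: `mean_coverage` is a hypothetical helper, and it assumes every hit aligns over its full read length. The read length (100) and gene length (1000) are the values mentioned in the question.

```python
def mean_coverage(n_reads: int, read_length_bp: int, gene_length_bp: int) -> float:
    """Average number of times each base of the gene is covered by reads."""
    if gene_length_bp <= 0:
        raise ValueError("gene_length_bp must be positive")
    return (n_reads * read_length_bp) / gene_length_bp


# Three non-overlapping 100 bp reads on a 1000 bp gene.
print(mean_coverage(3, 100, 1000))

# Ten such reads are needed to cover the gene once on average.
print(mean_coverage(10, 100, 1000))
```

Under these assumptions, three reads on a 1000-nucleotide gene amount to less than one full copy of coverage, so they would be counted as neither 1 nor 3 copies but as a fraction that can be compared between genes of different lengths.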

ADD REPLY • link 7.0 years ago by dllopezr ▴ 130

You should probably add this info to your original post and also mention "microbiome" in the title.

I still don't fully understand what you mean with this:

if I get significant hits for three reads in different, non-overlapping sections of a gene, I don't know whether these reads belong to one gene copy that has been fragmented three times, in which case the count should be 1, or to three gene copies, in which case the count should be 3

but maybe people with experience in microbiome analyses will. I do see your point, but I fail to understand the fear. The fragmentation rates should be more or less the same for the different genes, and the likelihood of a given fragment being sequenced should also be roughly the same (GC and length biases put aside). For an individual gene, you may very well fall into the trap you described (because it might be at either extreme of the distribution), but generally, I don't think it's something that people worry about. But again, I might be completely missing the point here. Pinging people with microbiome expertise should hopefully be of more help.

ADD REPLY • link 7.0 years ago by Friederike 9.0k