Question

Too Many Genes Under Positive Selection

1

Entering edit mode

11.8 years ago

Plantae ▴ 390

Our lab sequenced 7 genomes from the same organism, the divergence time between these genomes is less than 0.5 Mya.
~ 42000 ortholog gene families were constructed from these genomes,
to find positive selected genes,
I use codeml M2a vs M1a to test (df=2) for positive selection,

using a FDR cutoff of 0.01, I got ~16000 (38%) genes that are under positive selction,
does codeml suitable for our dataset (genes come from different strains of the same species)?

the control file i specified is:

seqfile = input.seq
treefile = input.tree
outfile = mlc

noisy = 3
verbose = 1
runmode = 0

seqtype = 1
CodonFreq = 2
clock = 0
model = 1 2

NSsites = 2
icode = 0
fix_omega = 0
omega = .9
fix_kappa = 0
kappa = .3
cleandata = 1``

paml codeml selection • 4.9k views

ADD COMMENT • link updated 11.8 years ago by Rahul Sharma ▴ 660 • written 11.8 years ago by Plantae ▴ 390

0

Entering edit mode

How did you obtain individual CDS from 14000 orthologs after assembling the genomes? I'm trying to do something similar using codeml, but I only know how to obtain single genes at a time. Recently I obtained genome data so I would like to obtain all CDS and apply codeml to all of them - would you mind sharing a bit of how you process such a vast number? I use bwa mem for my assembly.

ADD REPLY • link 5.3 years ago by DNAngel ▴ 250

score 4 · Answer 1 · 2013-02-27

4

Entering edit mode

11.8 years ago

Giovanni M Dall'Olio 28k

Have a look at the alignments. Most of the times, weird dN/dS scores are caused by errors in the alignment.

ADD COMMENT • link 11.8 years ago by Giovanni M Dall'Olio 28k

score 2 · Answer 2 · 2013-02-27

2

Entering edit mode

11.8 years ago

Bioch'Ti ★ 1.1k

Hi, I agree with Giovanni, but you should also check for paralogs... which will inflate the ratio. Good luck

ADD COMMENT • link 11.8 years ago by Bioch'Ti ★ 1.1k

score 2 · Answer 3 · 2013-03-01

2

Entering edit mode

11.8 years ago

aidan-budd 1.9k

This article by Will and Ziheng I think does a good job at exploring and highlighting common (alignment-based) sources of error for analyses of this kind. It would probably be useful for you to look through it.

ADD COMMENT • link 11.8 years ago by aidan-budd 1.9k

score 1 · Answer 4 · 2013-02-28

1

Entering edit mode

11.8 years ago

jprmachado ▴ 80

Hi,

agree with both previous replies. Consider also to filter de alignment. Try Gblocks for example, but there is more tool such as Gblocks available

Good luck

ADD COMMENT • link 11.8 years ago by jprmachado ▴ 80

score 1 · Answer 5 · 2013-03-01

1

Entering edit mode

11.8 years ago

Rahul Sharma ▴ 660

HI,

Finding positively selected genes is very tricky. As you have mentioned the site model (M2 and M1) comparison, have you tried other models M8 and M7? I would also try the Branch-site model (Test2) of codeml and later statistics with LRT, BC and FDR. Then would use only those genes, which are having at least one positively selected site with BEB confidence >95% or >99%. In my analysis, I first used the prank-codon alignments for MSA of the orthologs. Branch site model --> LRT, BC and FDR(Picked genes with 1% FDR) ---> Check the genes having atleast one site with >99% BEB site --> Finally got 6%, 4%, 3% and 2% of positively selected genes in four genomes. This paper I found very interesting: http://petrov.stanford.edu/pdfs/77.pdf

Regards, Rahul

ADD COMMENT • link 11.8 years ago by Rahul Sharma ▴ 660

0

Entering edit mode

Sorry, but what is BC?

  Sincerely,
         Kang

ADD REPLY • link 7.9 years ago by dukecomeback ▴ 40

0

Entering edit mode

I asked users above to but how do you obtain all the protein-coding sequences after assembling the genome? I want to try something similar to you all with codeml, but so far I only know how to assemble my genome data using a CDS reference sequence...

ADD REPLY • link 5.3 years ago by DNAngel ▴ 250