Roary producing erroneous plots?
5.7 years ago
Lesley Sitter ▴ 610

Hi there, been using Roary for 3 years now... awesome tool!

I'm just curious why the plots it produces aren't single values. For example, the "Number of unique genes" plot should just show one fixed value, the number of unique genes, am I right? So why is it a collection of box-and-whisker plots per genome, sometimes even with outlier points? How can there be a range of values per genome unless Roary calculates it for different nucleotide identity values or something? It's not as if it also produces a range of gene_presence_absence.csv files.

I'm just curious if anyone knows what these plot ranges are based on.

EDIT: I looked all over, but there is no explanation anywhere of where the ranges come from. So I ended up writing an extensive R script that just calculates the values from the gene_presence_absence.csv file, but I would still be interested in knowing the reason for this weird output if anyone knows.

Roary R • 2.0k views

I've used Roary a fair bit. None of the plots seemed unusual to me, but I'm struggling to picture them now. Can you show an example of the plot you mean specifically?


For example, this is the default plot you get for new genes per genome. It's a box-and-whisker plot, meaning that each genome has a "range" of new genes, which is of course totally absurd unless there is some sort of "threshold" across which Roary analyzes these new genes (for example, a range of different identity scores).

[Figure: New genes per genome]

But the troubling one for me was the "unique" genes per genome plot. Unique is singular, so I'm really confused about where this range comes from.

[Figure: Unique genes]


And when I convert gene_presence_absence.csv to a binary matrix, keep only the rows (orthologous groups) that have a single entry, and then count per genome how many of those rows belong to it, my plot looks nothing like this. So even the values these plots should represent, based on the presence/absence matrix, are not in them :S
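The counting described here can be sketched in a few lines of Python. This is only a minimal illustration on made-up toy data, not the poster's R script and not anything Roary runs itself; the `presence` dictionary and the function name `unique_genes_per_genome` are hypothetical stand-ins for a binary matrix derived from gene_presence_absence.csv.

```python
from collections import Counter

# Toy presence/absence data (hypothetical, not real Roary output):
# orthologous group -> set of genomes in which it is present.
presence = {
    "geneA": {"g1", "g2", "g3"},
    "geneB": {"g1"},
    "geneC": {"g2", "g3"},
    "geneD": {"g3"},
    "geneE": {"g3"},
}

def unique_genes_per_genome(presence):
    """Count, for each genome, the groups that occur in that genome only."""
    counts = Counter()
    for genomes in presence.values():
        if len(genomes) == 1:
            # The group has a single entry, so credit its only genome.
            counts[next(iter(genomes))] += 1
    return dict(counts)

unique_counts = unique_genes_per_genome(presence)
print(unique_counts)
```

Computed this way, every genome gets exactly one number, which is why a box plot per position looks surprising at first.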


5.7 years ago
Joe 21k

I spoke to Andrew, the lead dev for the tool.

The reason they are box plots is that the impact each new genome has on the size of the core/accessory genome depends on the order in which you consider the genomes. So the genomes are randomly sampled N times, and the impact they have is shown in the plots as a box plot/distribution.
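To make the order dependence concrete, here is a rough Python sketch of the idea. It is not Roary's actual code: the toy `genomes` data, the function names, the default of 10 iterations, and the fixed seed are all assumptions made for illustration.

```python
import random

# Toy data (hypothetical): genome -> set of gene clusters it carries.
genomes = {
    "g1": {"geneA", "geneB"},
    "g2": {"geneA", "geneC"},
    "g3": {"geneA", "geneC", "geneD", "geneE"},
}

def new_genes_per_step(genomes, order):
    """Number of not-yet-seen genes added at each position of `order`."""
    seen = set()
    counts = []
    for name in order:
        counts.append(len(genomes[name] - seen))
        seen |= genomes[name]
    return counts

def sample_orders(genomes, n_iterations=10, seed=0):
    """Shuffle the genome order n_iterations times; collect counts per position."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    names = sorted(genomes)
    per_position = [[] for _ in names]
    for _ in range(n_iterations):
        order = names[:]
        rng.shuffle(order)
        for pos, count in enumerate(new_genes_per_step(genomes, order)):
            per_position[pos].append(count)
    return per_position

distributions = sample_orders(genomes)
for pos, values in enumerate(distributions, start=1):
    # Each position now holds a list of values, which is what a box plot summarizes.
    print(pos, min(values), max(values))
```

In this sketch each x-axis position corresponds to "the k-th genome added" rather than to one specific genome, so the spread in each box reflects the different random orderings.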
