Question

Importance Of Consistency Of Downstream Analysis Of Sequence Data

2

14.0 years ago

Pi ▴ 520

Greetings

When a lab sequences a population of individuals, is it typically the case that every individual is sequenced on the same platform (e.g. SOLiD or Illumina)? I am wondering whether there are cases where some individuals in a study are sequenced on a different platform.

I am assuming all individuals would be sequenced using the same instrument for consistency. But then you also have to assume all post-sequencing analysis is the same (e.g. the reads are assembled with the same pipeline) if you want consistency?

I ask because I am interested in whether it is considered 'acceptable' to treat the individuals in a population differently, given how this affects subsequent calculations such as allele and genotype frequencies.

I've never worked directly on a sequencing project; I have only been given the data to analyse after all this work has been done. The prior steps affect data quality (e.g. some pipelines are considered noisier than others), so they must affect the quality of the subsequent calculations. Or are the figures for allele and genotype frequencies just too imprecise for this to matter?

So, to summarise: how can you assess the quality of a variation study if not all of the data were gathered using the same protocol? Are there guidelines for this?

Thank you for your time.

Edit: Thanks for your answers. With regard to guidelines, I was also wondering whether there are guidelines for describing the pipeline used to sequence the data and the associated parameters/thresholds that may vary with the pipeline, or is it just a case of documenting it? If such guidelines don't exist, is there a need for them among the community, or is general documentation sufficient?

sequence variation quality • 3.1k views

ADD COMMENT • link updated 13.7 years ago by lh3 33k • written 14.0 years ago by Pi ▴ 520

score 4 · Answer 1 · 2011-04-26

4

14.0 years ago

Casey Bergman 18k

Unless there is substantial platform-dependent systematic sequencing error (as a hypothetical example, SOLiD preferentially generating a high rate of C->T mistakes), my intuition is that the variance in the evolutionary process due to the random effects of mutation, genetic drift, and sampling will lead to greater variation than that introduced by sequencing individuals on different platforms. This is of course speculation and would need to be evaluated empirically. As a start, you could read the Harismendy et al. paper carefully to see whether anything in it suggests platform-specific systematic errors that are greater than the mutation rate of your organism.
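One rough way to start that empirical evaluation is to put the two quantities side by side. The sketch below is illustrative only: it assumes a simple binomial sampling model, and the per-allele false-positive and false-negative call rates are hypothetical parameters, not measured values for any platform.

```python
import math


def sampling_se(p, n_chromosomes):
    """Binomial standard error of an allele frequency p estimated
    from n_chromosomes sampled chromosomes."""
    return math.sqrt(p * (1.0 - p) / n_chromosomes)


def platform_shift(p, false_pos_rate, false_neg_rate):
    """Expected shift in the observed allele frequency when calls are wrong.

    false_pos_rate: chance a reference allele is called as the variant
    false_neg_rate: chance a variant allele is called as reference
    Both rates are hypothetical inputs for this sketch.
    """
    observed = p * (1.0 - false_neg_rate) + (1.0 - p) * false_pos_rate
    return observed - p
```

For example, `sampling_se(0.1, 200)` gives the random uncertainty from sampling 100 diploid individuals, and `platform_shift(0.1, 0.01, 0.0)` gives the systematic shift from a 1% false-positive rate. If the shift is small next to the standard error, the platform effect is unlikely to dominate at that site; note that the shift, unlike the standard error, does not shrink as more individuals are added.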

ADD COMMENT • link 14.0 years ago by Casey Bergman 18k

1

Good point: human variability may be higher than the differences introduced by different platforms. I'll have to read about that, so thank you for the paper suggestion. Addendum: I would say that using the same software pipeline would then be mandatory, so as not to introduce further differences into the process.

ADD REPLY • link 14.0 years ago by Jorge Amigo 14k

0

@Jorge, I 100% agree about using the same analysis pipeline - at least there are some things that are in our control to standardize as bioinformaticians.

ADD REPLY • link 14.0 years ago by Casey Bergman 18k

score 2 · Answer 2 · 2011-04-26

Sure, using the same platform/protocol/pipeline would be desirable in order to minimize possible errors, but some big projects just can't afford it (take a look at 1000 Genomes, for instance). The normalization process then becomes the main challenge, the most important one in my honest opinion, since once you merge results from different platforms you lose track of which one was more error prone, which one was more lax on SNP calling, ...

As far as I know, there are no written guidelines on this matter yet. However, there is a general consensus among the people I've talked to who work with NGS data coming from different sources: at least the default thresholds for each platform must be raised slightly before putting all the results on the table, and some basic experimental variables, such as coverage, base quality or variant density, must be normalized.
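As a minimal sketch of what applying shared thresholds could look like, the snippet below runs one depth and quality cut-off over calls from every platform and reports how much each platform loses. The dictionary keys (`platform`, `depth`, `qual`) and the threshold values are assumptions made for illustration; they are not taken from any particular caller or from a published guideline.

```python
from collections import Counter

# Hypothetical shared thresholds; choose values that suit your own data.
MIN_DEPTH = 10
MIN_QUAL = 30.0


def passes_common_filter(call, min_depth=MIN_DEPTH, min_qual=MIN_QUAL):
    """True if a variant call meets the thresholds shared by all platforms."""
    return call["depth"] >= min_depth and call["qual"] >= min_qual


def filter_calls(calls, min_depth=MIN_DEPTH, min_qual=MIN_QUAL):
    """Keep only the calls that pass the shared thresholds."""
    return [c for c in calls if passes_common_filter(c, min_depth, min_qual)]


def kept_fraction_by_platform(calls, min_depth=MIN_DEPTH, min_qual=MIN_QUAL):
    """Fraction of calls retained per platform after the shared filter."""
    total = Counter(c["platform"] for c in calls)
    kept = Counter(c["platform"] for c in filter_calls(calls, min_depth, min_qual))
    return {platform: kept[platform] / total[platform] for platform in total}
```

If one platform retains a much smaller fraction of its calls than the others, that is a hint that coverage or quality scales are not yet comparable across platforms.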

EDIT: after reading Casey Bergman's answer I realized I may not have made myself completely clear. It is true that differences introduced by different platforms may not be as significant as the intrinsic differences among samples due to human variability (this makes sense to me, so I'll fetch some readings like the one suggested by Casey), but the way those variants are detected may vary depending on the software used. My suggestion is to use the same software pipeline for all platforms, so that you can at least be confident in the algorithms and in the stringency applied to the results, which would be shared across all platforms. Since not all mapping and variant calling tools work with raw data from every platform, you will probably have to put some effort into converting the raw data into a common format (FASTQ, for instance) and do some quality checks to make sure things have gone fine at the lab. After that you should be able to process everything with relative confidence.
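Once everything is in FASTQ, a first quality check can be as simple as the mean base quality per read. This is a rough sketch, assuming plain four-line FASTQ records and Phred+33 quality encoding; some older Illumina pipelines used an offset of 64, so check which encoding your files use before trusting the numbers.

```python
def read_fastq(lines):
    """Yield (name, sequence, quality_string) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError("expected a header line starting with '@'")
        seq = next(it, None)
        plus = next(it, None)  # the '+' separator line, content not used
        qual = next(it, None)
        if seq is None or plus is None or qual is None:
            raise ValueError("truncated FASTQ record")
        seq, qual = seq.strip(), qual.strip()
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string (offset=33 is an assumption)."""
    return sum(ord(ch) - offset for ch in qual) / len(qual)
```

Typical usage would be to open each file with `open(path)` and pass the file object to `read_fastq`, then compare the distribution of `mean_phred` values between platforms before mapping.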

score 1 · Answer 3 · 2011-04-26

1

14.0 years ago

lh3 33k

ALWAYS try to use the same technology for consistent results. While there are fluctuations between individuals, those are unbiased. Artifacts caused by using different technologies are biased, which is far more harmful. I would not recommend the Harismendy et al. paper. It was good at the time of publication, but is now outdated. I have seen a couple of papers/manuscripts misled by it.
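If mixing technologies cannot be avoided, one simple check for this kind of bias is to test, site by site, whether the allele counts differ between the groups sequenced on each platform. The sketch below uses a plain 2x2 chi-square test with one degree of freedom. It assumes the two groups are drawn from the same population, and it applies no multiple-testing correction, so treat it as a screening aid rather than a definitive method.

```python
import math


def chi_square_2x2(alt_a, ref_a, alt_b, ref_b):
    """Chi-square test (1 d.f.) on allele counts from platforms A and B.

    Returns (chi2, p_value). A small p-value at many sites suggests
    a platform-associated difference in allele frequency.
    """
    n = alt_a + ref_a + alt_b + ref_b
    row_a, row_b = alt_a + ref_a, alt_b + ref_b
    col_alt, col_ref = alt_a + alt_b, ref_a + ref_b
    if 0 in (row_a, row_b, col_alt, col_ref):
        return 0.0, 1.0  # degenerate table: nothing to compare
    chi2 = n * (alt_a * ref_b - alt_b * ref_a) ** 2 / (
        row_a * row_b * col_alt * col_ref
    )
    # For one degree of freedom the upper tail is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

For instance, `chi_square_2x2(30, 70, 10, 90)` compares 30 variant alleles out of 100 on one platform with 10 out of 100 on the other. With small counts an exact test would be the safer choice.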

ADD COMMENT • link 14.0 years ago by lh3 33k

0

Can you provide evidence to support your claim of misleading results based on the Harismendy et al paper?

ADD REPLY • link 14.0 years ago by Casey Bergman 18k

0

I reviewed two manuscripts that used the data set in the Harismendy et al. paper. Both got rejected (not for using the data in Harismendy et al., of course). One manuscript assumed the base error rate in Harismendy et al. is typical, but that error rate is quite high by today's standards. The other manuscript assumed targeted sequencing and whole genome resequencing lead to the same results. Note that in both cases, there is nothing wrong with Harismendy et al. itself. It is just no longer the most representative data set.

ADD REPLY • link 14.0 years ago by lh3 33k

0

Thanks, Heng. Is there a better publication on error rates in NGS? Evaluation Of High Throughput Sequencing Error Rates ?

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 14.0 years ago by Casey Bergman 18k