Question

New graduate student in Computational Biology, confused by reads versus copy number!

3.1 years ago

Crystal • 0

I'm trying to estimate the average number of mitochondrial genomes per cell in a sample. This is the Drosophila melanogaster genome. I'm using samtools, and I'm running this command:

samtools view -c -q20 mapped-sorted.bam mitochondrion_genome

to count how many reads map to the mitochondrial genome with a mapping quality of at least 20.

I'm assuming the number of reads is equal to the number of mitochondrial genomes. My only concern is whether we need to account for read pairs. And how would we do that if that were the case?

samtools BAM SAM BWA • 1.1k views

ADD COMMENT • link updated 22 months ago by Ram 45k • written 3.1 years ago by Crystal • 0

score 1 · Answer 1 · 2022-02-27

3.1 years ago

Istvan Albert 102k

To estimate copy number changes you'll need to relate the number of reads that come from the unique regions to the number that come from the non-unique regions.

Imagine it like so: when the overall coverage is N, every DNA region produces reads that cover each base N times (on average).

If some of the DNA regions are present K times, each copy produces reads with N coverage. But if the reference genome lists that region only a single time, then all that data, producing K*N coverage, will "look like" it comes from that single region.

Hence the coverage over the region present K times will be K*N whereas the unique regions will have an average coverage of N.
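
As a rough sketch of that ratio (not a full copy-number method), K can be estimated by dividing the mean coverage over the region of interest by the mean coverage over regions known to be unique. The function name and the coverage values below are made up for illustration.

```python
def estimate_copies(region_coverage: float, unique_coverage: float) -> float:
    """Estimate K, the number of copies of a region, as (K*N) / N."""
    if unique_coverage <= 0:
        raise ValueError("unique_coverage must be positive")
    return region_coverage / unique_coverage


# Hypothetical values: unique regions covered 30x, region of interest covered 120x.
k = estimate_copies(region_coverage=120.0, unique_coverage=30.0)
print(f"estimated copies: {k:.1f}")
```

The estimate is only as good as the two coverage values that go into it: mapping biases, PCR duplicates and uneven coverage can all shift the ratio.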

ADD COMMENT • link 3.1 years ago by Istvan Albert 102k

So the length of my mitochondrial genome is 19,303 bp. Using

samtools view -c -f16 mappedsubset-sorted.bam mitochondrion_genome

I get 370 reads for the reverse strand, and using

samtools view -c -F16 mappedsubset-sorted.bam mitochondrion_genome

I get 359 reads for the forward strand.

That makes 729 reads in total. So what you're saying is to divide the mitochondrial length by the number of reads to get an approximation of how much of the genome is covered? Sorry, I'm not quite at the level you're at!

ADD REPLY • link 3.1 years ago by Crystal • 0

Forward and reverse alignments are not directly related to copy number variation.

I was specifically talking about reads that come from regions that are unique vs regions that are present with multiple (and unknown) copies. In that case, that unknown number of copies may be inferred from the coverage.
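
To make "coverage" concrete: a read count turns into an average depth once it is multiplied by the read length and divided by the length of the region. Here is a minimal sketch using the 729 reads and the 19,303 bp length quoted above. The 100 bp read length is an assumption (it is not stated in this thread), and the function name is hypothetical.

```python
def mean_depth(read_count: int, read_length: int, region_length: int) -> float:
    """Average number of times each base of the region is covered."""
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    return read_count * read_length / region_length


# 729 reads and 19,303 bp come from the thread; the 100 bp read length is ASSUMED.
mito_depth = mean_depth(read_count=729, read_length=100, region_length=19_303)
print(f"mean mitochondrial depth: {mito_depth:.2f}x")
```

On its own this depth says nothing about copies per cell. It only becomes informative when compared with the depth over a region known to be unique in the same library.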

If you are brand new to the field I would suggest learning more about the basics of short-read alignments.

Copy number variation is one of the more difficult subjects because the interpretation of the data is more subjective and may be beset by all kinds of challenges.

ADD REPLY • link 3.1 years ago by Istvan Albert 102k

score 1 · Answer 2 · 2022-02-27

"I'm assuming the reads are equal to the number of genomes there are"

Most definitely not. Don't worry about counting pairs, because you will be off by much more than two-fold.

It partially depends on how the sequencing library was generated. If the method involved PCR, that would inflate the number of reads relative to the number of fragments that went into the prep. More importantly, your lab method would start with DNA extraction from thousands to millions of cells (again, depending on the protocol), and with every step (e.g. bead purification) you would lose some of the original material. At many steps you will also measure the DNA concentration and continue with only the appropriate mass. And not all of the library that is generated is then sequenced.

So no, the number of reads is by no means a useful approximation of the number of DNA copies you originally had.