Question

Only one cluster in multiple tumor phylogeny samples

1

Entering edit mode

8.6 years ago

ceruleanivy ▴ 50

I would like to create phylogenetic trees for 50 patients with breast cancer based purely on SNP VAFs as I don't have copy number variations. Although I've been having several issues with patients having less than 10 mutations, there are many examples with ostensible VAF differences where sciClone keeps returning just one cluster. Here's an example https://www.sendspace.com/filegroup/Wn8S9o8HzjkcUaLzTpR%2BIQ

Do you have any ideas on what is the default/lowest minimum depth of the software ? I've also read in another paper (PhyloSub) that choosing another platform might increase potential number of clusters you get, so I am interested in anything that someone has to suggest that works on the above data.

R next-gen sequencing SNP sciClone • 3.3k views

ADD COMMENT • link updated 8.6 years ago by Chris Miller 22k • written 8.6 years ago by ceruleanivy ▴ 50

score 2 · Answer 1 · 2016-04-28

2

Entering edit mode

8.6 years ago

Chris Miller 22k

First of all, your variant files need to be reformatted. VAFs should be expressed as percentages (0-100), not fractions (0-1). That's not helping things

There are several other oddities about your variant files:

1) There's no way to have a decimal number of reference-supporting reads

17  41209079  1.791 195 0.1

2) Why do you have exactly 1000 reads for all of these sites? Your variant lists need to be merged, such that you get readcounts for each variant in each bam, regardless of whether it was called or not. You may have (for example) a cluster that's present at 2-10% in one sample and at 35% in the other, but wouldn't have been identified in the former, since callers have limited low-VAF sensitivity.

17  41276047  1000  0 0
17  41197799  1000  0 0
17  41209140  1000  0 0

3) Take a look at this image from a recent paper of ours (http://www.cell.com/cell-systems/fulltext/S2405-4712(15)00113-1). The downsampling here gives you a pretty good idea of how depth affects the ability to distinguish individual clusters

ADD COMMENT • link 8.6 years ago by Chris Miller 22k

0

Entering edit mode

Thanks for the response, for some reason my obsolete software for mutation calling gives me wrong outputs. Fixed that and the program finally gave me 3 clusters. Apart from the computational time, there are some patients with six samples (bilateral primary with metastatic sites) where I cannot figure how to assign clusters across all the 5-6 files and plots are not available for more than 3 VAF entries.

ADD REPLY • link 8.6 years ago by ceruleanivy ▴ 50

0

Entering edit mode

Your variant lists need to be merged, such that you get readcounts for each variant in each bam, regardless of whether it was called or not.

So what you are saying is that if I have 800 variants in the primary tumor, and 1000 in the relapse tumor, I have to merge them so that I end up with 1000 variants in both primary and tumor? Then I need to include the number of reads for the samples where the variants are not present, I suppose it is not enough to say Variant allelic frequency = 0 at that position.

ADD REPLY • link 8.2 years ago by cbst ▴ 160

0

Entering edit mode

Exactly. There may be variants present at 3% that were not called in one samples, but since you're able to see them clearly at 40% the other tumor, you can infer that they were present.

ADD REPLY • link 8.2 years ago by Chris Miller 22k

0

Entering edit mode

what tools are you using for getting readcounts for each variant in each bam? I saw bam-readcount https://github.com/genome/bam-readcount

ADD REPLY • link 8.1 years ago by Ming Tommy Tang ★ 4.5k

3

Entering edit mode

I used samtools mpileup and provided the list of positions based on the merged vcf files for the two samples.

samtools mpileup -f myreferencefastafile.fasta -s merged_variants_file.txt path/to/bam/file.bam -o output.vcf -v -t DP4

DP4 is a tag to access the number of reads in the output file. It gives the number of reference reads in the forward strand, number of reference reads in the reverse strand, number of alternative reads in forward strand, and number of alternative reads in reverse strand.

I used these numbers in the postprocessing to generate the new vcf files for my individivual samples.

Edit: DP4 is a tag in an older version of samtools (1.1), but if you read the documentation you should figure out how to extract the number of reads.

ADD REPLY • link 8.1 years ago by cbst ▴ 160

0

Entering edit mode

Thanks, this is very helpful

ADD REPLY • link 8.1 years ago by Ming Tommy Tang ★ 4.5k

0

Entering edit mode

for the mpileup output

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE
1       10234   .       C       T,<*>                PL:DP:SP:ADF:ADR:AD     0,2,171,69,183,207:27:10:11,4,0:12,0,0:23,4,0

what's does the <*> mean? and the PL has 6 combinations.

Thanks.

ADD REPLY • link 8.1 years ago by Ming Tommy Tang ★ 4.5k

1

Entering edit mode

We typically use bam-readcount, but any tool that gives you the desired info is fine.

ADD REPLY • link 8.1 years ago by Chris Miller 22k