I would like to create phylogenetic trees for 50 patients with breast cancer based purely on SNP VAFs as I don't have copy number variations. Although I've been having several issues with patients having less than 10 mutations, there are many examples with ostensible VAF differences where sciClone keeps returning just one cluster. Here's an example
https://www.sendspace.com/filegroup/Wn8S9o8HzjkcUaLzTpR%2BIQ
Do you have any ideas on what is the default/lowest minimum depth of the software ? I've also read in another paper (PhyloSub) that choosing another platform might increase potential number of clusters you get, so I am interested in anything that someone has to suggest that works on the above data.
First of all, your variant files need to be reformatted. VAFs should be expressed as percentages (0-100), not fractions (0-1). That's not helping things
There are several other oddities about your variant files:
1) There's no way to have a decimal number of reference-supporting reads
17 41209079 1.791 195 0.1
2) Why do you have exactly 1000 reads for all of these sites? Your variant lists need to be merged, such that you get readcounts for each variant in each bam, regardless of whether it was called or not. You may have (for example) a cluster that's present at 2-10% in one sample and at 35% in the other, but wouldn't have been identified in the former, since callers have limited low-VAF sensitivity.
Thanks for the response, for some reason my obsolete software for mutation calling gives me wrong outputs. Fixed that and the program finally gave me 3 clusters. Apart from the computational time, there are some patients with six samples (bilateral primary with metastatic sites) where I cannot figure how to assign clusters across all the 5-6 files and plots are not available for more than 3 VAF entries.
Your variant lists need to be merged, such that you get readcounts for each variant in each bam, regardless of whether it was called or not.
So what you are saying is that if I have 800 variants in the primary tumor, and 1000 in the relapse tumor, I have to merge them so that I end up with 1000 variants in both primary and tumor? Then I need to include the number of reads for the samples where the variants are not present, I suppose it is not enough to say Variant allelic frequency = 0 at that position.
Exactly. There may be variants present at 3% that were not called in one samples, but since you're able to see them clearly at 40% the other tumor, you can infer that they were present.
DP4 is a tag to access the number of reads in the output file. It gives the number of reference reads in the forward strand, number of reference reads in the reverse strand, number of alternative reads in forward strand, and number of alternative reads in reverse strand.
I used these numbers in the postprocessing to generate the new vcf files for my individivual samples.
Edit: DP4 is a tag in an older version of samtools (1.1), but if you read the documentation you should figure out how to extract the number of reads.
Thanks for the response, for some reason my obsolete software for mutation calling gives me wrong outputs. Fixed that and the program finally gave me 3 clusters. Apart from the computational time, there are some patients with six samples (bilateral primary with metastatic sites) where I cannot figure how to assign clusters across all the 5-6 files and plots are not available for more than 3 VAF entries.
So what you are saying is that if I have 800 variants in the primary tumor, and 1000 in the relapse tumor, I have to merge them so that I end up with 1000 variants in both primary and tumor? Then I need to include the number of reads for the samples where the variants are not present, I suppose it is not enough to say Variant allelic frequency = 0 at that position.
Exactly. There may be variants present at 3% that were not called in one samples, but since you're able to see them clearly at 40% the other tumor, you can infer that they were present.
what tools are you using for getting readcounts for each variant in each bam? I saw bam-readcount https://github.com/genome/bam-readcount
I used samtools mpileup and provided the list of positions based on the merged vcf files for the two samples.
samtools mpileup -f myreferencefastafile.fasta -s merged_variants_file.txt path/to/bam/file.bam -o output.vcf -v -t DP4
DP4 is a tag to access the number of reads in the output file. It gives the number of reference reads in the forward strand, number of reference reads in the reverse strand, number of alternative reads in forward strand, and number of alternative reads in reverse strand.
I used these numbers in the postprocessing to generate the new vcf files for my individivual samples.
Edit: DP4 is a tag in an older version of samtools (1.1), but if you read the documentation you should figure out how to extract the number of reads.
Thanks, this is very helpful
for the mpileup output
what's does the <*> mean? and the PL has 6 combinations.
Thanks.
We typically use bam-readcount, but any tool that gives you the desired info is fine.