Dear All,
I have a list of SNPs from different samples, which present common and different SNPs in respect to a common ancestor. I'm interest to create concatenate files of the SNPs for phylogenetic studies. The problem is that I have samples that don't present a SNPs in a particular position and I need to fill this gap with the reference nucleotide so I obtain perfect concatenate files. This is because if not phylogenetic programs ignore complete this position. I have done it through lockup in excel but when you have more than 30,000 is quite time consuming. Some one knows a better way to do it?
When I extract my snps from vcf files into excel they look like this
for one isolate, and for following isolates the positions of the snps can differ
If I do a multi-alignment in Mega it looks like below, were there is not snp it will be a gap. This reduce my analysis as mega ignore columns with gaps. I need to fill the gap with the ref nucleotide in a better way that lock up in excel. Any ideas?
So, you can probably skip the excel stage (always a good idea!) and use one of these solutions New Fasta Sequence From Reference Fasta And Variant Calls File?
Thanks David. The link have a variety of solutions that by sure will help me.
I try the solutions in the link, and they are good, but it finishing giving me the whole genome. If I try to do multi alignment and run in beast it will take forever. Thus, I was using only a list of positions from my reference and then find it them in the vcf files from my samples, which imply that I still don't have a better way to do it than in excel. Any other suggestions. Thanks.
You have a few options that I can see:
Take your massive alignment and drop non-variables sites (some MSA viewers do this, it wouldn't be hard to script it if you're dealing with very large genomes)
Use a "combine variants" tools from bcftools/gatk/picard to make a single VCF for all your variants, extract the sites from the vcf
Use bedtools intersect to create a set of polymorphic sites, use the resulting bed file to extract only the variable sites from the reference genome
Which is best will depend on what you've already done and what tools you are confident with, but it should be possible