Question

Concatenate SNPs for phylogenetics

10.7 years ago

e.martinez ▴ 10

Dear All,

I have lists of SNPs from different samples. Relative to a common ancestor (the reference), some SNPs are shared between samples and some are not. I'm interested in creating concatenated SNP files for phylogenetic studies. The problem is that some samples don't have a SNP at a particular position, and I need to fill that gap with the reference nucleotide so that I obtain complete concatenated files. Otherwise, phylogenetic programs ignore that position completely. I have done it with lookups in Excel, but with more than 30,000 of them it is quite time-consuming. Does anyone know a better way to do it?

SNP alignment • 5.8k views

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by e.martinez ▴ 10

Can you show us what your SNP list looks like? It sounds like you need to write a simple script, but users will need details to provide a helpful answer.

ADD REPLY • link 10.7 years ago by David W 4.9k

When I extract my SNPs from VCF files into Excel, they look like this:

1849      SNV     C     A
2532      SNV     C     T
9143      SNV     C     T
11820     SNV     C     G

for one isolate. For the following isolates, the positions of the SNPs can differ:

2532      SNV     C     T
2586      SNV     G     T
9143      SNV     C     T
11370     SNV     C     T

If I do a multiple alignment in MEGA it looks like the alignment below: where there is no SNP, there is a gap. This reduces my analysis, as MEGA ignores columns with gaps. I need to fill the gaps with the reference nucleotide in a better way than lookups in Excel. Any ideas?

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by e.martinez ▴ 10

ACACAGGGGCCCGCGAACCCAGCGCGGCCAA
ACACAGGG-CTCGCGAAGCCAGCGCTGCCAA
ACAGAGGGTCCCGCGAAGCCAGCGCTGCCAA
ACACAGGGTCTCGCGAACCCAG-GCTGCCAA

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by e.martinez ▴ 10

So, you can probably skip the Excel stage (always a good idea!) and use one of the solutions in "New Fasta Sequence From Reference Fasta And Variant Calls File?"

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by David W 4.9k

Thanks David. The link has a variety of solutions that will surely help me.

ADD REPLY • link 10.7 years ago by e.martinez ▴ 10

I tried the solutions in the link, and they are good, but they end up giving me the whole genome. If I try to do a multiple alignment of those and run it in BEAST, it will take forever. So I went back to using only a list of positions from my reference and then finding them in the VCF files from my samples, which means I still don't have a better way to do it than in Excel. Any other suggestions? Thanks.
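
The position lookup described here (collect every SNP position seen in any sample, then fill in the reference nucleotide wherever a sample has no call) can be sketched in a few lines of Python. This is only a minimal sketch under some assumptions: the input is the four-column text shown earlier in the thread (position, type, ref, alt), all variants are single-nucleotide, and a missing entry really means "same as reference" rather than "no coverage". The function names and the labels `isolate_1` and `isolate_2` are made up for illustration.

```python
def parse_snp_table(text):
    """Parse lines like '1849  SNV  C  A' into {position: (ref, alt)}."""
    snps = {}
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        pos, _kind, ref, alt = fields[:4]
        snps[int(pos)] = (ref, alt)
    return snps


def concatenate_snps(samples):
    """samples: {name: {position: (ref, alt)}}.

    Returns the sorted union of positions and one concatenated sequence
    per sample (plus the reference), using the reference nucleotide
    wherever a sample has no SNP at a position.
    """
    # Reference allele for every position seen in at least one sample.
    ref_at = {}
    for snps in samples.values():
        for pos, (ref, _alt) in snps.items():
            ref_at.setdefault(pos, ref)

    positions = sorted(ref_at)
    seqs = {"reference": "".join(ref_at[p] for p in positions)}
    for name, snps in samples.items():
        # ASSUMPTION: no entry for a position means the sample matches the reference.
        seqs[name] = "".join(
            snps[p][1] if p in snps else ref_at[p] for p in positions
        )
    return positions, seqs


# The two example isolates from this thread (labels are hypothetical).
isolate_1 = parse_snp_table("""
1849      SNV     C     A
2532      SNV     C     T
9143      SNV     C     T
11820     SNV     C     G
""")
isolate_2 = parse_snp_table("""
2532      SNV     C     T
2586      SNV     G     T
9143      SNV     C     T
11370     SNV     C     T
""")

positions, seqs = concatenate_snps({"isolate_1": isolate_1, "isolate_2": isolate_2})

# Write the result as FASTA to standard output.
for name, seq in seqs.items():
    print(f">{name}\n{seq}")
# >reference
# CCGCCC
# >isolate_1
# ATGTCG
# >isolate_2
# CTTTTC
```

The output contains one column per variable position only, so it stays small even for a large genome. If some positions may be uncovered in some samples, it would be safer to emit `N` or `-` there instead of the reference nucleotide, which needs depth information that this table does not carry.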

ADD REPLY • link 10.7 years ago by e.martinez ▴ 10

You have a few options that I can see:

Take your massive alignment and drop the non-variable sites (some MSA viewers do this, and it wouldn't be hard to script if you're dealing with very large genomes)

Use a "combine variants" tool from bcftools/GATK/Picard to make a single VCF for all your variants, then extract the sites from that VCF

Use bedtools intersect to create a set of polymorphic sites, then use the resulting BED file to extract only the variable sites from the reference genome

Which is best will depend on what you've already done and which tools you are confident with, but it should be possible.
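
As a rough illustration of the first option, dropping non-variable sites from an existing alignment could look like the Python sketch below. It assumes the alignment is already loaded as a dictionary of equal-length strings; the function name `variable_columns` and the toy sequences are invented for the example, and a gap character is simply treated as one more state.

```python
def variable_columns(alignment):
    """alignment: {name: aligned sequence}, all sequences the same length.

    Returns the indices of columns where at least two sequences differ,
    and the alignment reduced to those columns.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")

    # Keep a column only if it contains more than one distinct character.
    keep = [
        i for i in range(length)
        if len({alignment[n][i].upper() for n in names}) > 1
    ]
    reduced = {n: "".join(alignment[n][i] for i in keep) for n in names}
    return keep, reduced


# Toy alignment (made-up data, not from the thread).
toy = {
    "ref": "ACGTACGT",
    "s1":  "ACGAACGT",
    "s2":  "ACGTACCT",
}
keep, reduced = variable_columns(toy)
print(keep)     # [3, 6]
print(reduced)  # {'ref': 'TG', 's1': 'AG', 's2': 'TC'}
```

For whole-genome alignments the sequences would normally be read from a FASTA file rather than typed in, but the column filter itself stays the same.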

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by David W 4.9k

Ram · Answer 1 · 2014-07-31

10.7 years ago

Cytosine ▴ 460

If the samples are really closely related (e.g. strains of the same organism), why not try a multiple sequence alignment on the concatenated SNP sequences, without filling the gaps with reference nucleotides?

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by Cytosine ▴ 460

I agree with the response from Cytosine. I would add that it might be useful to add a stretch of reference sequence to each end of the concatenated SNPs in order to guide the alignment and avoid overhangs or unaligned portions.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by Larry_Parnell 16k