Question

How to convert Plink's ped/map files into nexus format?

0

Entering edit mode

10.5 years ago

A.Illum • 0

Dear all,

I need to make a phylogenetic network using unphased Illumina genotyping data stored in Plink's ped/map format. Phylogenetic programs like SplitsTree require input files to be in nexus format. How can I convert ped/map into nexus format? Should I also phase the data?

I am a beginner in this field so any help much appreciated.

Ann

alignment next-gen phylogeny sequencing • 5.8k views

ADD COMMENT • link updated 21 months ago by gerardomartjau • 0 • written 10.5 years ago by A.Illum • 0

Ram · Answer 1 · 2014-09-10

Hi,

Here is a python function that converts a ped file to a nexus file as used by SNAPP. I am not sure about the specific requirements of SplitTree, but you can try.

It assumes that your ped is encoded with 12 instead of ACGT. You obtain such a ped from a standard ped with

plink --file <in_filename> --recode12 --out <out_filename>

then run within python:

ped_to_nexus(ped_fn, nexus_fn, n_individuals=<number of individuals>)

The function is attached below.

Best,
Hannes

def ped_to_nexus(ped_fn, nexus_fn, individual_names=None, n_individuals=None,
                                 max_name_len = 12, comments=None,dominant=False):
    """
    Write snps from ped file to nexus alignment file.
    Attention: The bed file has to be encoded with:
                0.. missing
                1.. first allele 
                2.. second allele
            Such a ped file can be obtained from a regular ped file
            with "plink --file <in_filename> --recode12 --out <out_filename>"
    This function assumes that individuals are diploid. 
    The two haploid columns in the ped become a single genotype (0/1/2) 
        in the nexus file.
    -------Parameters---------
    individual_names... list of individual (or taxa) names
                        if not supplied, the names from the bed file are used
    n_individuals... number of individuals (ntax in the nexus format)
                     (if not supplied, the length of indiviudal_names is used)
    max_name_len... maximal number of characters of individual names, 
                    used to determine intent of alignments
    comments... list of strings that are added to the nexus header as comments
                alternatively, it can be a single string
    dominant... if True, the 0 allele is assumed to be dominant and the resulting
                nexus file is encoded 0/1 instead of 0/1/2
    """
     #
    assert individual_names is not None or n_individuals is not None,\
            "specify either individual names or the total number of individuals" 
    if individual_names is not None and n_individuals is not None:
        assert len(individual_names) == n_individuals,\
                "number of individual names must equal argument n_individuals"
    if n_individuals is None:
        n_individuals = len(individual_names)
    if individual_names is None:
        individual_names = [None]*n_individuals

    with open(ped_fn,"r") as bf, open(nexus_fn,"w") as nf:
        nf.write("#NEXUS\n")
        if comments is not None:
            if not isinstance(comments, basestring):
                for comment in comments:
                    nf.write("[{}]\n".format(comment))
            else:
                nf.write("[{}]\n".format(comments))

        for i,(line,name) in enumerate(zip(bf,individual_names)):
            if i >= n_individuals:
                break
            fields = line.split()
            if name is None:
                name = fields[0]
            n_whitespace = max_name_len + 1 - len(name)
            assert n_whitespace >= 1, "max taxon name length is " \
                                    + str(max_name_len) + " but name is " + name
            data = fields[6:]

            if i == 0:
                if dominant:
                    symbols = "01"
                else:
                    symbols = "012"
                nf.write("\nBegin data;\n"
                 "        Dimensions ntax={} nchar={};\n"
                 '        Format datatype=integerdata symbols="{}" missing=? gap=-;\n'
                 "        Matrix\n".format(n_individuals,len(data)/2,symbols))
            nf.write(name)
            nf.write(" " * n_whitespace)
            for i in xrange(0,len(data),2):
                h0 = int(data[i])
                h1 = int(data[i+1])
                if h0!=0 and h1!=0:
                    gt = h0 + h1 - 2
                    if dominant:
                        #assume 0 allele is dominant
                        gt = int(gt/2)
                    gt = str(gt)
                else:
                    gt = "?"
                nf.write(gt)
            nf.write("\n")
        nf.write("        ;\nEnd;")