Getting data into Plink! and TreeMix
5
1
Entering edit mode
10.2 years ago
hermathena ▴ 40

Dear All,

I am trying to get non-model data into the software TreeMix. I have data from multiple species, 1-10 individuals for each. I start with a VCF of biallelic SNPs from GATK and generate Plink input with vcftools:

vcftools --vcf exon.biallelicSNPs.vcf --plink --out treemix.plink --keep some_individuals.txt

I then run Plink, using a cluster file that assigns individuals to their species:

plink --file treemix.plink --freq --noweb --missing --within treemix.data.clust

Several warnings like this are produced - not sure what they really mean:

Duplicate marker name found: [ 1347331 ]

Anyway, there is a plink.frq.strat output file. I gzip that and use the auxiliary script of TreeMix to make their input files. I get the error:

Traceback (most recent call last):
  File "plink2treemix.py", line 24, in <module>
    total = line[7]
IndexError: list index out of range

Hmmmm. Really not sure why! If anyone has experience of getting data into treeMix, especially for non-human taxa where the earlier steps may have been the problem, I would be very grateful!!!

Many thanks,

genome next-gen SNP • 16k views
ADD COMMENT
1
Entering edit mode

hi, have you solved this problem ? I have same problem. Thank you

ADD REPLY
0
Entering edit mode

I have managed to use TreeMix. The way I did it is that I had illumina reads from different isolates of the same species, so I mapped each of my library to the same reference individually and called SNPs using the PoPoolation toolkit. I then had quick script to change the PoPoolation output to the TreeMix standard.

ADD REPLY
0
Entering edit mode

Hi all,

I have been the same problem, though the error is:

Traceback (most recent call last):
  File "./plink2treemix.py", line 23, in <module>
    mc = line[6]

Did anybody solve it?

Felipe

ADD REPLY
0
Entering edit mode

hi, have you solved this problem? i also have the same prolem. Could you give me some suggestions, Thank you!

ADD REPLY
1
Entering edit mode
7.8 years ago
hermathena ▴ 40

Hi, Plink v. 1.9 does a very good job. You can ask it to read in vcf and --make-bed:

/bin/plink2/plink --vcf input.vcf --make-bed --geno 0.25 --maf 0.01 --snps-only --out input.qual --allow-extra-chr --within my_clusters.clust

Unfortunately the resulting .bim file has periods instead of SNP names, which can be a problem for LD pruning etc. It can be easily fixed with awk:

awk '{$2 = $1":"$4; print}' input.bim > snp_names.bim

After that you can filter in Plink, and follow Joe Pickrell's instructions - make a frequencies file, re-format for TreeMix. Glitches like the one above can be fixed manually, occasionally there is a single incorrect line - presumably you'll have enough data that one removed SNP wouldn't matter (unless you are looking for associations, etc).

ADD COMMENT
0
Entering edit mode

Hi kmkozak

Can you detail your answer here by providing all the command line you've been using? I'm currently struggling to convert my bed/map files into Treemix inputs using the dedicated 'plink2treemix.py' software provided onto the Treemix webpage.

ADD REPLY
0
Entering edit mode
6.7 years ago
Gabriel R. ★ 2.9k

I posted this to another question, it is one line in glactools:

glactools bplink2acf --fai reference.fai [PLINK prefix] |  glactools acf2treemix -  |gzip > treemix.gz
ADD COMMENT
0
Entering edit mode
6.6 years ago

Hi,

I just posted a similar reply in researchgate. I will copy it here. Unfortunately I could not use glactools from Gabriel because I have no reference.fai (my vcf was generated with STACKS denovo).

So, I started with a .vcf file and converted it to the custom plink format. Treemix requires a specifically formatted file. The website says that in order to create this file you need to run:

$ plink --bfile data --freq --missing --within data.clust 
$ gzip plink.frq
$ plink2treemix.py plink.frq.gz treemix.frq.gz

The first codeline ($1) requires a .clst file. .clst files are generated with --write-cluster --family and it consists in three columns including 1. Family ID; 2. ID; 3. cluster. I used several awk commands to format the information properly.

The first codeline ($1) produces a .frq file which according to plink2's webiste includes:

I. CHR [Chromosome code]
II. SNP [Variant identifier]
III. A1 [Allele 1 (usually minor)]
IV. A2 [Allele 2 (usually major)]
V. MAF [Allele 1 frequency]
VI. NCHROBS [Number of allele observations]

So, here start the problems. If you open the provided python file:

$ nano plink2treemix.py

You will notice that the script requires that the file has 8 columns (remember that python starts counting in 0. mc = line[6], total = line[7] are columns 7 and 8 respectively). I think this might be a problem with the plink versions. I was using plink2 (v1.9) and the program was written when plink1 (v1.07) was around. So if you hit with the following error:

ERROR: treemix mc = line[6] indexerror: list index out of range

I expect it to be a mismatch between plinks versions.

So then I switched to plink1 and it complained that the SNPs had similar names. Because my data was generated with STACKS denovo, the SNPs do not have a position in the genome and <mydata.bim> had . for SNP. I used awk to solve this:

$ awk '{$2 = $1"_"$4; print}' mydata.bim > mydata.SNPrenamed.bim ################ so, we change the second column ($2) to be a combination including column 1, an underscore and column 4 ( $1"_"$4 ).

Plink kept complaining so I decided to take a new approach.

I noticed that the POPULATIONS mode as part of STACKS 1 had a treemix flag.

I created a population map for stacks, which is a file composed of two columns, the first one with individuals and the second with populations:

IND_01 * TAB * POP_A
IND_02 * TAB * POP_A
IND_100 * TAB * POP_C

So I just went on STACKS v1.48 and did:

/path/to/stacks-1.48/bin/populations --in_vcf mydata.vcf --treemix -O ./ -M pop_map.tsv

--in_vcf : input vcf
--treemix : generate a treemix output
-O : output folder
-M : population map

And it worked!

ADD COMMENT
1
Entering edit mode

Which is the correct output you get?

ADD REPLY
0
Entering edit mode
6.2 years ago

Hi,

I had the same problem. I began with my .vcf from STACKS 1.48, de novo pipeline.

I used vcftools to count alleles at each population (I have six), with this command:

vcftools --vcf batch_23001.vcf --counts --keep AG1 --out AG1 (x6 different populations)

AG1 after --keep is a popmap belonging to one population with this format:

Sample1
Sample2
...

Finally, with these vcftools outputs (.frq.count), I used LibreOffice Calc and gedit (text editor) to did the input for TREEMIX.

ADD COMMENT
0
Entering edit mode
2.8 years ago
strive • 0

Method 1:

  1. Get .tped file

$ vcftools --vcf file.vcf --plink-tped --out file

  1. Modify the first column in the .tfam file

    before

    file.tfam:

    sampe1 sampe1 0 0 0 0

    sampe2 sampe2 0 0 0 0

    after

    file.tfam:

    pop1 sampe1 0 0 0 0

    pop1 sampe2 0 0 0 0

  2. Define sample and population files(pop.list)

    pop1 sample1 pop1

    pop1 sample2 pop1

    pop2 sample3 pop2

  3. $ plink --tfile file --freq --within pop.list

  4. $ gzip plink.frq.strat

  5. $ python2 plink2treemix.py plink.frq.strat.gz treemix_in.gz

  6. run Treemix (treemix_in.gz is input file of Treemix)

Method 2

populations module in Stacks

$ populations -V file.vcf - O outpath -M pop --treemix

The results file is file.treemix, delete the comment on the first line and compress the file as a Treemix input file.

ADD COMMENT

Login before adding your answer.

Traffic: 1883 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6