Question

Getting data into Plink! and TreeMix

1

Entering edit mode

10.6 years ago

hermathena ▴ 40

Dear All,

I am trying to get non-model data into the software TreeMix. I have data from multiple species, 1-10 individuals for each. I start with a VCF of biallelic SNPs from GATK and generate Plink input with vcftools:

vcftools --vcf exon.biallelicSNPs.vcf --plink --out treemix.plink --keep some_individuals.txt

I then run Plink, using a cluster file that assigns individuals to their species:

plink --file treemix.plink --freq --noweb --missing --within treemix.data.clust

Several warnings like this are produced - not sure what they really mean:

Duplicate marker name found: [ 1347331 ]

Anyway, there is a plink.frq.strat output file. I gzip that and use the auxiliary script of TreeMix to make their input files. I get the error:

Traceback (most recent call last):
  File "plink2treemix.py", line 24, in <module>
    total = line[7]
IndexError: list index out of range

Hmmmm. Really not sure why! If anyone has experience of getting data into treeMix, especially for non-human taxa where the earlier steps may have been the problem, I would be very grateful!!!

Many thanks,

genome next-gen SNP • 17k views

ADD COMMENT • link updated 3.2 years ago by strive • 0 • written 10.6 years ago by hermathena ▴ 40

1

Entering edit mode

hi, have you solved this problem ? I have same problem. Thank you

ADD REPLY • link 8.2 years ago by Chenfei Zheng ▴ 60

0

Entering edit mode

I have managed to use TreeMix. The way I did it is that I had illumina reads from different isolates of the same species, so I mapped each of my library to the same reference individually and called SNPs using the PoPoolation toolkit. I then had quick script to change the PoPoolation output to the TreeMix standard.

ADD REPLY • link 10.5 years ago by Adrian Pelin ★ 2.7k

0

Entering edit mode

Hi all,

I have been the same problem, though the error is:

Traceback (most recent call last):
  File "./plink2treemix.py", line 23, in <module>
    mc = line[6]

Did anybody solve it?

Felipe

ADD REPLY • link updated 6.3 years ago by Ram 45k • written 7.3 years ago by felipe_o_torquato • 0

0

Entering edit mode

hi, have you solved this problem? i also have the same prolem. Could you give me some suggestions, Thank you!

ADD REPLY • link 4.1 years ago by xmyang36 • 0

Ram · Answer 1 · 2017-03-03

Hi, Plink v. 1.9 does a very good job. You can ask it to read in vcf and --make-bed:

/bin/plink2/plink --vcf input.vcf --make-bed --geno 0.25 --maf 0.01 --snps-only --out input.qual --allow-extra-chr --within my_clusters.clust

Unfortunately the resulting .bim file has periods instead of SNP names, which can be a problem for LD pruning etc. It can be easily fixed with awk:

awk '{$2 = $1":"$4; print}' input.bim > snp_names.bim

After that you can filter in Plink, and follow Joe Pickrell's instructions - make a frequencies file, re-format for TreeMix. Glitches like the one above can be fixed manually, occasionally there is a single incorrect line - presumably you'll have enough data that one removed SNP wouldn't matter (unless you are looking for associations, etc).

score 0 · Answer 2 · 2018-04-26

0

Entering edit mode

7.1 years ago

Gabriel R. ★ 2.9k

I posted this to another question, it is one line in glactools:

glactools bplink2acf --fai reference.fai [PLINK prefix] |  glactools acf2treemix -  |gzip > treemix.gz

ADD COMMENT • link 7.1 years ago by Gabriel R. ★ 2.9k

Ram · Answer 3 · 2018-06-02

Hi,

I just posted a similar reply in researchgate. I will copy it here. Unfortunately I could not use glactools from Gabriel because I have no reference.fai (my vcf was generated with STACKS denovo).

So, I started with a .vcf file and converted it to the custom plink format. Treemix requires a specifically formatted file. The website says that in order to create this file you need to run:

$ plink --bfile data --freq --missing --within data.clust 
$ gzip plink.frq
$ plink2treemix.py plink.frq.gz treemix.frq.gz

The first codeline ($1) requires a .clst file. .clst files are generated with --write-cluster --family and it consists in three columns including 1. Family ID; 2. ID; 3. cluster. I used several awk commands to format the information properly.

The first codeline ($1) produces a .frq file which according to plink2's webiste includes:

I. CHR [Chromosome code]
II. SNP [Variant identifier]
III. A1 [Allele 1 (usually minor)]
IV. A2 [Allele 2 (usually major)]
V. MAF [Allele 1 frequency]
VI. NCHROBS [Number of allele observations]

So, here start the problems. If you open the provided python file:

$ nano plink2treemix.py

You will notice that the script requires that the file has 8 columns (remember that python starts counting in 0. mc = line[6], total = line[7] are columns 7 and 8 respectively). I think this might be a problem with the plink versions. I was using plink2 (v1.9) and the program was written when plink1 (v1.07) was around. So if you hit with the following error:

ERROR: treemix mc = line[6] indexerror: list index out of range

I expect it to be a mismatch between plinks versions.

So then I switched to plink1 and it complained that the SNPs had similar names. Because my data was generated with STACKS denovo, the SNPs do not have a position in the genome and <mydata.bim> had . for SNP. I used awk to solve this:

$ awk '{$2 = $1"_"$4; print}' mydata.bim > mydata.SNPrenamed.bim ################ so, we change the second column ($2) to be a combination including column 1, an underscore and column 4 ( $1"_"$4 ).

Plink kept complaining so I decided to take a new approach.

I noticed that the POPULATIONS mode as part of STACKS 1 had a treemix flag.

I created a population map for stacks, which is a file composed of two columns, the first one with individuals and the second with populations:

IND_01 * TAB * POP_A
IND_02 * TAB * POP_A
IND_100 * TAB * POP_C

So I just went on STACKS v1.48 and did:

/path/to/stacks-1.48/bin/populations --in_vcf mydata.vcf --treemix -O ./ -M pop_map.tsv

--in_vcf : input vcf
--treemix : generate a treemix output
-O : output folder
-M : population map

And it worked!

Ram · Answer 4 · 2018-10-17

Hi,

I had the same problem. I began with my .vcf from STACKS 1.48, de novo pipeline.

I used vcftools to count alleles at each population (I have six), with this command:

vcftools --vcf batch_23001.vcf --counts --keep AG1 --out AG1 (x6 different populations)

AG1 after --keep is a popmap belonging to one population with this format:

Sample1
Sample2
...

Finally, with these vcftools outputs (.frq.count), I used LibreOffice Calc and gedit (text editor) to did the input for TREEMIX.

score 0 · Answer 5 · 2022-03-18

Method 1：

Get .tped file

$ vcftools --vcf file.vcf --plink-tped --out file

Modify the first column in the .tfam file

before

file.tfam:

sampe1 sampe1 0 0 0 0

sampe2 sampe2 0 0 0 0

after

file.tfam:

pop1 sampe1 0 0 0 0

pop1 sampe2 0 0 0 0
Define sample and population files(pop.list)

pop1 sample1 pop1

pop1 sample2 pop1

pop2 sample3 pop2
$ plink --tfile file --freq --within pop.list
$ gzip plink.frq.strat
$ python2 plink2treemix.py plink.frq.strat.gz treemix_in.gz
run Treemix (treemix_in.gz is input file of Treemix)

Method 2

populations module in Stacks

$ populations -V file.vcf - O outpath -M pop --treemix

The results file is file.treemix, delete the comment on the first line and compress the file as a Treemix input file.