Export Allele Frequency Of All Alleles In The Whole Data For A Subsets Of Individuals
2
0
Entering edit mode
13.6 years ago
Jianfengmao ▴ 320

Background: I am new to genomics and have few programing skills. I depends on VCF and VCFTools for my genomic analysis.

I have my population genomic data in VCF format which is popular in human 1000 genome project. By using VCFTools, I can get allele frequency statistics for each locus and this allele frequency is summarized by all the individuals in the VCF file.

my question:
I want to prepare an input file for some population genetic software, like BayeScan (http://cmpg.unibe.ch/software/bayescan/index.html) to predicting selection along the genome. I need to export an allele frequency statistics for each populations (a group of individuals), and the allele frequency for the same loci (genomic position) should have the same alleles and in the same order.

But, the allele frequency from VCFTools do not satisfy my need. For some ZERO count allele, it outputs nothing. I need the allele frequency was output for all alleles in the locus for all the populations.

I would like to hearing your directions on my problem. All who depends on VCF will benefit from your suggestions.

Thanks in advance.

Example:

This is an example for my data, here euro, ice, norA are three populations. For the position, 1, 3, 32 there are frequency for two allele each in population euro, but only one for the other two populations. I need the Zero count alleles be output as 0 not nothing.

abt6-mao-mbp:BayeScan jianfengmao$ head euro_chr1.frq.count

CHROM POS NALLELES NCHR {ALLELE:COUNT}

1 1 2 9 T:8 A:1

1 3 2 9 A:8 G:1

1 32 2 9 C:7 G:2

1 69 1 9 A:9

1 75 1 9 G:9

1 87 1 9 C:9

1 88 1 9 T:9

1 116 1 9 A:9

1 141 2 9 C:0 T:9

abt6-mao-mbp:BayeScan jianfengmao$ head ice_chr1.frq.count

CHROM POS NALLELES NCHR {ALLELE:COUNT}

1 1 1 3 T:3

1 3 1 3 A:3

1 32 1 3 C:3

1 69 1 3 A:3

1 75 1 3 G:3

1 87 1 3 C:3

1 88 1 3 T:3

1 116 1 3 A:3

1 141 2 3 C:0 T:3

abt6-mao-mbp:BayeScan jianfengmao$ head norA_chr1.frq.count

CHROM POS NALLELES NCHR {ALLELE:COUNT}

1 1 1 6 T:6

1 3 1 6 A:6

1 32 1 7 C:7

1 69 1 6 A:6

1 75 2 6 G:5 A:1

1 87 1 6 C:6

1 88 1 6 T:6

1 116 1 6 A:6

1 141 2 7 C:0 T:7

vcf vcftools • 6.6k views
ADD COMMENT
3
Entering edit mode

Hi jianfengmao, this is your 9th questions on Biostar. Could you please validate (or comment) the answers you received for your previous questions ?

ADD REPLY
1
Entering edit mode

Also, several of your questions, including this one, can be easily solved by writing a script. As you have been in this field for at least several months, I would not call you a newcomer. You should really learn a whatever scripting language and solve the problem by yourself. This is your work.

ADD REPLY
0
Entering edit mode

Ih3, Your suggestion is just what I want and what I am doing now. I intend to do that by myself and I used all my time on that. And, I have gotten much till now. But, I have not learn everything I need. I have none training on computer and programming, I am sorry for my so many questions here, really sorry. I will try my best to do that by myself. I will learn programming in perl and python, like I started to learn R four years ago in an isolated environment in China. Thanks for what I have gotten from you all.

ADD REPLY
0
Entering edit mode

Pierre Lindenbaum, I will do what you have mentioned. Thanks for your kindness.

ADD REPLY
2
Entering edit mode
13.6 years ago
Joachim ★ 2.9k

Have a look at a scripting language that allows you to easily read your data files line-by-line. Then split the lines into columns and evaluate the data in those columns respectively. You can use a hash-map to keep track of the counts for each allele (allele names become keys, their accumulated frequency is the respective value).

For Python, it would look a bit like this:

Regarding the frequency count, you would use the hash-map/dictionary as follows (I call the Python dictionary frequencies):

  • you can check whether you have stored a frequency for an allele x already using if _x_ in frequencies:
  • if you have a frequency for that allele stored already, do accumulated_count = frequencies[_x_] followed by frequencies[_x_] = accumulated_count + count_of_this_line
  • if there is no frequency stored for that allele yet, then do frequency[_x_] = count_of_this_line
ADD COMMENT
0
Entering edit mode

Joachim, Thanks for your kindness. I will try to do that following you.

ADD REPLY
0
Entering edit mode

Dear Joachim, I checked what you have written for me. Thanks for your patience for me. Though I have begun to learn programming, I still can not do such a simple thing by myself. You have given me a good directions. Thanks a lot.

ADD REPLY
1
Entering edit mode
13.6 years ago
lh3 33k

This Perl script takes a VCF and a sample-population pair list as input and outputs the allele count at each site for each population (you need the -s option). The Lua version has a few comments. Example:

zcat ALL.2of4intersection.20100804.genotypes.vcf.gz | vcf2afs.pl -s 20100804.ALL.panel -

This DOESN'T answer your question, but you can learn from the script.

ADD COMMENT
0
Entering edit mode

Dear Ih3, Thanks a lot for your kindness. I was encouraged by your former comments.

ADD REPLY

Login before adding your answer.

Traffic: 2345 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6