Question

How to calculate allele frequencies by group

2.9 years ago

yoser4 ▴ 10

Hello all, I have a VCF file containing SNPs for 100 samples (divided into a dozen varieties), and I want to group the samples by variety and then calculate allele frequencies separately for each group. The file is very large and cannot easily be split. If anyone has any ideas, it would be greatly appreciated. (The reason for grouping is that I want to draw a graph of this form: each polyline represents a variety, so I can see at a glance the difference in SNP frequency between varieties.)

[Image: example line graph, one polyline per variety]

frequency allele • 1.8k views

Could you please explain exactly why the VCF file cannot be split, and what format it is in at present? For instance, is it .vcf.gz with a .tbi index, plain .vcf, BCF2, or something else?

2.9 years ago by LauferVA 4.8k

Sorry, I'm new to bioinformatics and not very good at some operations. My file is a .vcf.gz file, 30 GB in size, so I can't split it using normal Linux commands. I'll attach some examples below. Regards, Vincent Laufer

2.9 years ago by yoser4 ▴ 10

[Image: example lines from the VCF file]

2.9 years ago by yoser4 ▴ 10

score 2 · Answer 1 · 2022-08-25

Preface - thinking long-term:

If it is OK, I would like to start with a general remark that I hope will help in the long term. The most important thing I have to say in response to this question is not the direct answer (that is below). Rather, it is something I wish I had learned earlier: for applications like this, there are usually well-developed bioinformatics tools, built by groups of experienced people working together.

In particular, I think this is most likely to be true for routine I/O operations, which, if you think of them as parsing strings, do not differ substantively from text processing on large data in other fields.

What I am trying to say is that in cases like this (i.e., subsetting a large text file) you are usually better off using a published tool, for several reasons.

1. No matter how skilled a coder is, anyone can make a mistake, myself included.
2. That mistake is more likely to have been caught and fixed if it was made in a tool that has been extensively validated.
3. If you go with a pre-written tool, there are other benefits too. For instance, such tools frequently have many other capabilities. I'll illustrate in the specific answer, below.

Specific answer:

I recommend considering whether bcftools could be a good option for you. You can use this tool not only to subset the VCF file into groups of samples or even single-sample .vcf files (as you have requested), but also to filter (even very large) .vcf files based on variant properties, as its documentation shows.

In addition, suppose you end up studying these samples for weeks and weeks, or even for years. Suppose that at some point in the study you come to suspect that some phenomenon, like alternative splicing, could be responsible for what is seen in the organisms the samples come from. If you go with a published tool like samtools, plink2, bcftools, etc., you are likely to be able to re-use parts of the code you have already written, for instance, to annotate your SNPs with additional modifiers. If you do it on your own, by contrast, you are back to square one every time you want to do a new task.
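With bcftools itself, subsetting by group comes down to one sample list per variety passed to `bcftools view -S`. If I recall the `+fill-tags` plugin correctly, it can also take a sample-to-group file and write per-group AF tags in a single pass, but please check its help output before relying on that. To make the underlying calculation concrete, here is a minimal Python sketch (standard library only) that streams the file once and computes the non-reference allele frequency per group, so nothing has to be split. The function names, the sample-to-variety map, and the file name in the usage comment are my own placeholders, and all ALT alleles at a multi-allelic site are lumped together as "non-reference". Treat it as an illustration of the arithmetic, not a replacement for a validated tool.

```python
import gzip
from collections import defaultdict


def open_vcf(path):
    """Open a plain-text or gzip/bgzip-compressed VCF for line-by-line reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def group_allele_freqs(lines, sample_to_group):
    """Yield (chrom, pos, ref, alt, {group: alt_freq}) for each VCF record.

    alt_freq is the fraction of called alleles that are non-reference;
    it is None when a group has no called genotypes at that site.
    """
    group_columns = defaultdict(list)  # group label -> sample column indices
    for line in lines:
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            # Sample names start at the 10th column of the header line.
            for idx, name in enumerate(fields[9:], start=9):
                if name in sample_to_group:
                    group_columns[sample_to_group[name]].append(idx)
            continue
        if len(fields) < 10:
            continue  # record without genotype columns
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue  # no genotype field at this record
        gt_i = fmt.index("GT")
        freqs = {}
        for group, cols in group_columns.items():
            alt = called = 0
            for idx in cols:
                gt = fields[idx].split(":")[gt_i]
                # Treat phased (|) and unphased (/) genotypes the same way.
                for allele in gt.replace("|", "/").split("/"):
                    if allele != ".":  # "." marks a missing allele call
                        called += 1
                        alt += allele != "0"
            freqs[group] = alt / called if called else None
        yield fields[0], int(fields[1]), fields[3], fields[4], freqs


# Hypothetical usage; the file name and the sample-to-variety map are placeholders.
# groups = {"sample01": "varietyA", "sample02": "varietyB"}
# with open_vcf("cohort.vcf.gz") as handle:
#     for chrom, pos, ref, alt, freqs in group_allele_freqs(handle, groups):
#         print(chrom, pos, ref, alt, freqs, sep="\t")
```

Because the generator yields one record at a time, memory use stays flat regardless of file size, and the printed table (one row per SNP, one frequency per variety) is the shape you need for a plot with one polyline per variety.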

I hope this helps you!

VL