Preface - thinking long-term:
If it is OK, I would like to start with a general remark that I hope will help in the long term. The most important thing I have to say in response to this question is not the direct answer (that is below). Rather, it is something I wish I had learned earlier: for applications like this, there are usually well-developed bioinformatics tools that have been built by groups of experienced people working together.
In particular, I think this claim is especially likely to hold for routine I/O operations, which, viewed as string parsing, do not differ substantively from text processing in other fields that handle large data.
What I am trying to say is that in cases like this (i.e., subsetting a large text file) you are usually better off using a published tool, for several reasons:
- No matter how skilled a coder is, anyone can make a mistake, myself included.
- A mistake is more likely to have been caught and fixed if it was made in a tool that has been extensively validated.
- If you do go with a pre-written tool, there are other benefits. For instance, such tools frequently offer many other capabilities. I'll illustrate this in the specific answer below.
Specific answer:
I recommend considering whether bcftools could be a good option for you. You can use this tool not only to subset the VCF file into groups of patients or even single-sample .vcf files (as you have requested), but also to filter (even very large) .vcf files based on variant properties, as can be seen here.
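As a rough illustration of the two uses just mentioned, here is a minimal Python sketch that only assembles bcftools command lines; it is not a definitive recipe. The file names (`cohort.vcf.gz`, `SAMPLE_A.vcf.gz`), the sample name `SAMPLE_A`, the `QUAL>=30` threshold, and the helper function names are all hypothetical placeholders. The bcftools options used are `query -l` (list sample names), `view -s` (keep the named samples), `view -i` (keep records matching an expression), and `-Oz -o` (write bgzip-compressed output to a file).

```python
import os
import shlex
import shutil
import subprocess


def list_samples_cmd(vcf_path):
    """Build the command that prints one sample name per line."""
    return ["bcftools", "query", "-l", vcf_path]


def subset_cmd(vcf_path, samples, out_path):
    """Build the command that keeps only `samples`, written as bgzipped VCF."""
    return ["bcftools", "view", "-s", ",".join(samples),
            "-Oz", "-o", out_path, vcf_path]


def filter_cmd(vcf_path, expression, out_path):
    """Build the command that keeps only records matching `expression`."""
    return ["bcftools", "view", "-i", expression,
            "-Oz", "-o", out_path, vcf_path]


def run(cmd):
    """Run a command, raise on a non-zero exit status, and return its stdout."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


if __name__ == "__main__":
    vcf = "cohort.vcf.gz"  # hypothetical input file
    commands = [
        subset_cmd(vcf, ["SAMPLE_A"], "SAMPLE_A.vcf.gz"),   # hypothetical sample
        filter_cmd(vcf, "QUAL>=30", "cohort.q30.vcf.gz"),   # hypothetical cutoff
    ]
    # Only execute when bcftools and the input file are actually present;
    # otherwise just show what would be run.
    can_run = shutil.which("bcftools") is not None and os.path.exists(vcf)
    for cmd in commands:
        print(shlex.join(cmd))
        if can_run:
            run(cmd)
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with expressions such as `QUAL>=30`, where the shell would otherwise treat `>` as a redirection.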
In addition, suppose you ultimately study these samples for weeks and weeks, or even for years, @6e02999e. Suppose that at some point in this study you come to suspect that some phenomenon, like alternative splicing, could be responsible for what is seen in the people/animals the samples come from.
If you go with a published tool like samtools, plink2, bcftools, etc., you are likely to be able to re-use elements of the code you have already written, for instance to annotate your SNPs with additional modifiers. If you do it on your own, by contrast, you are back to square one every time you want to do a new task.
I hope this helps you!
VL
Could you please explain exactly why the VCF file cannot be split, and what format it is in at present? For instance, vcf.gz.tbi, or just .vcf, or .bcf2, or something else?
Sorry, I'm new to bioinformatics. Not very good at some operations. My file is a .vcf.gz file, 30G in size, so I can't split it using normal linux commands. I'll attach some examples below. Regards, Vincent Laufer