Entering edit mode
5.8 years ago
adeena_hassan
▴
50
Assalam o Alaikum Everyone!
I have 3 annotated variants list files (excel files) of three patients from same family (2 parents 1 child). I want to extract the positions that are Hetro in parents and Homo in child.
For example:
SNP_Indel_1.xlsx (Parent)
#CHROM POS REF ALT DP AD QUAL MQ Zygosity FILTER Effect
chr1 762,273 G A 39 39 1372.77 41.55 HET PASS non_coding_exon_variant
chr1 866,319 G A 20 20 740.77 60.0 HOM PASS intron_variant
SNP_Indel_2.xlsx (Parent)
#CHROM POS REF ALT DP AD QUAL MQ Zygosity FILTER Effect
chr1 762,109 C T 173 42 1011.77 52.75 HET PASS non_coding_exon_variant
chr1 762,273 G A 35 35 1302.77 42.67 HET PASS non_coding_exon_variant
SNP_Indel_3.xlsx (Child)
#CHROM POS REF ALT DP AD QUAL MQ Zygosity FILTER Effect
chr1 762,273 G A 40 40 1457.77 40.67 HOM PASS non_coding_exon_variant
chr1 866,319 G A 15 15 546.77 60.0 HOM PASS intron_variant
In the above example position 762,273 is HET in both parents and HOM in child i want to extract that whole row in a separate file. Is there any command line solution or perl script for this ??? Could anyone help ??
Thank you so much!
Hello adeena_hassan ,
I would recommend to convert your files to valid
vcf
files. Doing this you can make use of all the beautiful tools out there for filteringvcf
files. I could guide you. But for this, I have to know which reference genome was used and/or if you have the correspondingfasta
file or alignment file (bam
)?fin swimmer
Hi finswimmer ,
Human genome was used as a reference genome and I have the corresponding vcf file (Filtered_Variants.vcf) , bam file and fastq files. Could u guide now?
Hello again,
you have a corresponding
vcf
file beside the excel file you've presented? Then we can start working directly with it. Is it one file per sample or is it one file that contains all three samples?If you are unsure just copy&paste all header lines from the beginning of the file. These lines starts with
#
.yes, i have the corresponding vcf files of all three samples. It is one file per sample.
Following are the headers of one file
Is this the whole header? Or are there lines with contig information? Those look like this:
##contig=<ID=chr1,length=249250621>
In the meantime, install bcftools if not already done. I recommend doing it via conda. The first part of this tutorial I wrote some time ago, might be useful for you, if you are not familiar with conda.
yeah, there are lines with with contig information. I skipped them earlier because of the header length.
bcfftools already installed. Version: 0.1.19-96b5f2294a
Thank you so much for response!
Very good. But please upgrade the
bcftools
version. This one is very, very old. Lots of features are missing. The current one is v1.9. If this is done you can solver your task like this:1.
bgzip
andtabix
-index yourvcf
files:2. merge the three file into one
./.
. You can use the--missing-to-ref
parameter to set them to0/0
(which homozygous reference base) if you like. But be careful with this option3. Let's filter for homozygous variants in the child, that are heterozygous in both parents:
More examples and options for filtering can be found the online manual.
Thank you so much. Its helpful. You made my day!
you can write a simple Python script for this. Just use the
csv
module to read each line into a dict, then you can print out the entries that match your criteriaI hve a little bit understanding of perl language. Is this csv module work with perl ? And is there any possible command-line solution in linux ??
You can do a simple extraction using awk for each file. The commands will look something like this
This will extract all the HET positions but I need only those HET positions in parents that are HOM in child.
You can make a list using the POS column using awk, filter with uniq, and search the child file like this