Question

how to filter 3 variants files (in excel format) simultaneously?

0

Entering edit mode

6.6 years ago

adeena_hassan ▴ 50

Assalam o Alaikum Everyone!

I have 3 annotated variants list files (excel files) of three patients from same family (2 parents 1 child). I want to extract the positions that are Hetro in parents and Homo in child.

For example:

SNP_Indel_1.xlsx (Parent)

#CHROM  POS REF ALT DP  AD  QUAL    MQ       Zygosity   FILTER     Effect
chr1    762,273 G   A   39  39  1372.77 41.55   HET    PASS    non_coding_exon_variant
chr1    866,319 G   A   20  20  740.77  60.0    HOM    PASS     intron_variant

SNP_Indel_2.xlsx (Parent)

#CHROM  POS REF ALT DP  AD  QUAL    MQ      Zygosity    FILTER  Effect
chr1    762,109 C   T   173 42  1011.77 52.75   HET PASS    non_coding_exon_variant
chr1    762,273 G   A   35  35  1302.77 42.67   HET PASS    non_coding_exon_variant

SNP_Indel_3.xlsx (Child)

#CHROM  POS REF ALT DP  AD  QUAL    MQ       Zygosity   FILTER  Effect
chr1    762,273 G   A   40  40  1457.77 40.67   HOM PASS    non_coding_exon_variant
chr1    866,319 G   A   15  15  546.77  60.0    HOM PASS    intron_variant

In the above example position 762,273 is HET in both parents and HOM in child i want to extract that whole row in a separate file. Is there any command line solution or perl script for this ??? Could anyone help ??

Thank you so much!

whole exome sequencing perl commandline Snp • 3.9k views

ADD COMMENT • link 6.6 years ago by adeena_hassan ▴ 50

1

Entering edit mode

Hello adeena_hassan ,

I would recommend to convert your files to valid vcf files. Doing this you can make use of all the beautiful tools out there for filtering vcf files. I could guide you. But for this, I have to know which reference genome was used and/or if you have the corresponding fasta file or alignment file (bam)?

fin swimmer

ADD REPLY • link 6.6 years ago by finswimmer 16k

0

Entering edit mode

Hi finswimmer ,

Human genome was used as a reference genome and I have the corresponding vcf file (Filtered_Variants.vcf) , bam file and fastq files. Could u guide now?

ADD REPLY • link 6.6 years ago by adeena_hassan ▴ 50

0

Entering edit mode

Hello again,

you have a corresponding vcf file beside the excel file you've presented? Then we can start working directly with it. Is it one file per sample or is it one file that contains all three samples?

If you are unsure just copy&paste all header lines from the beginning of the file. These lines starts with #.

ADD REPLY • link 6.6 years ago by finswimmer 16k

0

Entering edit mode

yes, i have the corresponding vcf files of all three samples. It is one file per sample.

Following are the headers of one file

##fileformat=VCFv4.1
##ALT=<ID=*:DEL,Description="Represents any possible spanning deletion allele at this location">
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=MG_INDEL_Filter,Description="QD < 2.0  ||  FS > 200.0  ||  ReadPosRankSum < -20.0">
##FILTER=<ID=MG_SNP_Filter,Description="QD < 2.0  || FS > 60.0 || MQ < 40.0 ||  MQRankSum < -12.5  ||  ReadPosRankSum < -8.0">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine.CombineVariants=<ID=CombineVariants,Version=2015.1-3.4.0-1-ga5ca3fc,Date="Mon May 02 11:39:20 KST 2016",Epoch=1462156760565,CommandLineOptions="analysis_type=CombineVariants input_file=[] showFullBamList=false read_buffer_size=null phone_home=NO_ET gatk_key=null tag=NA read_filter=[] disable_read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/ruby/Tools/chul/ExomePipeline/reference/ucsc.hg19.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=true no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 logging_level=INFO log_to_file=null help=false version=false variant=[(RodBindingCollection [(RodBinding name=variant source=/ruby/Analysis/Project/1603KHF-0081/Analysis/DER24-1/VARIANT/DER24-1.rawSNPs.filtered.vcf)]), (RodBindingCollection [(RodBinding name=variant2 source=/ruby/Analysis/Project/1603KHF-0081/Analysis/DER24-1/VARIANT/DER24-1.rawINDELs.filtered.vcf)])] out=/ruby/Analysis/Project/1603KHF-0081/Analysis/DER24-1/VARIANT/DER24-1.Filtered.Variants.vcf genotypemergeoption=UNSORTED filteredrecordsmergetype=KEEP_IF_ANY_UNFILTERED multipleallelesmergetype=BY_TYPE rod_priority_list=null printComplexMerges=false filteredAreUncalled=false minimalVCF=false excludeNonVariants=false setKey=set assumeIdenticalSamples=false minimumN=1 suppressCommandLineHeader=false mergeInfoWithMaxAC=false filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">

ADD REPLY • link 6.6 years ago by adeena_hassan ▴ 50

1

Entering edit mode

Is this the whole header? Or are there lines with contig information? Those look like this: ##contig=<ID=chr1,length=249250621>

In the meantime, install bcftools if not already done. I recommend doing it via conda. The first part of this tutorial I wrote some time ago, might be useful for you, if you are not familiar with conda.

ADD REPLY • link 6.6 years ago by finswimmer 16k

0

Entering edit mode

yeah, there are lines with with contig information. I skipped them earlier because of the header length.

bcfftools already installed. Version: 0.1.19-96b5f2294a

Thank you so much for response!

##contig=<ID=chrM,length=16571,assembly=hg19>
##contig=<ID=chr1,length=249250621,assembly=hg19>
##contig=<ID=chr2,length=243199373,assembly=hg19>
##contig=<ID=chr3,length=198022430,assembly=hg19>
##contig=<ID=chr4,length=191154276,assembly=hg19>
##contig=<ID=chr5,length=180915260,assembly=hg19>
##contig=<ID=chr6,length=171115067,assembly=hg19>
##contig=<ID=chr7,length=159138663,assembly=hg19>
##contig=<ID=chr8,length=146364022,assembly=hg19>
##contig=<ID=chr9,length=141213431,assembly=hg19>
##contig=<ID=chr10,length=135534747,assembly=hg19>
##contig=<ID=chr11,length=135006516,assembly=hg19>
##contig=<ID=chr12,length=133851895,assembly=hg19>
##contig=<ID=chr13,length=115169878,assembly=hg19>
##contig=<ID=chr14,length=107349540,assembly=hg19>
##contig=<ID=chr15,length=102531392,assembly=hg19>
##contig=<ID=chr16,length=90354753,assembly=hg19>
##contig=<ID=chr17,length=81195210,assembly=hg19>
##contig=<ID=chr18,length=78077248,assembly=hg19>
##contig=<ID=chr19,length=59128983,assembly=hg19>
##contig=<ID=chr20,length=63025520,assembly=hg19>
##contig=<ID=chr21,length=48129895,assembly=hg19>
##contig=<ID=chr22,length=51304566,assembly=hg19>
##contig=<ID=chrX,length=155270560,assembly=hg19>
##contig=<ID=chrY,length=59373566,assembly=hg19>
##contig=<ID=chr1_gl000191_random,length=106433,assembly=hg19>
##contig=<ID=chr1_gl000192_random,length=547496,assembly=hg19>
##contig=<ID=chr4_ctg9_hap1,length=590426,assembly=hg19>
##contig=<ID=chr4_gl000193_random,length=189789,assembly=hg19>
##contig=<ID=chr4_gl000194_random,length=191469,assembly=hg19>
##contig=<ID=chr6_apd_hap1,length=4622290,assembly=hg19>
##contig=<ID=chr6_cox_hap2,length=4795371,assembly=hg19>
##contig=<ID=chr6_dbb_hap3,length=4610396,assembly=hg19>
##contig=<ID=chr6_mann_hap4,length=4683263,assembly=hg19>
##contig=<ID=chr6_mcf_hap5,length=4833398,assembly=hg19>
##contig=<ID=chr6_qbl_hap6,length=4611984,assembly=hg19>
##contig=<ID=chr6_ssto_hap7,length=4928567,assembly=hg19>
##contig=<ID=chr7_gl000195_random,length=182896,assembly=hg19>
##contig=<ID=chr8_gl000196_random,length=38914,assembly=hg19>
##contig=<ID=chr8_gl000197_random,length=37175,assembly=hg19>
##contig=<ID=chr9_gl000198_random,length=90085,assembly=hg19>
##contig=<ID=chr9_gl000199_random,length=169874,assembly=hg19>
##contig=<ID=chr9_gl000200_random,length=187035,assembly=hg19>
##contig=<ID=chr9_gl000201_random,length=36148,assembly=hg19>
##contig=<ID=chr11_gl000202_random,length=40103,assembly=hg19>
##contig=<ID=chr17_ctg5_hap1,length=1680828,assembly=hg19>
##contig=<ID=chr17_gl000203_random,length=37498,assembly=hg19>
##contig=<ID=chr17_gl000204_random,length=81310,assembly=hg19>
##contig=<ID=chr17_gl000205_random,length=174588,assembly=hg19>
##contig=<ID=chr17_gl000206_random,length=41001,assembly=hg19>
##contig=<ID=chr18_gl000207_random,length=4262,assembly=hg19>
##contig=<ID=chr19_gl000208_random,length=92689,assembly=hg19>
##contig=<ID=chr19_gl000209_random,length=159169,assembly=hg19>
##contig=<ID=chr21_gl000210_random,length=27682,assembly=hg19>
##contig=<ID=chrUn_gl000211,length=166566,assembly=hg19>
##contig=<ID=chrUn_gl000212,length=186858,assembly=hg19>
##contig=<ID=chrUn_gl000213,length=164239,assembly=hg19>
##contig=<ID=chrUn_gl000214,length=137718,assembly=hg19>
##contig=<ID=chrUn_gl000215,length=172545,assembly=hg19>
##contig=<ID=chrUn_gl000216,length=172294,assembly=hg19>
##contig=<ID=chrUn_gl000217,length=172149,assembly=hg19>
##contig=<ID=chrUn_gl000218,length=161147,assembly=hg19>
##contig=<ID=chrUn_gl000219,length=179198,assembly=hg19>
##contig=<ID=chrUn_gl000220,length=161802,assembly=hg19>
##contig=<ID=chrUn_gl000221,length=155397,assembly=hg19>
##contig=<ID=chrUn_gl000222,length=186861,assembly=hg19>
##contig=<ID=chrUn_gl000223,length=180455,assembly=hg19>
##contig=<ID=chrUn_gl000224,length=179693,assembly=hg19>
##contig=<ID=chrUn_gl000225,length=211173,assembly=hg19>
##contig=<ID=chrUn_gl000226,length=15008,assembly=hg19>
##contig=<ID=chrUn_gl000227,length=128374,assembly=hg19>
##contig=<ID=chrUn_gl000228,length=129120,assembly=hg19>
##contig=<ID=chrUn_gl000229,length=19913,assembly=hg19>
##contig=<ID=chrUn_gl000230,length=43691,assembly=hg19>
##contig=<ID=chrUn_gl000231,length=27386,assembly=hg19>
##contig=<ID=chrUn_gl000232,length=40652,assembly=hg19>
##contig=<ID=chrUn_gl000233,length=45941,assembly=hg19>
##contig=<ID=chrUn_gl000234,length=40531,assembly=hg19>

ADD REPLY • link 6.6 years ago by adeena_hassan ▴ 50

1

Entering edit mode

Very good. But please upgrade the bcftools version. This one is very, very old. Lots of features are missing. The current one is v1.9. If this is done you can solver your task like this:

1. bgzip and tabix-index your vcffiles:

$ bgzip -c child.vcf > child.vcf.gz
$ bgzip -c mother.vcf > mother.vcf.gz
$ bgzip -c father.vcf > father.vcf.gz

$ tabix child.vcf.gz
$ tabix mother.vcf.gz
$ tabix father.vcf.gz

2. merge the three file into one

$ bcftools merge child.vcf.gz mother.vcf.gz father.vcf.gz > trio.vcf

The order you give for the vcf files is the order of sample information in your final vcf.
The genotype for variants that are not found in one of the file will be ./.. You can use the --missing-to-ref parameter to set them to 0/0 (which homozygous reference base) if you like. But be careful with this option

3. Let's filter for homozygous variants in the child, that are heterozygous in both parents:

$ bcftools view -i 'GT[0]="AA" && GT[1]="het" && GT[2]="het"' trio.vcf > trio_filtered.vcf

More examples and options for filtering can be found the online manual.

ADD REPLY • link 6.6 years ago by finswimmer 16k

0

Entering edit mode

Thank you so much. Its helpful. You made my day!

ADD REPLY • link 6.6 years ago by adeena_hassan ▴ 50

0

Entering edit mode

you can write a simple Python script for this. Just use the csv module to read each line into a dict, then you can print out the entries that match your criteria

ADD REPLY • link 6.6 years ago by steve ★ 3.5k

0

Entering edit mode

I hve a little bit understanding of perl language. Is this csv module work with perl ? And is there any possible command-line solution in linux ??

ADD REPLY • link 6.6 years ago by adeena_hassan ▴ 50

0

Entering edit mode

You can do a simple extraction using awk for each file. The commands will look something like this

awk '{ if ($9 ~ "HET") parents.xlsx print $0 }' > het_parents.xlsx

awk '{ if ($9 ~ "HOM") child.xlsx print $0 }' > hom_child.xlsx

ADD REPLY • link 6.6 years ago by Gio12 ▴ 220

0

Entering edit mode

This will extract all the HET positions but I need only those HET positions in parents that are HOM in child.

ADD REPLY • link 6.6 years ago by adeena_hassan ▴ 50

0

Entering edit mode

You can make a list using the POS column using awk, filter with uniq, and search the child file like this

cat het_parent1.xlsx het_parent2.xlsx | awk '{print $2}' | uniq > List.txt
cat child.xlsx | grep -f List.txt | awk '{ if ($9 ~ "HOM") print $0 }' > hom_child.xlsx

ADD REPLY • link 6.6 years ago by Gio12 ▴ 220