Hi,
I try to use public gnomad genomes for downstream tests, but the problem is that I only need male controls from these files. The file looks like this (also header, but not shown due to large size)
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 69511 rs2691305 A G . PASS AC=70482;AN=83312;AF=0.846001;popmax=eas;faf95_popmax=0.975176;AC_non_v2_XX=28734;AN_non_v2_XX=33774;AF_non_v2_XX=0.850773;nhomalt_non_v2_XX=13253;AC_non_cancer_fin_XX=1080;AN_non_cancer_fin_XX=1090;AF_non_cancer_fin_XX=0.990826;nhomalt_non_cancer_fin_XX=537;AC_non_neuro_nfe=32992;AN_non_neuro_nfe=34106;AF_non_neuro_nfe=0.967337;nhomalt_non_neuro_nfe=16243;AC_non_neuro_afr_XY=5275;AN_non_neuro_afr_XY=8862;AF_non_neuro_afr_XY=0.595238;nhomalt_non_neuro_afr_XY=1908;AC_non_neuro_nfe_XY=13529;AN_non_neuro_nfe_XY=13954;AF_non_neuro_nfe_XY=0.969543;nhomalt_non_neuro_nfe_XY=6668;AC_controls_and_biobanks_eas_XY=1209;AN_controls_and_biobanks_eas_XY=1210;AF_controls_and_biobanks_eas_XY=0.999174;nhomalt_controls_and_biobanks_eas_XY=604;AC_non_neuro_sas_XX=623;AN_non_neuro_sas_XX=646;AF_non_neuro_sas_XX=0.964396;nhomalt_non_neuro_sas_XX=304;AC_non_v2=53211;AN_non_v2=62346;AF_non_v2=0.853479;nhomalt_non_v2=24615;AC_non_topmed_nfe_XX=4123;AN_non_topmed_nfe_XX=4274;AF_non_topmed_nfe_XX=0.96467;nhomalt_non_topmed_nfe_XX=2022;AC_non_v2_mid=132;AN_non_v2_mid=146;AF_non_v2_mid=0.90411;nhomalt_non_v2_mid=63;AC_non_topmed_sas=2561;AN_non_topmed_sas=2620;AF_non_topmed_sas=0.977481;nhomalt_non_topmed_sas=1262;AC_non_cancer_eas_XX=1882;AN_non_cancer_eas_XX=1882;AC_nfe=34637;AN_nfe=35806;AF_nfe=0.967352;nhomalt_nfe=17050;AC_popmax=4423;AN_popmax=4424;AF_popmax=0.999774;nhomalt_popmax=2211;faf95_sas=0.945756;faf99_sas=0.93297;faf95_eas=0.975176;faf99_eas=0.965135;faf95_amr=0.876957;faf99_amr=0.869526;faf95_afr=0.586993;faf99_afr=0.583779;faf95=0.840765;faf99=0.838605;faf95_nfe=0.958818;faf99_nfe=0.955301;age_hist_het_bin_freq=123|146|149|217|287|242|200|187|124|72;age_hist_het_n_smaller=460;age_hist_het_n_larger=24;age_hist_hom_bin_freq=569|655|765|1405|2028|1746|1691|1518|1005|654;age_hist_hom_n_smaller=1206;age_hist_hom_n_larger=215;FS=0;MQ=41.6849;MQRankSum=-3.98;QD=26.0414;ReadPosRankSum=0.51;VarDP=1761397;QUALapprox=45869299;AS_FS=0;AS_MQ=41.6847;AS_MQRankSum=-3.972;AS_pab_max=1;AS_QD=26.0413;AS_ReadPosRankSum=0.51;AS_SOR=0.865485;InbreedingCoeff=0.521183;AS_SB_TABLE=99037,96244|856526,709549;AS_VQSLOD=0.4266;AS_culprit=AS_MQRankSum;NEGATIVE_TRAIN_SITE;allele_type=snv;n_alt_alleles=2;variant_type=multi-snv;segdup;gq_hist_alt_bin_freq=0|0|0|0|39|89|5016|5173|2494|4247|3343|1353|2059|1764|858|1286|1197|599|785|7691;gq_hist_all_bin_freq=0|0|0|0|2833|660|5232|5237|2506|4249|3343|1354|2060|1764|859|1287|1197|599|785|7691;dp_hist_alt_bin_freq=0|0|12706|9449|5594|4053|2707|1532|828|486|262|166|89|45|36|12|6|7|2|3;dp_hist_alt_n_larger=10;dp_hist_all_bin_freq=0|0|13838|11344|6042|4209|2735|1535|828|487|262|166|89|45|36|12|6|7|2|3;dp_hist_all_n_smaller=0;dp_hist_all_n_larger=10;ab_hist_alt_bin_freq=0|0|0|0|193|367|513|688|876|676|772|504|439|263|124|56|25|8|0|0;cadd_raw_score=0.112916;cadd_phred=2.209;revel_score=0.053;splice_ai_max_ds=0.02;splice_ai_consequence=donor_gain;primate_ai_score=0.632586;vep=G|missense_variant|MODERATE|OR4F5|ENSG00000186092|Transcript|ENST00000335137|protein_coding|1/1||ENST00000335137.4:c.421A>G|ENSP00000334393.3:p.Thr141Ala|457|421|141|T/A|Aca/Gca|1||1|SNV||HGNC|HGNC:14825|YES||P1|CCDS30547.1|ENSP00000334393|||||tolerated(0.820)|benign(0.000)|Gene3D:1&Pfam:PF13853&PROSITE_profiles:PS50262&Superfamily:SSF81321&Transmembrane_helices:TMhelix&CDD:cd15226|||||||||,G|missense_variant|MODERATE|OR4F5|ENSG00000186092|Transcript|ENST00000641515|protein_coding|3/3||ENST00000641515.2:c.484A>G|ENSP00000493376.2:p.Thr162Ala|544|484|162|T/A|Aca/Gca|1||1|SNV||HGNC|HGNC:14825|||||ENSP00000493376|||||tolerated(0.850)|benign(0.000)|Transmembrane_helices:TMhelix&CDD:cd15226&PANTHER:PTHR26451&PANTHER:PTHR26451&Pfam:PF13853&PROSITE_profiles:PS50262&Gene3D:1&Superfamily:SSF81321|||||||||,G|missense_variant|MODERATE|OR4F5|79501|Transcript|NM_001005484.1|protein_coding|1/1||NM_001005484.1:c.421A>G|NP_001005484.1:p.Thr141Ala|421|421|141|T/A|Aca/Gca|1||1|SNV||EntrezGene|HGNC:14825|YES||||NP_001005484.1|||||tolerated(0.820)|benign(0.000)||||||||||
I have to remove all parts of file that contains _XX in the end ( need to do test against men control, so all women data should be removed from control file)
I tried bcftools query bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%QUAL\t%FILTER\t%INFO/AC_nfe_XY\t%INFO/AN_nfe_XY\n' $file > $outfile
but got Error: no such tag defined in the VCF header: INFO/_XY
Also bcftools query removes file header, and to save all men control numbers, I have to write script row for three meters.
Is there some easier and smarter way to remove female controls from this file without damaging file structure?
I also tried to modify this MIT sample code to take only controls related with men, but without success. https://github.com/mhguo1/TRAPD/blob/master/code/count_controls.py
Thank you!
works on my machine (1.12):
Thank you!
Also the problem is that bcftools removes names of controls, but actually I need to save them, so there is no additional tabs and new lines, but only "XX=" has to be removed
what is 'controls' ? the VCF header ? why don't you call
bcftools view --header-only
before a second call tobcftools query
?ok, thank you! The command should remove all characters between ; symbols,( _XX )
So that from
stays only
I also try to implement sed commands, but no success
it's not clear how the command above is related to the
bcftools query
....you want
bcftools annotate --rename-annots file.txt -x '^INFO/....
Maybe bcftools are not necessary in this case
there are a lot of female controls in gnomad file, so maybe bcftools with defining INFo field is not the smartest approach