2.1 years ago
rheab1230
Hello everyone, I have a genotype (VCF) file and a gene expression file. I want to split the genotype file by subpopulation and train a model on each subset, producing one model file per population. I do not understand how to separate the VCF file by the population each sample comes from. Also, is there a package that can learn population/ethnicity groups from the data, group the samples accordingly, and then run elastic net training and cross-validation to generate a model file for each group? Thank you.
This is what my VCF file looks like:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FILTER=<ID=VQSRTrancheSNP99.80to99.90,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -12.2518 <= x < -2.796">
##FILTER=<ID=VQSRTrancheINDEL99.95to100.00,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -91515.6585 <= x < -32.0217">
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=InbreedingCoeff,Description="InbreedingCoeff < -0.3">
##FILTER=<ID=VQSRTrancheINDEL99.95to100.00+,Description="Truth sensitivity tranche level for INDEL model at VQS Lod < -91515.6585">
##FILTER=<ID=VQSRTrancheSNP99.95to100.00+,Description="Truth sensitivity tranche level for SNP model at VQS Lod < -292808.5957">
##FILTER=<ID=VQSRTrancheSNP99.95to100.00,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -292808.5957 <= x < -34.5312">
##FILTER=<ID=VQSRTrancheINDEL99.90to99.95,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -32.0217 <= x < -19.6278">
##FILTER=<ID=VQSRTrancheSNP99.90to99.95,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -34.5312 <= x < -12.2518">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=.,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=.,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=NEGATIVE_TRAIN_SITE,Number=0,Type=Flag,Description="This variant was used to build the negative training set of bad variants">
##INFO=<ID=POSITIVE_TRAIN_SITE,Number=0,Type=Flag,Description="This variant was used to build the positive training set of good variants">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="Log odds of being a true variant versus being false under the trained gaussian mixture model">
##INFO=<ID=culprit,Number=1,Type=String,Description="The annotation which was the worst performing in the Gaussian mixture model, likely the reason why the variant was filtered out">
##INFO=<ID=wasSplit,Number=0,Type=Flag,Description="Specifies that the variant was split from a multi-allelic site">
##reference=file:///cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta
##INFO=<ID=SLO,Number=0,Type=Flag,Description="Has SubmitterLinkOut - From SNP->SubSNP->Batch.link_out">
##INFO=<ID=NSF,Number=0,Type=Flag,Description="Has non-synonymous frameshift A coding region variation where one allele in the set changes all downstream amino acids. FxnClass = 44">
##INFO=<ID=R3,Number=0,Type=Flag,Description="In 3' gene region FxnCode = 13">
##INFO=<ID=R5,Number=0,Type=Flag,Description="In 5' gene region FxnCode = 15">
##INFO=<ID=NSN,Number=0,Type=Flag,Description="Has non-synonymous nonsense A coding region variation where one allele in the set changes to STOP codon (TER). FxnClass = 41">
##INFO=<ID=NSM,Number=0,Type=Flag,Description="Has non-synonymous missense A coding region variation where one allele in the set changes protein peptide. FxnClass = 42">
##INFO=<ID=G5A,Number=0,Type=Flag,Description=">5% minor allele frequency in each and all populations">
##INFO=<ID=COMMON,Number=1,Type=Integer,Description="RS is a common SNP. A common SNP is one that has at least one 1000Genomes population with a minor allele of frequency >= 1% and for which 2 or more founders contribute to that minor allele frequency.">
##INFO=<ID=RS,Number=1,Type=Integer,Description="dbSNP ID (i.e. rs number)">
##INFO=<ID=RV,Number=0,Type=Flag,Description="RS orientation is reversed">
##INFO=<ID=TPA,Number=0,Type=Flag,Description="Provisional Third Party Annotation(TPA) (currently rs from PHARMGKB who will give phenotype data)">
##INFO=<ID=CFL,Number=0,Type=Flag,Description="Has Assembly conflict. This is for weight 1 and 2 variant that maps to different chromosomes on different assemblies.">
##INFO=<ID=GNO,Number=0,Type=Flag,Description="Genotypes available. The variant has individual genotype (in SubInd table).">
##INFO=<ID=VLD,Number=0,Type=Flag,Description="Is Validated. This bit is set if the variant has 2+ minor allele count based on frequency or genotype data.">
##INFO=<ID=ASP,Number=0,Type=Flag,Description="Is Assembly specific. This is set if the variant only maps to one assembly">
##INFO=<ID=ASS,Number=0,Type=Flag,Description="In acceptor splice site FxnCode = 73">
##INFO=<ID=G5,Number=0,Type=Flag,Description=">5% minor allele frequency in 1+ populations">
##INFO=<ID=OM,Number=0,Type=Flag,Description="Has OMIM/OMIA">
##INFO=<ID=PMC,Number=0,Type=Flag,Description="Links exist to PubMed Central article">
##INFO=<ID=SSR,Number=1,Type=Integer,Description="Variant Suspect Reason Codes (may be more than one value added together) 0 - unspecified, 1 - Paralog, 2 - byEST, 4 - oldAlign, 8 - Para_EST, 16 - 1kg_failed, 1024 - other">
##INFO=<ID=RSPOS,Number=1,Type=Integer,Description="Chr position reported in dbSNP">
##INFO=<ID=HD,Number=0,Type=Flag,Description="Marker is on high density genotyping kit (50K density or greater). The variant may have phenotype associations present in dbGaP.">
##INFO=<ID=PM,Number=0,Type=Flag,Description="Variant is Precious(Clinical,Pubmed Cited)">
##bcftools_annotateCommand=annotate -x FORMAT chr22.vcf; Date=Thu Jul 21 22:27:22 2022
##bcftools_annotateCommand=annotate -x FORMAT --force -Oz chr22_annotate_hg38.vcf.gz; Date=Tue Sep 6 09:31:37 2022
##bcftools_annotateCommand=annotate -x FORMAT chr22_annotate_hg38.vcf.gz; Date=Mon Oct 17 12:20:29 2022
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GTEX-1117F GTEX-111CU GTEX-111FC GTEX-111VG GTEX-111YS GTEX-1122O GTEX-1128S GTEX-113IC GTEX-113JC GTEX-117XS GTEX-117YW GTEX-117YX GTEX-1192W GTEX-1192X GTEX-11DXX GTEX-11DXZ GTEX-11DYG GTEX-11DZ1 GTEX-11EI6 GTEX-11EM3 GTEX-11EMC GTEX-11EQ8 GTEX-11EQ9 GTEX-11GS4 GTEX-11GSO GTEX-11GSP GTEX-11I78 GTEX-11ILO GTEX-11LCK GTEX-11NSD GTEX-11NUK GTEX-11NV4 GTEX-11O72 GTEX-11OF3 GTEX-11ONC GTEX-11P7K GTEX-11P81 GTEX-11P82 GTEX-11PRG GTEX-11TT1 GTEX-11TTK GTEX-11TUW GTEX-11UD1 GTEX-11UD2 GTEX-11VI4 GTEX-11WQC GTEX-11WQK GTEX-11XUK GTEX-11ZTS GTEX-11ZTT GTEX-11ZU8 GTEX-11ZUS GTEX-11ZVC GTEX-1211K GTEX-12126 GTEX-1212Z GTEX-12584 GTEX-12696 GTEX-1269C GTEX-12C56 GTEX-12KS4 GTEX-12WS9 GTEX-12WSA GTEX-12WSB GTEX-12WSD GTEX-12WSE GTEX-12WSF GTEX-12WSG GTEX-12WSH GTEX-12WSI GTEX-12WSJ GTEX-12WSK GTEX-12WSL GTEX-12WSM GTEX-12WSN GTEX-12ZZW GTEX-12ZZX GTEX-12ZZY GTEX-12ZZZ GTEX-13111 GTEX-13112 GTEX-13113 GTEX-1313W GTEX-1314G GTEX-131XE GTEX-131XF GTEX-131XG GTEX-131XH GTEX-131XW GTEX-131YS GTEX-132AR GTEX-132NY GTEX-132Q8 GTEX-132QS GTEX-1339X GTEX-133LE GTEX-1399Q GTEX-1399R GTEX-1399S GTEX-1399T GTEX-1399U GTEX-139D8 GTEX-139T4 GTEX-139T6 GTEX-139TS GTEX-139TT GTEX-139TU GTEX-139UC GTEX-139UW GTEX-139YR GTEX-13CF2 GTEX-13CF3 GTEX-13CIG GTEX-13CZU GTEX-13CZV GTEX-13D11 GTEX-13FH7 GTEX-13FHO GTEX-13FHP GTEX-13FLV GTEX-13FLW GTEX-13FTW GTEX-13FTX GTEX-13FTY GTEX-13FTZ GTEX-13FXS GTEX-13G51 GTEX-13IVO GTEX-13JUV GTEX-13JVG GTEX-13N11 GTEX-13N1W GTEX-13N2G GTEX-13NYB GTEX-13NYC GTEX-13NZ8 GTEX-13NZ9 GTEX-13NZA GTEX-13NZB GTEX-13O21 GTEX-13O3O GTEX-13O3P GTEX-13O3Q GTEX-13O61 GTEX-13OVG GTEX-13OVH GTEX-13OVI GTEX-13OVJ GTEX-13OVK GTEX-13OVL GTEX-13OW5 GTEX-13OW6 GTEX-13OW7 GTEX-13OW8 GTEX-13PDP GTEX-13PL6 GTEX-13PL7 GTEX-13PLJ GTEX-13PVQ GTEX-13PVR GTEX-13QBU GTEX-13QIC GTEX-13QJ3 GTEX-13QJC GTEX-13RTJ GTEX-13RTK GTEX-13RTL GTEX-13S7M GTEX-13S86 GTEX-13SLW GTEX-13SLX GTEX-13U4I GTEX-13VXT GTEX-13VXU GTEX-13W3W GTEX-13W46 GTEX-13X6H GTEX-13X6I GTEX-13X6J GTEX-13X6K GTEX-13YAN GTEX-1445S GTEX-144FL GTEX-144GL GTEX-144GM GTEX-145LU GTEX-145MF GTEX-145MG GTEX-145MH GTEX-145MI GTEX-145MN GTEX-145MO GTEX-146FH GTEX-146FQ GTEX-146FR GTEX-14753 GTEX-1477Z
chr22 10510212 rs1452389754 A T 407.41 VQSRTrancheSNP99.80to99.90 AC=11;AF=0.0240175;AN=458;BaseQRankSum=0.437;ClippingRankSum=0.212;DP=1221;ExcessHet=0;FS=0;InbreedingCoeff=0.1007;MLEAC=9;MLEAF=0.018;MQ=18.93;MQRankSum=-1.855;NEGATIVE_TRAIN_SITE;QD=15.67;ReadPosRankSum=0;SOR=1.697;VQSLOD=-8.428;culprit=DP;ASP;RS=1452389754;RSPOS=10510212;SAO=0;SSR=0;TOPMED=0.65805778542303771,0.34194221457696228;VC=SNV;VP=0x050000000005000002000100;WGT=1;dbSNPBuildID=151 GT ./. ./. 0/0 ./. ./. ./. 0/0 ./. 0/0 ./. ./. ./. ./. ./. ./. 0/0 0/0 ./. ./. ./. ./. ./. ./. 0/0 0/0 ./. ./. ./. ./. 0/0 0/0 ./. ./. ./. ./. ./. ./. 0/0 ./. ./. 0/0 0/0 0/0 ./. 0/0 ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. 0/0 0/0 ./. ./. ./. 0/0 0/0 0/0 ./. 1/1 ./. ./. ./. 0/0 ./. ./. 0/0 ./. ./. ./. 0/0 0/0 0/0 0/0 ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. 0/0 0/0 ./. ./. 0/0 ./. ./. ./. ./. ./. ./. ./. ./. ./. 0/0 ./. 0/0 ./. ./. ./. ./. 0/0 ./. ./. 0/0 0/0 ./. ./. ./. 0/0 ./. 0/0 ./. ./. 0/0 ./. ./. ./. ./. 0/0 ./. ./. 0/0 ./. ./. 0/0 ./. ./. ./. ./. ./. 0/0 0/1 ./. ./. 0/0 0/0 ./. ./. ./. 0/0 ./. ./. ./. ./. ./. ./. ./. 0/0 ./. ./. 0/0 ./. ./. ./. ./. ./. 0/0 0/0 ./. ./. ./. ./. ./. 0/0 ./. 0/0 ./.
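Given a header like the one above, the sample IDs are the columns that follow FORMAT on the #CHROM line, so splitting by population comes down to knowing which population each ID belongs to. The following is a minimal sketch, not a definitive method: it assumes a sample-to-population table is already available (for example from the study's sample metadata), and the labels POP_A and POP_B, the function names, and the output file prefix are all hypothetical.

```python
from collections import defaultdict


def samples_from_header(header_line):
    """Return the sample IDs from a VCF '#CHROM' header line."""
    fields = header_line.strip().split()
    if not fields or fields[0] != "#CHROM":
        raise ValueError("expected the #CHROM header line")
    # The first nine columns (CHROM .. FORMAT) are fixed; samples follow.
    return fields[9:]


def group_by_population(samples, sample_to_pop):
    """Group sample IDs by population; unlabeled samples go to 'UNKNOWN'."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample_to_pop.get(sample, "UNKNOWN")].append(sample)
    return dict(groups)


def write_sample_lists(groups, prefix="samples"):
    """Write one sample ID per line to '<prefix>.<POP>.txt'; return the paths."""
    paths = {}
    for pop, ids in groups.items():
        path = f"{prefix}.{pop}.txt"
        with open(path, "w") as handle:
            handle.write("\n".join(ids) + "\n")
        paths[pop] = path
    return paths


# Shortened header using the first three sample IDs from the post.
header = ("#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT "
          "GTEX-1117F GTEX-111CU GTEX-111FC")

# Hypothetical labels for illustration only; these are NOT real annotations.
sample_to_pop = {
    "GTEX-1117F": "POP_A",
    "GTEX-111CU": "POP_B",
    "GTEX-111FC": "POP_A",
}

groups = group_by_population(samples_from_header(header), sample_to_pop)
```

Each list written by write_sample_lists(groups) can then be passed to bcftools, for example: bcftools view -S samples.POP_A.txt -Oz -o chr22.POP_A.vcf.gz chr22_annotate_hg38.vcf.gz (the output name is made up). If no population labels exist, they would first have to be inferred, typically from a PCA of the genotypes, which this sketch does not cover.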
What does your VCF look like? Could you edit your post and add an example line?
I have added my VCF file header.
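For the elastic net training and cross-validation step mentioned in the question, the usual route is an existing implementation such as cv.glmnet in R or ElasticNetCV in scikit-learn, run once per population subset. The dependency-free sketch below only illustrates what such a fit does: coordinate descent on centered genotype dosages, with the penalty strength chosen by k-fold cross-validation. The function names, the parameter names penalty and l1_ratio, and the toy data are assumptions made for illustration; they are not taken from any particular package.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 part of the update)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, penalty, l1_ratio=0.5, n_iter=200):
    """Fit weights and intercept by coordinate descent on centered data."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    resid = [yi - y_mean for yi in y]  # residuals while all weights are zero
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            z = sum(Xc[i][j] ** 2 for i in range(n)) / n
            if z == 0.0:
                continue  # SNP does not vary in this subset of samples
            rho = sum(Xc[i][j] * (resid[i] + Xc[i][j] * w[j])
                      for i in range(n)) / n
            new_w = (soft_threshold(rho, penalty * l1_ratio)
                     / (z + penalty * (1.0 - l1_ratio)))
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                w[j] = new_w
    intercept = y_mean - sum(w[j] * x_mean[j] for j in range(p))
    return w, intercept


def predict(X, w, intercept):
    """Predict expression from dosages with fitted weights."""
    return [intercept + sum(wj * xj for wj, xj in zip(w, row)) for row in X]


def cv_mse(X, y, penalty, l1_ratio=0.5, k=5):
    """Mean squared error over k folds (fold of sample i is i % k)."""
    n = len(X)
    errors = []
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        w, b = fit_elastic_net([X[i] for i in train_idx],
                               [y[i] for i in train_idx], penalty, l1_ratio)
        preds = predict([X[i] for i in test_idx], w, b)
        errors.extend((y[i] - pr) ** 2 for i, pr in zip(test_idx, preds))
    return sum(errors) / len(errors)


def select_penalty(X, y, penalties, l1_ratio=0.5, k=5):
    """Return the penalty with the lowest cross-validated error."""
    return min(penalties, key=lambda pen: cv_mse(X, y, pen, l1_ratio, k))


# Toy data (made up): 10 samples x 2 SNP dosages; only SNP 1 affects expression.
X = [[0, 1], [1, 0], [2, 1], [0, 2], [1, 1],
     [2, 0], [0, 0], [1, 2], [2, 2], [1, 0]]
y = [1.0 + 0.5 * row[0] for row in X]

weights, intercept = fit_elastic_net(X, y, penalty=0.0)
best_penalty = select_penalty(X, y, [0.0, 0.1, 1.0])
```

In a per-population workflow, X would hold the dosages (0, 1, 2) of the SNPs near one gene for the samples of one population, y that gene's expression in the same samples, and the fit would be repeated for each gene and each population. Missing genotypes such as ./. would need to be imputed or filtered first.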