I have some data in a tab-delimited text file containing SNP calling results from DNA microarrays. It looks like this:
ID P1 P2 chrom position ref CJ048 CJ364 CJ049 CJ369 CJ094 ...
C1_20920 AA CC C1 20920 C AA CA CA CC NA ...
C1_20925 CC TT C1 20925 C CC CT CT TT CT ...
C1_108997 TT CC C1 108997 T TT TC TC CC TC ...
C1_649064 delATTT AA C1 649064 A delATTT ATTT/delATTT AA AA ATTT/delATTT ...
C1_7314118 GG AA C1 7314118 A NA NA NA NA NA ...
C1_7384766 CC NA C1 7384766 A AA/delAA CC AA AA AA ...
C1_10284712 TT insACTC C1 10284712 T T/insACTC T/insACTC T/insACTC T/insACTC T/insACTC ...
...
C2_9230073 AA insG C2 9230073 A A/insG A/insG A/insG AA AA ...
C2_9249942 G/insA GG C2 9249942 G G/insA G/insA G/insA G/insA G/insA ...
...
C3_50109828 GG delGT C3 50109828 G GT/delGT GG GT/delGT GT/delGT GT/delGT ...
...
C4_4465814 insAA TT C4 4465814 T TT insAA TT T/insAA TT ...
...
All samples come from an F2 population derived from a single pair of parents, and the P1 and P2 columns give the parents' genotypes at the corresponding loci. So the first six columns make sense to me, and the remaining columns are the samples' SNP calls.
Although I can read the data, I have no idea what format this is or how to process it further. It looks like HapMap, but there are too many ins/del entries and some unusual notations. It seems impossible to analyze it with existing software or tools, e.g. plink, bcftools, tassel, ...
I suspect this file is in a custom format from a bio company's analysis report. I asked the person who gave me the file, and he said this is all he received. Does anybody have ideas on how to analyze the data further?
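To make the layout concrete, here is a minimal Python sketch of reading such a table. It assumes the file is truly tab-separated with exactly the six leading columns shown above; `read_calls` and `META_COLUMNS` are hypothetical names used only for illustration, and the inline `example` string stands in for the real file.

```python
import csv
import io

# Assumed fixed leading columns, taken from the header shown above.
META_COLUMNS = ["ID", "P1", "P2", "chrom", "position", "ref"]


def read_calls(handle):
    """Yield (meta, calls) for each locus in a tab-delimited table.

    meta  maps the six leading column names to their values.
    calls maps each sample name to its raw genotype string.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_names = header[len(META_COLUMNS):]
    for row in reader:
        if not row:  # skip blank lines
            continue
        meta = dict(zip(META_COLUMNS, row))
        calls = dict(zip(sample_names, row[len(META_COLUMNS):]))
        yield meta, calls


# Small inline stand-in for the real file (first three samples only).
example = (
    "ID\tP1\tP2\tchrom\tposition\tref\tCJ048\tCJ364\tCJ049\n"
    "C1_20920\tAA\tCC\tC1\t20920\tC\tAA\tCA\tCA\n"
    "C1_7314118\tGG\tAA\tC1\t7314118\tA\tNA\tNA\tNA\n"
)
loci = list(read_calls(io.StringIO(example)))
```

For a file on disk, the same function can be given `open(path, newline="")` instead of the `io.StringIO` object.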
Your inference appears to be mostly on point. Since you did not get any answers, here is what ChatGPT says about the example you provided.
However, this isn't a standard VCF file but rather a tabular format with some custom columns. Here's a breakdown of the columns:
Key Observations:
When asked about conversion to VCF format, GPT suggests doing the following:
Original line
VCF data line:
Final VCF structure
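As a rough illustration of what such a conversion could look like, the Python sketch below handles only the plain biallelic SNP rows. It is built on assumptions that are not confirmed by the data provider: that a two-letter call such as `CA` is an unphased diploid genotype, that `NA` means missing, and that the `ref` column can be used directly as the VCF REF allele. Rows containing `ins`/`del` notation are skipped, because their encoding in this file is ambiguous. The function names are hypothetical.

```python
BASES = set("ACGT")


def snp_to_vcf_fields(ref, p1, p2, calls):
    """Return (alt, gts) for a plain biallelic SNP row, else None.

    Assumes two-letter calls are unphased diploid genotypes and that
    "NA" marks a missing call. Indel rows and rows with zero or several
    alternate alleles are not handled and yield None.
    """
    def is_snp_call(call):
        return len(call) == 2 and set(call) <= BASES

    observed = [c for c in [p1, p2, *calls] if c != "NA"]
    if ref not in BASES or not all(is_snp_call(c) for c in observed):
        return None
    alts = sorted({base for c in observed for base in c} - {ref})
    if len(alts) != 1:
        return None  # monomorphic or multiallelic row
    alt = alts[0]

    def gt(call):
        if call == "NA":
            return "./."
        idx = sorted(0 if base == ref else 1 for base in call)
        return f"{idx[0]}/{idx[1]}"

    return alt, [gt(c) for c in calls]


def vcf_data_line(locus_id, chrom, position, ref, alt, gts):
    """Format one VCF data line; QUAL, FILTER and INFO are left as '.'."""
    fixed = [chrom, str(position), locus_id, ref, alt, ".", ".", ".", "GT"]
    return "\t".join(fixed + gts)


# First data row of the example: C1_20920 AA CC C1 20920 C AA CA CA CC NA
result = snp_to_vcf_fields("C", "AA", "CC", ["AA", "CA", "CA", "CC", "NA"])
alt, gts = result
line = vcf_data_line("C1_20920", "C1", 20920, "C", alt, gts)

# An indel row is deliberately skipped by this sketch.
skipped = snp_to_vcf_fields("A", "delATTT", "AA", ["delATTT", "AA"])
```

A real VCF would also need the `##fileformat` meta line and a `#CHROM ... FORMAT` header line listing the sample names, and the indel rows would need a confirmed interpretation before they can be converted.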