Question

Need help figuring out what format this SNP calling data is in.

3 months ago

yihangzhu • 0

I received a tab-delimited text file containing SNP calling results from DNA microarrays. It looks like this:

ID  P1  P2  chrom   position    ref CJ048   CJ364   CJ049   CJ369   CJ094   ...
C1_20920    AA  CC  C1  20920   C   AA  CA  CA  CC  NA  ...
C1_20925    CC  TT  C1  20925   C   CC  CT  CT  TT  CT  ...
C1_108997   TT  CC  C1  108997  T   TT  TC  TC  CC  TC  ...
C1_649064   delATTT AA  C1  649064  A   delATTT ATTT/delATTT    AA  AA  ATTT/delATTT    ...
C1_7314118  GG  AA  C1  7314118 A   NA  NA  NA  NA  NA  ...
C1_7384766  CC  NA  C1  7384766 A   AA/delAA    CC  AA  AA  AA  ...
C1_10284712 TT  insACTC C1  10284712    T   T/insACTC   T/insACTC   T/insACTC   T/insACTC   T/insACTC   ...
...
C2_9230073  AA  insG    C2  9230073 A   A/insG  A/insG  A/insG  AA  AA  ...
C2_9249942  G/insA  GG  C2  9249942 G   G/insA  G/insA  G/insA  G/insA  G/insA  ...
...
C3_50109828 GG  delGT   C3  50109828    G   GT/delGT    GG  GT/delGT    GT/delGT    GT/delGT    ...
...
C4_4465814  insAA   TT  C4  4465814 T   TT  insAA   TT  T/insAA TT  ...
...

All samples belong to an F2 population derived from a single pair of parents, and the P1 and P2 columns give the parents' genotypes at the corresponding loci. So the first six columns make sense, and the remaining columns are the samples' SNP calls.

Although I can read the data, I have no idea what format this is or how to process it further. It looks like HapMap, but there are too many ins/del entries and some unusual notations. It seems impossible to analyze it with any existing software or tools, e.g. PLINK, bcftools, TASSEL, ...

I suspect this file is in a custom format from a company's analysis report. I asked the person who gave me the file, and he said this is all he received. Does anybody have ideas on how to analyze the data further?

SNP microarray • 373 views

ADD COMMENT • link updated 3 months ago by GenoMax 148k • written 3 months ago by yihangzhu • 0

Your inference appears to be mostly on point. Since you did not get any answers, here is what ChatGPT says about the example you provided.

This isn't a standard VCF file but rather a tabular format with some custom columns. Here's a breakdown of the columns:

ID: This likely represents the variant ID or marker name, such as C1_20920.
P1: The genotype of the first parent (e.g., AA, CC).
P2: The genotype of the second parent.
chrom: The chromosome on which the variant is located (e.g., C1).
position: The genomic position of the variant on the chromosome (e.g., 20920).
ref: The reference allele at this position in the reference genome (e.g., C or T).
CJ048, CJ364, CJ049, CJ369, CJ094, ...: These columns represent the genotypes of different samples at the given position. The entries show the observed alleles in each sample (e.g., AA, CA, CC, NA, where NA could represent missing data).

Key Observations:

Genotype Representation: The genotypes are represented in pairs (e.g., AA, CC, CA, TT, CT, NA). In some cases, you have insertions or deletions represented as delATTT or ATTT/delATTT.
Variants: The data includes single nucleotide polymorphisms (SNPs) as well as small insertions and deletions (indels).
NA: Represents missing data for a particular sample at that position.
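
To make the observations above concrete, here is a minimal Python sketch that sorts each genotype cell into a rough category. The function name `classify_call` and the category labels are my own for illustration, and the rules (plain two-base calls, `ins`/`del` notation, `NA` as missing) are assumptions inferred only from the rows shown in the question, not from any documented format.

```python
import re

# Two plain bases, e.g. "AA" or "CT" (assumed notation for ordinary calls).
BASE_PAIR = re.compile(r"^[ACGT]{2}$")

def classify_call(call: str) -> str:
    """Return 'missing', 'indel', 'bases', or 'other' for one genotype cell."""
    call = call.strip()
    if call in ("", "NA"):
        return "missing"
    # Anything using ins.../del... notation, with or without a slash.
    if "ins" in call or "del" in call:
        return "indel"
    if BASE_PAIR.match(call):
        return "bases"
    return "other"

# One row from the question, with its tabs written out explicitly.
row = ("C1_649064\tdelATTT\tAA\tC1\t649064\tA\t"
       "delATTT\tATTT/delATTT\tAA\tAA\tATTT/delATTT").split("\t")
sample_calls = row[6:]  # the first six columns are ID, P1, P2, chrom, position, ref
categories = [classify_call(c) for c in sample_calls]
print(categories)
```

Counting the categories per row is a quick way to see which markers are plain SNPs and which need separate handling as indels.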

When asked about conversion to VCF format, GPT suggests doing the following:

Original line

C1_20920 AA CC C1 20920 C AA CA CA CC NA

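VCF data line (0 = REF allele C, 1 = ALT allele A, so CC becomes 0/0 and AA becomes 1/1):

C1 20920 C1_20920 C A . . . GT 1/1 0/1 0/1 0/0 ./.

Final VCF structure

##fileformat=VCFv4.2
##source=YourDataSource
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL FILTER INFO FORMAT CJ048 CJ364 CJ049 CJ369 CJ094 ...
C1      20920   C1_20920  C   A   .    .     .    GT    1/1  0/1  0/1  0/0  ./.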
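
If you want to script this for the SNP rows, a rough Python sketch could look like the one below. It assumes the column order shown in the question (ID, P1, P2, chrom, position, ref, then samples), that the ref column is a single base, and that plain calls are exactly two bases. Under the VCF convention that 0 is the REF allele and 1 the first ALT, a row whose ref is C gives CC as 0/0 and AA as 1/1. The helper name `snp_row_to_vcf` is hypothetical, and rows containing any ins/del notation are skipped rather than converted, because the question does not define that notation well enough to build REF/ALT alleles from it.

```python
BASES = set("ACGT")

def is_base_pair(call):
    """True for plain two-base calls such as 'AA' or 'CT' (not 'NA')."""
    return len(call) == 2 and set(call) <= BASES

def snp_row_to_vcf(fields):
    """Convert one SNP row (list of column values) to a VCF data line.

    Returns None for rows that contain ins/del notation anywhere.
    """
    marker, p1, p2, chrom, pos, ref = fields[:6]
    calls = fields[6:]
    if any("ins" in c or "del" in c for c in [p1, p2] + calls):
        return None

    # Collect non-reference alleles in order of first appearance.
    alts = []
    for call in calls:
        if is_base_pair(call):
            for base in call:
                if base != ref and base not in alts:
                    alts.append(base)

    alleles = [ref] + alts  # index 0 = REF, 1.. = ALT
    genotypes = []
    for call in calls:
        if is_base_pair(call):
            a, b = sorted(alleles.index(base) for base in call)
            genotypes.append(f"{a}/{b}")
        else:
            genotypes.append("./.")  # NA or anything unrecognised

    alt_field = ",".join(alts) if alts else "."
    return "\t".join([chrom, pos, marker, ref, alt_field,
                      ".", ".", ".", "GT"] + genotypes)

row = "C1_20920 AA CC C1 20920 C AA CA CA CC NA".split()
vcf_line = snp_row_to_vcf(row)
print(vcf_line)
```

The parent columns P1 and P2 are left out of the output here; they could be appended as two extra samples if the downstream tool should see them.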

ADD REPLY • link 3 months ago by GenoMax 148k