I need some help understanding the structure of a variant call file with Affymetrix data
I have a RAW data file in the following format:
probeset_id CEL_call_code chromosome position rsid
AFFX-SP-000001 CC 10 121336954 rs10466213
AFFX-SP-000002 CG 12 23048418 rs10770943
AFFX-SP-000004 GG 17 56334747 rs11079221
AFFX-SP-000005 GG 11 85910686 rs12285109
AFFX-SP-000006 CG 15 60865412 rs12913890
The problem I have is that some SNPs with the same rsID have different call codes, sometimes at the same position and sometimes at different positions, as in these two examples:
Same position with different CEL call code
probeset_id CEL_call_code chromosome position rsid
AX-96108113 AC 4 6301295 rs1801214
AX-96108115 TC 4 6301295 rs1801214
Different position with different CEL call code
probeset_id CEL_call_code chromosome position rsid
AX-123355923 CACA 7 117642463 rs121908784
AX-96064890 AA 7 117642464 rs121908784
How is this possible, and how do I know which one is the correct CEL call code for SNPs that appear multiple times in the same file?
Thank you, any suggestion would be very much appreciated.
How do I find out which array version it is?
What I need to do is write a script to extract the correct SNPs and their genotypes. Because the same SNP can have different call codes, I don't know how to choose which one is correct.
You mentioned normalization, but shouldn't this file already be normalized?
Well, where did you obtain this data from? The array version is likely stored in the CEL file header information, which may or may not be accessible. What is the ultimate aim of your work?
I need to transform the file from this format
into something like
You can do that in shell scripting using cut or awk.
The problem I have is that some SNPs appear on different rows of the same file with different genotypes, so I don't know which one is the correct genotype.
The above example is a real one: rs1801214 has T/C on one line and A/C on the other. From what I know, one person cannot have different values for the same SNP, which means that one of these values, T/C or A/C, should be ignored or maybe merged!?!
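One way to at least list the affected SNPs before deciding what to do with them is to group the rows by rsID and report every rsID that carries more than one distinct call code. The following is a minimal Python sketch, assuming the file is whitespace-delimited with the header shown in the question. The function name find_conflicts and the inlined sample rows are made up for illustration, and the script does not decide which call is the correct one.

```python
from collections import defaultdict

# Sample rows copied from the question; for a real file, pass an open file
# handle instead (the whitespace-delimited layout is an assumption).
SAMPLE = """probeset_id CEL_call_code chromosome position rsid
AFFX-SP-000001 CC 10 121336954 rs10466213
AX-96108113 AC 4 6301295 rs1801214
AX-96108115 TC 4 6301295 rs1801214
AX-123355923 CACA 7 117642463 rs121908784
AX-96064890 AA 7 117642464 rs121908784
"""


def find_conflicts(lines):
    """Map each rsID with more than one distinct call code to its rows."""
    it = iter(lines)
    header = next(it).split()  # column names from the first line
    rows_by_rsid = defaultdict(list)
    for line in it:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        row = dict(zip(header, fields))
        rows_by_rsid[row["rsid"]].append(row)
    # Keep only rsIDs whose rows disagree on the call code.
    return {
        rsid: rows
        for rsid, rows in rows_by_rsid.items()
        if len({r["CEL_call_code"] for r in rows}) > 1
    }


conflicts = find_conflicts(SAMPLE.splitlines())
for rsid, rows in conflicts.items():
    for r in rows:
        print(rsid, r["probeset_id"], r["chromosome"], r["position"], r["CEL_call_code"])
```

Printing the probeset ID and position next to each call makes it easier to see whether the conflicting rows come from different probes or different coordinates.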
A person could indeed have both genotypes present if they inherited one from their mother and the other from their father. In this case, both of these probes would fluoresce and return signal above the background threshold.
I do not know what your ultimate aim is, so I cannot really comment further. Note that each SNP will have an associated allele frequency, indicating its frequency in a given population.
I don't understand how this is possible. In that case, why do all the genotypes contain two letters (AA or TC in my example above) instead of one?
In your example, the sequences are AC and TC. Each of us carries 2 copies of each autosomal chromosome. Based on the fusion of gametes, one from the mother and one from the father, our DNA can differ at individual bases. For example, at rs1801214, you may inherit A from your mother and T from your father.
Without further information on what you are trying to do and from where you obtained your data, I cannot really help you any further.
Are you saying that when it says AC, it doesn't mean heterozygous A from one parent and C from another parent, but that the probe picked up AC at location 4:6301295-6301296?
I find that hard to believe, as most of the lines in the file I am looking at are two letters that are the same, which would mean that it is giving A/A homozygous when it says AA.
To me it just looks like the file type doesn't put a slash between the two strands.
so that would be GGAG/GGAG, G/G and CTGA/CTGA. You wouldn't need two lines with the same rsID to tell if it was homozygous or heterozygous at each location.
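That reading can be sketched in Python as follows. It assumes, as the comment above suggests but nothing in the thread confirms, that each call code is simply two equal-length alleles written back to back. The helper split_call is hypothetical, and the input GGAGGGAG is my own concatenation of the GGAG/GGAG example, not a value taken from the file.

```python
def split_call(call):
    """Split a concatenated call code into two equal-length alleles.

    Assumption: both alleles have the same length, so the code can be cut
    in half. Codes with an odd length do not fit this model.
    """
    if len(call) == 0 or len(call) % 2 != 0:
        raise ValueError(f"cannot split {call!r} into two equal-length alleles")
    half = len(call) // 2
    return call[:half], call[half:]


for call in ("AA", "TC", "CACA", "GGAGGGAG"):
    first, second = split_call(call)
    zygosity = "homozygous" if first == second else "heterozygous"
    print(f"{call} -> {first}/{second} ({zygosity})")
```

Under this assumption a single row is enough to tell homozygous from heterozygous, but it would not handle alleles of unequal length, such as an insertion paired with a deletion.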