I have been able to inspect the data structure and extract a portion of the file. Thanks for all the help so far.
A portion of the data is shown below (only 9 of the 34 columns are shown; the remaining 25 are omitted for ease of viewing). The file is tab separated.
1 1 T N N N N N N
1 2 C N N N N N N
1 3 A N N N N N N
1 5 T N N N N N N
contd.....
1 99 C C C C C N C
1 100 A A A A A N A
1 101 T T T T T T T
and so on.
("contd....." is added only to show that the data continue between the rows shown.)
The columns contain the following information:
- Column #1 - Chromosome number, and there are altogether 8 chromosomes
- Column #2 - read position
- Column #3 - nucleotide base at that particular position for the reference genome
- Columns #4 to #34 - the nucleotide base at that position in each sample, which may be the same as or different from the reference (column #3). Most of the bases shown match the reference, but they differ at some positions. N indicates a missing base.
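To make the layout concrete, here is a minimal sketch in Python of how one row could be split into these fields. The names `SiteRow` and `parse_row` are hypothetical and used only for illustration; the sketch assumes every line is tab separated, with the sample calls starting at column #4.

```python
from typing import List, NamedTuple


class SiteRow(NamedTuple):
    """One line of the input table (hypothetical container for illustration)."""
    chrom: str        # column #1: chromosome number
    pos: int          # column #2: position
    ref: str          # column #3: reference base
    calls: List[str]  # columns #4 onward: one base per sample, "N" if missing


def parse_row(line: str) -> SiteRow:
    # Split on tabs; the first three fields are fixed, the rest are sample calls.
    fields = line.rstrip("\n").split("\t")
    return SiteRow(fields[0], int(fields[1]), fields[2], fields[3:])


# Example using the row at position 99 from the excerpt above.
row = parse_row("1\t99\tC\tC\tC\tC\tC\tN\tC")
```

Keeping the sample calls as a plain list makes it easy to count alleles per position afterwards.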
Question:
I want to create a VCF file in the following format, with the following tab-separated header:
CHROM POS REF ALT FREQ
So, I need to check whether any allele differs from the reference. If none does, the ALT field will be blank for that position; otherwise there will be one or more alternate alleles at that position. Could anyone suggest which tools would be useful for this purpose? Also, please give me brief guidance in the form of a script or example.
When calculating the allele frequency, I want to leave the missing bases out of the calculation. For example, if there were 34 biological samples with 10 reference, 4 missing (N), 10 alternate type 1 and 10 alternate type 2 alleles, I want the allele frequencies of the reference (0.33), alternate type 1 (0.33) and alternate type 2 (0.33) to be based solely on the 30 available reads (excluding the missing ones).
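The calculation described here could be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `allele_frequencies` and `format_site` are hypothetical, and the layout of the FREQ field (reference first, then the alternates in alphabetical order, comma separated) is my own assumption, not something defined by the VCF specification.

```python
from collections import Counter


def allele_frequencies(ref, calls, missing="N"):
    """Frequency of each observed allele, ignoring missing calls."""
    observed = [c for c in calls if c != missing]
    if not observed:
        return {}  # nothing was called at this position
    counts = Counter(observed)
    counts.setdefault(ref, 0)  # report the reference even if no sample carries it
    return {allele: n / len(observed) for allele, n in counts.items()}


def format_site(chrom, pos, ref, calls):
    """Build one tab-separated CHROM POS REF ALT FREQ line (assumed layout)."""
    freqs = allele_frequencies(ref, calls)
    alts = sorted(a for a in freqs if a != ref)
    alt_field = ",".join(alts)  # stays blank when no sample differs from REF
    freq_field = ",".join(f"{freqs[a]:.2f}" for a in [ref] + alts) if freqs else ""
    return "\t".join([chrom, str(pos), ref, alt_field, freq_field])


# 10 reference calls, 10 of each of two alternates, plus a few missing calls.
example_calls = ["C"] * 10 + ["T"] * 10 + ["G"] * 10 + ["N"] * 4
example_line = format_site("1", 99, "C", example_calls)
```

Because the missing calls are dropped before counting, the denominator is the number of called samples only, so each of the three alleles in the example gets one third.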
Thanks much in advance