I have obtained a genome matrix file for F1 hybrids of the model organism I am working with. My main goal is to identify and extract the polymorphism data from this file. However, I have not found any tool on the web (via Google search) that can handle this type of data. I also have other problems: the file came gzipped (819 MB), expanded to 14.5 GB on extraction, and has a .txt extension. I tried opening it in gvim just to see what kind of information and header it has, but that froze my computer. I am using Ubuntu with 12 GB of RAM and 10 GB of swap, which does not seem to be enough for this file.
Thanks in advance
New Question added:
I have been able to inspect the data structure and extract a portion of the file. Thank you all for the help.
A portion of the data is shown below (only 9 of the 34 columns are shown; the remaining 25 are omitted for readability). The file is tab separated.
1 1 T N N N N N N
1 2 C N N N N N N
1 3 A N N N N N N
1 5 T N N N N N N
..
..
1 99 C C C C C N C
1 100 A A A A A N A
1 101 T T T T T T T
..
and so on.
The columns contain the following information:
- Column #1 - chromosome number; there are 8 chromosomes altogether
- Column #2 - read position
- Column #3 - nucleotide base at that position in the reference genome
- Columns #4 to #34 - nucleotide base at that position in each sample, which may be the same as or different from the reference (column #3). Most of the bases in the excerpt match the reference, but elsewhere they vary. N indicates a missing base.
Questions:
I want to create a VCF file in the following format and with the following header (tab separated).
CHROM POS REF ALT FREQ
So, I need to check whether any allele differs from the reference. If none does, the ALT field will be blank for that read position; otherwise it will hold one or more alternate alleles. Could anyone suggest which tools would be useful for this purpose? Brief guidance on a script or an example would also be appreciated.
When calculating the allele frequency, I want to leave the missing bases out of the calculation. For example, suppose there were 34 biological samples with 10 reference alleles, 4 missing (N), 10 of alternate allele type 1, and 10 of alternate allele type 2. I want the allele frequencies of the reference (0.33), alt type 1 (0.33), and alt type 2 (0.33) to be based solely on the 30 available reads, excluding the missing ones.
Thanks much in advance
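One possible approach, sketched in Python, is to stream the file one row at a time so that memory use stays small regardless of file size. This is only a sketch under the assumptions stated in the question: tab-separated rows, chromosome, position, and reference base in the first three columns, sample calls from column 4 onward, and N for a missing base. The function names `summarize_row` and `convert` are made up for illustration, and the second row of the demo input is invented test data, not taken from the real file.

```python
import io
from collections import Counter

MISSING = "N"  # missing-call symbol, as described in the question


def summarize_row(fields):
    """Return (chrom, pos, ref, alts, freqs) for one tab-split row.

    freqs maps each observed allele (reference included) to its share of
    the non-missing sample calls; alts lists the non-reference alleles.
    """
    chrom, pos, ref = fields[0], fields[1], fields[2]
    calls = [base for base in fields[3:] if base != MISSING]
    counts = Counter(calls)
    total = len(calls)
    freqs = {allele: n / total for allele, n in counts.items()} if total else {}
    alts = sorted(allele for allele in counts if allele != ref)
    return chrom, pos, ref, alts, freqs


def convert(lines, out):
    """Stream rows from `lines` and write CHROM POS REF ALT FREQ to `out`."""
    out.write("CHROM\tPOS\tREF\tALT\tFREQ\n")
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, pos, ref, alts, freqs = summarize_row(fields)
        # FREQ lists the reference frequency first, then each ALT in order;
        # it is left blank when every sample call at the site is missing.
        if freqs:
            freq_text = ",".join(f"{freqs.get(a, 0.0):.2f}" for a in [ref] + alts)
        else:
            freq_text = ""
        out.write(f"{chrom}\t{pos}\t{ref}\t{','.join(alts)}\t{freq_text}\n")


# Small in-memory demo; the second row is invented to show a variant site.
demo = "1\t99\tC\tC\tC\tC\tC\tN\tC\n1\t100\tA\tA\tG\tG\tN\tT\tA\n"
result = io.StringIO()
convert(io.StringIO(demo), result)
print(result.getvalue())
```

For the real data, `convert` can be given any open text handle, for example one returned by `gzip.open(path, "rt")` from the standard library, so the 14.5 GB file never has to be held in memory. Note that the output is a simplified table with the requested five columns rather than a full VCF with its standard header lines.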
Please define "handling this type of data". Most Linux/Unix tools can do the job by streaming the data through a pipeline.
By "handling this type of data" I mean whether there is any tool that can take a genome matrix file and run analyses on it. My Google searches did not turn up any useful information on tools for analyzing genome matrix files.
Additionally, I cannot view the data structure (because the file is so large), which would have helped me understand what it really contains, so I was asking whether there is some way to extract a small portion of this large file.
Sorry for any confusion I may have caused.
The method I use to get a view of the data's structure is to run
head -n 10 <file>
and then increase -n gradually (10 -> 50 -> 100) until I understand how the data is structured.
I will try it once I get a chance to access the data. Thanks very much for the code.
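The same kind of peek can also be done on the gzipped download directly, so the extracted text file never has to be opened in an editor. The following is a minimal Python sketch, assuming the archive is a single gzipped text file; the helper names `peek` and `write_head` and the file names in the comments are placeholders, not part of any existing tool.

```python
import gzip
from itertools import islice


def peek(path, n=10):
    """Return the first n lines of a gzipped text file.

    gzip.open decompresses as it reads, so only the start of the
    archive is processed.
    """
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


def write_head(path, out_path, n=100000):
    """Copy the first n lines into a small plain-text file for inspection."""
    with gzip.open(path, "rt") as src, open(out_path, "w") as dst:
        dst.writelines(islice(src, n))


# Example calls (placeholder file names):
# for line in peek("genome_matrix.txt.gz", 10):
#     print(line)
# write_head("genome_matrix.txt.gz", "matrix_head.txt", 1000)
```

The small file written by `write_head` should open comfortably in gvim or any other editor.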