3.7 years ago
Julia_W
I have a txt file, shown below, but it is not correctly formatted to the .vcf standard. Is there any way to convert it to .vcf efficiently, other than editing it manually?
I'm not sure - your file lacks a lot of the information required for VCFs, starting with chromosomal coordinates. Did you look at the VCF specs? You may have a file that contains chromosomal coordinates for your markers, and that file might be a better starting point.
What if I have this genotyping txt file and another txt file that contains the chromosomal coordinates and physical positions of the markers, as shown below? Is it possible to convert this information to a VCF file with some tool? By the way, QUAL, FILTER and INFO in the VCF format can be ignored.
Based on my incomplete knowledge and the limited information you provide, I fear I'm no help - though both look a bit like some roll-your-own format to me, so existing tools are probably out of the question.
That's the good ol' bioinformatics way: connect a roll-your-own format to a standard format using a roll-your-own solution. I recommend Python dictionaries and Biopython as a versatile solution; I could also see this being possible by joining tables with R's dplyr.
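A rough sketch of the dictionary approach, in case it helps. Since the actual files aren't visible here, the layouts are assumptions: a tab-separated coordinates file with `marker`, `chrom` and `pos` columns, and a genotype file with one row per sample and two columns per marker (the duplicated header). The inline strings `COORDS_TXT` and `GENO_TXT` and the function names are made up for illustration.

```python
import csv
import io

# Hypothetical coordinates file: marker, chromosome, position (tab-separated).
COORDS_TXT = "marker\tchrom\tpos\nMarker1\t1\t1050\nMarker2\t1\t2300\n"

# Hypothetical genotype file: one row per sample, two columns per marker.
GENO_TXT = (
    "sample\tMarker1\tMarker1\tMarker2\tMarker2\n"
    "S1\tA\tG\tC\tC\n"
    "S2\tG\tG\tC\tT\n"
)


def load_coords(text):
    """Map marker name -> (chrom, pos)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["marker"]: (row["chrom"], int(row["pos"])) for row in reader}


def load_genotypes(text):
    """Map marker name -> {sample: (allele1, allele2)}.

    csv.reader is used instead of DictReader because the duplicated
    marker headers would overwrite each other as dictionary keys.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, data = rows[0], rows[1:]
    calls = {}
    for row in data:
        sample = row[0]
        # Walk the marker columns two at a time (one pair per marker).
        for col in range(1, len(header), 2):
            marker = header[col]
            calls.setdefault(marker, {})[sample] = (row[col], row[col + 1])
    return calls


coords = load_coords(COORDS_TXT)
calls = load_genotypes(GENO_TXT)

# The "join": keep markers present in both files.
joined = {m: (coords[m], calls[m]) for m in calls if m in coords}
```

With real files you would replace the `io.StringIO(...)` wrappers with `open(path, newline="")`. Markers missing from the coordinates file are silently dropped here, which you may prefer to report instead.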
Quite frankly, I don't even understand how the duplicated marker column header is supposed to connect to your coordinates file. Is Marker1 with value 20 at the same coordinate as Marker1 with value 24? If these are just counts, you'll end up with a file looking much more like the second one, just pivoted by markers. All the information would have to go into the INFO field, which is then ignored. And finally, you lack the nucleotides at the coordinates. There are just too many open questions...
I'm guessing that the marker column is duplicated because it's trying to represent a diploid genome. But yeah, I don't think vcf format can handle alleles abstracted to numbers. It expects exact nucleotide sequences.
Thanks for your recommendation. I will try it. The duplicated marker column, as swbarnes2 said, represents a diploid genotype. You can think of these values as A, T, C and G. The INFO field, as far as I know, can be left as a missing value, specified with a dot ("."). And the FORMAT field will be GT.
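For what it's worth, here is a minimal sketch of writing one VCF data line under those rules (QUAL, FILTER and INFO as ".", FORMAT as GT). The function name `vcf_record` and the sample values are hypothetical. The big assumption is the `ref` argument: the thread has no reference nucleotides, so you would have to look them up in the reference genome, and whatever you pass in here is simply trusted.

```python
VALID = {"A", "C", "G", "T"}


def vcf_record(chrom, pos, marker, ref, sample_calls):
    """Build one tab-separated VCF data line.

    sample_calls is a list of (allele1, allele2) tuples in sample order.
    ref is assumed to be the true reference base (not checked here).
    """
    # Collect ALT alleles in order of first appearance.
    alts = []
    for pair in sample_calls:
        for allele in pair:
            if allele in VALID and allele != ref and allele not in alts:
                alts.append(allele)

    # GT uses allele indices: 0 = REF, 1.. = ALT alleles in order.
    index = {ref: "0"}
    index.update({alt: str(i) for i, alt in enumerate(alts, start=1)})

    # Anything that is not a known allele becomes a missing call (".").
    genotypes = ["/".join(index.get(a, ".") for a in pair) for pair in sample_calls]

    fields = [chrom, str(pos), marker, ref, ",".join(alts) or ".",
              ".", ".", ".", "GT"]  # QUAL, FILTER, INFO left missing
    return "\t".join(fields + genotypes)


header = [
    "##fileformat=VCFv4.2",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
]
line = vcf_record("1", 1050, "Marker1", "A", [("A", "G"), ("G", "G")])
print("\n".join(header + [line]))
```

The genotypes are written unphased (`/`), since nothing in the thread says the two columns per marker are phased. Records also need to be sorted by chromosome and position before most tools will accept the file.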