I need to rename my chromosome_position column because a program that I use doesn't allow multiple underscores in the chromosome name (they cause a parsing issue).
My indexed genome looks like this:
head my.fna.fai
NC_044571.1 115307910 88 80 81
NC_044572.1 151975198 116749435 80 81
NC_044573.1 113180500 270624411 80 81
NC_044574.1 71869398 385219756 80 81
[...]
The beagle file looks like this:
zcat MM.beagle.gz | head | cut -f 1-3
marker allele1 allele2
NC_044592.1_3795 G T
NC_044592.1_3796 G T
NC_044592.1_3801 T C
NC_044592.1_3802 G A
[...]
In R I can get the chromosome and position:
beag = read.table("MM.beagle.gz", header = TRUE)
chr=gsub("_\\d+$", "", beag$marker)
pos=gsub("^[A-Z]+_[0-9]+\\.[0-9]+_", "", beag$marker)
But I'm not able to rename the beagle file in-place. I'd like to rename all contigs in the .fai file to 1:nrow(my.fna.fai), i.e. number them sequentially in file order, and apply the same renaming to the beagle file. So in the end the .fai should look like:
head my.fna.fai
1 115307910 88 80 81
2 151975198 116749435 80 81
3 113180500 270624411 80 81
4 71869398 385219756 80 81
[...]
And the beagle file:
zcat MM.beagle.gz | head | cut -f 1-3
marker allele1 allele2
22_3795 G T
22_3796 G T
22_3801 T C
22_3802 G A
[...]
where 22_3795 is the concatenation of the contig 22 and the position 3795, separated with an _.
The solution would preferably be in bash, as R is not practical given the large size of my final compressed beagle file (>210GB).
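To make the desired transformation concrete, here is a minimal awk sketch of what I have in mind. It assumes both files are tab-delimited and that every marker has the form contig_position; the function names and the output file names in the usage note are hypothetical, and I have not run it on the full data set.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, input is assumed tab-delimited.

# Replace column 1 of a .fai index with its 1-based line number.
renumber_fai() {
    awk 'BEGIN { FS = OFS = "\t" } { $1 = NR; print }' "$1"
}

# Read a beagle file on stdin and rewrite the marker column (contig_position)
# using the contig order found in the original .fai given as $1.
rename_beagle_markers() {
    awk '
        BEGIN { FS = OFS = "\t" }
        NR == FNR { id[$1] = FNR; next }   # .fai: contig name -> line number
        FNR == 1  { print; next }          # beagle header line, unchanged
        {
            i = match($1, /_[0-9]+$/)      # start of the trailing "_position"
            contig = substr($1, 1, i - 1)
            if (!(contig in id)) {
                print "unknown contig: " contig > "/dev/stderr"
                exit 1
            }
            $1 = id[contig] substr($1, i)  # e.g. 22 followed by _3795
            print
        }
    ' "$1" -
}
```

It would be used as: zcat MM.beagle.gz | rename_beagle_markers my.fna.fai | gzip > MM.renamed.beagle.gz, followed by renumber_fai my.fna.fai > my.renamed.fna.fai. The beagle file is streamed line by line, so only the contig lookup table is held in memory, and the original .fai has to stay untouched until the beagle file has been processed.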
Cross-posted: https://stackoverflow.com/questions/73975690/how-to-rename-chromosome-position-column-in-a-beagle-file-and-match-it-with-the