How to rename chromosome_position column in a Beagle file and match it with the index fai?
1
0
Entering edit mode
2.2 years ago
beausoleilmo ▴ 600

I need to rename my chromosome_position column since a program that I use don't allow multiple underscores in the chromosome name (causing a parsing issue).

My indexed genome looks like this:

head my.fna.fai
NC_044571.1     115307910       88      80      81
NC_044572.1     151975198       116749435       80      81
NC_044573.1     113180500       270624411       80      81
NC_044574.1     71869398        385219756       80      81
[...]

The bealgle file looks like this:

zcat MM.beagle.gz | head | cut -f 1-3
marker  allele1 allele2
NC_044592.1_3795        G       T
NC_044592.1_3796        G       T
NC_044592.1_3801        T       C
NC_044592.1_3802        G       A
[...]

In R I can get the chromosome and position:

beag = read.table("MM.beagle.gz", header = TRUE)
chr=gsub("_\\d+$", "", beag$marker)
pos=gsub("^[A-Z]*_[0-9]*.[0-9]_", "", beag$marker)

But I'm not able to rename the beagle file in-place. I'd like to rename all contigs in the .fai file from 1:nrow(my.fna.fai) and match it to the beagle file. So in the end the .fai should look like:

head my.fna.fai
1     115307910       88      80      81
2     151975198       116749435       80      81
3     113180500       270624411       80      81
4     71869398        385219756       80      81
[...]

And the beagle file:

zcat MM.beagle.gz | head | cut -f 1-3
marker  allele1 allele2
22_3795        G       T
22_3796        G       T
22_3801        T       C
22_3802        G       A
[...]

where 22_3795 is the concatenation of the contig 22 and the position 3795, separated with an _.

The solution would preferentially be in bash as R is not practical due to the large file size of my final compressed beagle file (>210GB)

chromosome index fai regex beagle • 1.0k views
ADD COMMENT
2
Entering edit mode
2.2 years ago

It might be easiest to first prepare a mapping from your contig names to new names without underscores and then just apply that mapping to your files.

Since it looks like you are working with this genome assembly, you might want to use the short names from there ("chr1", ..., "chr29", "chr4A", "chrZ") rather than your own sequential numbering. You could create a mapping from the GenBank accessions to the shorter sequence names using the file linked from the page above:

curl -s https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/901/933/205/GCF_901933205.1_STF_HiC/GCF_901933205.1_STF_HiC_assembly_report.txt |
    awk 'BEGIN { FS = OFS = "\t" } !/^#/ { print $7, $1 }' > mapfile.txt

The tab-separated mapping file would look like this:

NC_044571.1 chr1
NC_044572.1 chr2
NC_044573.1 chr3
NC_044574.1 chr4
[...]

Then you could use a string-replacing AWK program for applying that mapping to your files. Something like the following might even be generic enough to be useful later too:

#!/usr/bin/awk -f
# Replaces the leftmost longest substring matching REGEXP on each line 
# according to replacements in tab-separated MAPFILE.
#
# Usage: awk -f rename.awk -v regexp=REGEXP -v mapfile=MAPFILE
#
# Options:
#   -v field=N      Replace in tab-separated field number N only
#   -v header=N     Output N header lines without replacing

BEGIN {
    FS = OFS = "\t";
    while (getline < mapfile) {
        map[$1] = $2;
    }
}

FNR > header && match($field, regexp) {
    name = substr($field, RSTART, RLENGTH);
    if (!(name in map)) {
        print "Error: Cannot map: " name > "/dev/stderr"; 
        exit 1; 
    }
    sub(regexp, map[name], $field);
}

{
    print;
}

Then you could apply the mapping to your fasta index file like this:

awk -f rename.awk -v field=1 -v regexp='^NC_[0-9]+\\.[0-9]+' -v mapfile=mapfile.txt my.fna.fai > renamed.fna.fai

The result would be like:

chr1    115307910   88  80  81
chr2    151975198   116749435   80  81
chr3    113180500   270624411   80  81
chr4    71869398    385219756   80  81
[...]

And applying it to the other file:

zcat MM.beagle.gz |
    awk -f rename.awk -v field=1 -v header=1 -v regexp='^NC_[0-9]+\\.[0-9]+' -v mapfile=mapfile.txt |
    gzip > renamed.beagle.gz

Which would yield:

marker  allele1 allele2
chr22_3795  G   T
chr22_3796  G   T
chr22_3801  T   C
chr22_3802  G   A
[...]
ADD COMMENT

Login before adding your answer.

Traffic: 1784 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6