Question

Replacing the Chr names and position notions in vcf

1

Entering edit mode

6.8 years ago

umermehar10 ▴ 10

Dear All I have a combined VCF file of few individuals. In this VCF instead of having normal CHR1, chr2 notions for chromosomes it is having the chromosome information as

gi|996703411|ref|NW_015379183.1|, gi|996703411|ref|NW_015379175.1

In this notion NW_015379183.1 corresponds to a specific Chromosome. The same is true for its positions, If I have the chromosome numbers for all gi|996703411|ref|NW_015379183.1| sort of notions how I can replace the chromosome names to the original names.

VCF Chromosome Name replacement • 15k views

ADD COMMENT • link updated 6.8 years ago by Bastien Hervé 5.9k • written 6.8 years ago by umermehar10 ▴ 10

0

Entering edit mode

see VCF files: Change Chromosome Notation

ADD REPLY • link 6.8 years ago by Pierre Lindenbaum 164k

2

Entering edit mode

6.8 years ago

Bastien Hervé 5.9k

As said above bcftools will be the best way to do it. If you want a code version here is mine

Assuming that you have a match_table.txt file like this (separate by tab) :

gi|996703411|ref|NW_015379183.1| chr1

gi|996703411|ref|NW_015379184.1| chr2

gi|996703411|ref|NW_015379185.1| chr3

gi|996703411|ref|NW_015379186.1| chr4

gi|996703411|ref|NW_015379187.1| chr5

Coding version in python :

###Create a dictionnary containing your match_table.txt
match_dict={}
###Open your match table
with open("matching_table.txt", 'r') as match_f:
    ###For each line, you create a key/value item in a dictionnary
    for line in match_f:
        gi_notation = line.rstrip().split("\t")[0]
        chr_notation = line.rstrip().split("\t")[1]
        ###Check if the key doesn't exist in the dictionnary
        if gi_notation not in match_dict:
            match_dict[gi_notation] = chr_notation
        else:
            print("Care, duplicate in matching_table.txt, on : "+str(gi_notation))

###Open your vcf file
new_vcf_file = open("your_new_vcf_file.vcf", "a")
with open("your_vcf_file.vcf", 'r') as vcf_f:
    ###Read it line by line
    headers_chromosome = ""
    for line in vcf_f:
        ###Change VCF dictionnary headers
        if line.startswith('##contig'):
            ###Get chromosome name
            headers_chromosome = line.split("=")[2].split(",")[0]
        ###If your chromosome exist in your dictionnary
        if headers_chromosome in match_dict:
            ###Replace in chromosome name in line
            line = line.replace(headers_chromosome, match_dict[headers_chromosome])
        ###Skip metadata informations
        if line[0] != '#':
            ###Retrieve your chromosome for each line
            current_chromosome = line.split("\t")[0]
            ###If your chromosome exist in your dictionnary
            if current_chromosome in match_dict:
                ###Change the value of your chromosome
                new_vcf_file.write(match_dict[current_chromosome]+"\t"+'\t'.join(line.split("\t")[1:]))
                ###Your chromosome is not in your dictionnary (I write it as it is but you can do something else...)
            else:
                print("This chromosome is not in my matching_table.txt : "+str(current_chromosome))
                new_vcf_file.write(line)
        ###Write unchanged metadata
        else:
            new_vcf_file.write(line)
new_vcf_file.close()

ADD COMMENT • link 6.8 years ago by Bastien Hervé 5.9k

0

Entering edit mode

Dear I am trying with the script you have generously created for me, But I am receiving the following error IndentationError: unexpected indent

    Chr1 = line.rstrip().split("\t")[1]
      
although I am trying to maintain the correct indentation in command

for line in match_f:
    gi_notation = line.rstrip().split("\t")[0]
    chr_notation = line.rstrip().split("\t")[1]

ADD REPLY • link 6.8 years ago by umermehar10 ▴ 10

0

Entering edit mode

I found it, it works well for me now. Note that, it is easier to find an error by providing the error message and the line involved. As Ram said, if you want to go deeper in file manipulation, you have to try coding by your own. Thereby, I let you some comments in the code '###' to have a better understanding on the process. This code will only work for what you ask but you can reuse some line to write a new script, like opening a file (with open("your_vcf_file.vcf", 'r') as vcf_f) or loop on the different lines (for line in match_f)

ADD REPLY • link 6.8 years ago by Bastien Hervé 5.9k

0

Entering edit mode

6.8 years ago

Ram 44k

Create a file with 2 columns - the gi| notation and the notation you want.

Read the above file to a dictionary. Then, read the VCF file in a streaming fashion and substitute each occurrence of the old notation with the new one.

This is the strategy. I'd recommend you coding this yourself so you can use it as a learning experience.

ADD COMMENT • link 6.8 years ago by Ram 44k

score 6 · Accepted Answer · 2018-02-16

6

Entering edit mode

6.8 years ago

Pierre Lindenbaum 164k

use bcftools annotate https://samtools.github.io/bcftools/bcftools.html

with

--rename-chrs file rename chromosomes according to the map in file, with "old_name new_name\n" pairs separated by whitespaces, each on a separate line.

ADD COMMENT • link 6.8 years ago by Pierre Lindenbaum 164k