Question

VCF files: Change Chromosome Notation

22

Entering edit mode

10.7 years ago

Quak ▴ 520

There are two VCF files that I like to merge them, using GATK or VCFtools. The problem is, they have different chromosomal notation, one has Chr, the other does not. This question could be similar to this one

Is there any quick awk/sed commands that you suggest ?! Also I appreciate if you make comment, which of these two (GATK/VCFtools) is more reliable for this task.

vcf next-gen sequence • 74k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 10.7 years ago by Quak ▴ 520

5

Entering edit mode

the awk-based answers below are confusing. Just use bcftools annotate --rename-chrs as highlighted by @jerviedog. This will also work with appropriate subsets of NCBI's assembly_report.txt files

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 4.6 years ago by Yannick Wurm ★ 2.5k

0

Entering edit mode

I am very new at this and ran into a similar but slightly more complicated problem today with the Cryptococcus genome. I think I solved it thanks to help and links posted here (and didn't find a solution elsewhere) so thought I should post it here in case someone comes along with a similar problem. The reference genome I use does not use either numerical (1, 2, 3) or chr (chr1, chr2, chr3) notation, it has wacky chromosome names (CP003827, CP003822 etc.). So to replace my chromosome names in a vcf file to make them numerical I used a series of grep commands in awk:

awk '{gsub(/CP003827/, "8"); gsub(/CP003822/, "3"); gsub(/CP003824/, "5");
      gsub(/CP003834/, "Mt"); gsub(/CP003833/, "14"); gsub(/CP003829/, "10"); 
      gsub(/CP003826/, "7"); gsub(/CP003820/, "1"); gsub(/CP003828/, "9");
      gsub(/CP003825/, "6"); gsub(/CP003821/, "2"); gsub(/CP003830/, "11");
      gsub(/CP003832/, "13"); gsub(/CP003831/, "12"); gsub(/CP003823/, "4");
      print;}' original.vcf > original_newchr.vcf

I had no knowledge of awk before stumbling onto this post so there might be a more elegant way to do this, but this seems to work, which is good enough for me!

ADD REPLY • link updated 5.6 years ago by Ram 44k • written 8.9 years ago by acgerstein • 0

5

Entering edit mode

5.6 years ago

Shicheng Guo ★ 9.6k

for i in {1..22} X Y MT
do
echo "chr$i $i" > chr_name_conv.txt
bcftools annotate --rename-chrs chr_name_conv.txt Schrodi_IL23_IL17_combined_RECAL_SNP_INDEL_variants.VA.chr$i.vcf.gz -Oz -o Schrodi_IL23_IL17_combined_RECAL_SNP_INDEL_variants.VA.chr$i.Minimac4.vcf.gz
tabix -p vcf Schrodi_IL23_IL17_combined_RECAL_SNP_INDEL_variants.VA.chr$i.Minimac4.vcf.gz
done

ADD COMMENT • link 5.2 years ago by Shicheng Guo ★ 9.6k

0

Entering edit mode

This has already been addressed by the comment to the most popular answer (C: VCF files: Change Chromosome Notation) - I don't see why this is fit to be a new answer by itself.

ADD REPLY • link 5.6 years ago by Ram 44k

1

Entering edit mode

It fits because the original answer didn't have the for loop. Which actually saves time.

ADD REPLY • link 4.1 years ago by abdallahamr5 ▴ 20

0

Entering edit mode

The loop is unnecessary for a single VCF file. Plus, this is an inefficient loop that renames chromosome by chromosome, so it wastes resources. Unnecessary and wasteful loops are shell abuse.

ADD REPLY • link 4.1 years ago by Ram 44k

1

Entering edit mode

It is more time efficient than echoing the 25 chromosome names:)

ADD REPLY • link 4.1 years ago by abdallahamr5 ▴ 20

0

Entering edit mode

No it's not. Querying a large VCF file 25 times is a LOT more inefficient than typing 25 brief lines or using xargs/parallel instead of the for loop.

That loop is a disaster.

ADD REPLY • link 4.1 years ago by Ram 44k

1

Entering edit mode

I found having the loop helpful.

ADD REPLY • link 3.0 years ago by Manuel Corpas ▴ 270

1

Entering edit mode

5.4 years ago

carmen.bugarin.diz ▴ 10

I had this problem before and I was able to fix it while running some filters to the file. If you are performing any sort of filtering on those VCF files you could use:

sed 's:chr::g'

Example:

vcftools --gzvcf sample.vcf.gz --keep sample_list.txt --mac 1 --recode-to-stream | sed 's:chr::g' | bgzip -c > new_sample.vcf.gz

ADD COMMENT • link updated 5.4 years ago by Ram 44k • written 5.4 years ago by carmen.bugarin.diz ▴ 10

0

Entering edit mode

6.8 years ago

Alex • 0

Next to the above answers, this resource might be helpful: mappings between common notations (UCSC, Ensembl, Gencode) https://github.com/dpryan79/ChromosomeMappings

You can use this together with

awk 'NR==FNR{map[$1]=$2;next} { for (i=1;i<=NF;i++) $i=($i in map ? map[$i] : $i) } 1' mapping_file.txt input.vcf > output.vcf

to batch rename between common chromosomal notations.

ADD COMMENT • link 6.8 years ago by Alex • 0

0

Entering edit mode

2.6 years ago

JustinZhang ▴ 120

To summarize the answers people wrote before, if you have a bunch of vcf files, you don't want and need to know which has 'chr', neither use additional software. just run this in a shell script to add 'chr':

cat ${input_vcf} | grep '#' | awk '{ if(match($0,/(##contig=<ID=)([1-9].*|MT.*|X.*|Y.*)/,m)) print m[1]"chr"m[2]; else print $0}' > ${ouput_vcf}
cat ${input_vcf} | grep -v '#' | awk '{if(match($0,/^chr*/)) print $0; else print "chr"$0}' >> ${ouput_vcf}

To run it faster but not safely (~~jump~~ skip the file if is correct) (suppose the vcf file is not self-contradictory),

if  [ $(cat ${input_vcf} | grep '##contig=<ID=[^G]' |  sed -n '1p' | cut -c14) != c ]; then

    # catch the non-contig header lines
    cat ${input_vcf} | grep '##' | grep -v '^##contig=<ID=[1-9XYM]'  >${ouput_vcf} &&

    # catch the contig header lines and fix it
    cat ${input_vcf} | grep '##' | grep '^##contig=<ID=[1-9XYM]' | sed 's/##contig=<ID=/##contig=<ID=chr/g' >>${ouput_vcf} &&

    # catch the column name line
    cat ${input_vcf} | grep -v '##' | grep '#' >>${ouput_vcf} &&

    # catch the variant record lines
    cat ${input_vcf} | grep -v '#'  | awk '{print "chr"$0}'  | sort -k1,1 -k2,2n >>${ouput_vcf}

fi

Update: fixed the detail, and you can copy to your ~/.bashrc now.

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 2.6 years ago by JustinZhang ▴ 120

0

Entering edit mode

I guess you mean to "summarize" and not "conclude" - the former makes sense, the latter is just rude. I've removed my comments with that in mind.

In any case and especially for beginners, using well tested tools is better than using custom scripts. I'd recommend bcftools over awk.

ADD REPLY • link 2.6 years ago by Ram 44k

0

Entering edit mode

Following up, I see an error in your commands:

cat ${input_vcf} | grep '##' | grep '#' >>${ouput_vcf} - Unlike what you expect, this won't add just the #CHROM line but every header line except the #CHROM line. Anything that matches ## will automatically match #, and #CHROM won't match ##.

This would not happen if you used bcftools. Everyone (including you and me) that writes scripts makes mistakes. These tools just have been well tested and are widely used.

ADD REPLY • link 2.6 years ago by Ram 44k

Ram · Accepted Answer · 2014-04-22

64

Entering edit mode

10.7 years ago

Vivek ★ 2.7k

awk '{gsub(/^chr/,""); print}' your.vcf > no_chr.vcf

Should get rid of the chr

awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' no_chr.vcf > with_chr.vcf

Will add the chr to the VCF without chr.

You can use either command depending on how the chromosomes are named in your reference.

Regarding preference of tools, if you plan to do downstream processing with GATK, I'd suggest sticking with GATK CombineVariants for consistency.

ADD COMMENT • link updated 5.6 years ago by Ram 44k • written 10.7 years ago by Vivek ★ 2.7k

7

Entering edit mode

Thanks for this Vivek :) Works great!

It does still leave the ID lines in the VCF unmodified though, so I made a little tweak to catch those too:

awk '{ 
        if($0 !~ /^#/) 
            print "chr"$0;
        else if(match($0,/(##contig=<ID=)(.*)/,m))
            print m[1]"chr"m[2];
        else print $0 
      }' no_chr.vcf > with_chr.vcf

I'm also curious if this is ever the right thing to do, since VCFs from genomes with chr (notably the Mouse Genomes Project VCFs) may not be the same as the genome without the chr (notably the UCSC genomes and BAMs mapped to them) even if they both say they are mm10 or something similar. All I'm saying to future reader is to be careful :)

ADD REPLY • link updated 5.6 years ago by Ram 44k • written 8.9 years ago by John 13k

1

Entering edit mode

Yeah, that's bit more sophisticated but I don't think most tools including GATK mind what's in the VCF header as long as the notation in the VCF entries conforms with the reference provided.

ADD REPLY • link 8.9 years ago by Vivek ★ 2.7k

1

Entering edit mode

You're probably right - I was having a different issue with GATK earlier and I thought this might be the reason, but it turned out to be unrelated.

Ho hum. Maybe one day the gods of bioinformatics will decided to drop the "chr"s in all datasets and we can live in peace :P

ADD REPLY • link 8.9 years ago by John 13k

1

Entering edit mode

Or the goddesses will decide to keep "chr"s in all datasets and we can live in peace too (or this could be the cause of another war again?)

ADD REPLY • link 6.4 years ago by Ömer An ▴ 260

0

Entering edit mode

Yes GATK (at least BaseRecalibrator) uses the contigs info headers :-(

The correct sentence to change the header :

awk '{ if(match($0,/(##contig=<ID=)([1-9].*|MT.*|X.*|Y.*)/,m)) print m[1]"chr"m[2]; else print $0}'

This long time problem is very stupid and wastes a lot of time and brain...

ADD REPLY • link 2.7 years ago by nicolas.soriano ▴ 10

0

Entering edit mode

This is the same awk command as in the comment VCF files: Change Chromosome Notation

Why add a new answer? I'll be moving this to a comment because the BaseRecalibrator point is one new piece of information that saves your post from being deleted as a duplicate.

ADD REPLY • link 2.7 years ago by Ram 44k

0

Entering edit mode

Hi Ram, there is a little difference. It adds "chr" only to chromosomes 1 to 9 , X,Y and MT (which should be changed to chrM later...). That avoids adding chr to non canonical contigs (if "by luck" they have the sames names in both files...)

ADD REPLY • link 2.7 years ago by nicolas.soriano ▴ 10

0

Entering edit mode

Good call. I'll move your post as a reply to John's comment. Please edit it and add the detail you mentioned above (adding "chr" to only 1-22|X|Y|MT and not to other contigs)

ADD REPLY • link 2.7 years ago by Ram 44k

0

Entering edit mode

Could you explain a little bit about this part of the command?

{if($0 !~ /^#/)

Because I'm reading it as:

if $0-- if the line contains
!~  -- something NOT EQUAL to
/^#/ -- a number at the beginning of the line

For some reason, I thought it should read the same except without the !

As in

awk '{if($0 ~ /^#/) print "chr"$0; else print $0}' no_chr.vcf > with_chr.vcf

So am I misunderstanding such that ! does not mean "not equal to" and # does not refer to "any number"?

ADD REPLY • link updated 5.6 years ago by Ram 44k • written 8.2 years ago by ms238 • 0

0

Entering edit mode

The !~ operator is not the same as !=. The first specifies a "not-match" operation on a regular expression pattern (in this case, lines that start with a # character). The second operator is a "not-equals" operation on a pair of scalar values, like a pair of numbers or strings. Refer to the documentation for more details: https://www.gnu.org/software/gawk/manual/html_node/Comparison-Operators.html#Comparison-Operators

ADD REPLY • link 8.2 years ago by Alex Reynolds 36k

0

Entering edit mode

Ok, thanks, Alex. I see what it is doing now. It also looks like the formatting in vcf files are not necessarily obvious. As you can see from my subset below, it looks like many lines do not start with a # character, yet, they are not changed to chr1, which is good. Therefore, though they appear to be the start of a line, they are really a continuation of the preceding line. (for example, lines 3 and 4 below.)

1       3006834 .       GA      G       77      PASS    INDEL;DP=419;DP4=5,9,200,205;CSQ=-||||intergenic_variant||||||||        GT:GQ:DP:MQ0F:GP:PL:AN:MQ:DV:DP4:SP:SGB:PV4:FI  1/1:49:38:0:81,49,0:62,43,0:2:41:31:0,7,18,13:21:-0.69311:.:1   1/1:101:30:0:133,101,0:104,90,0:2:60:30:0,0,16,14:0:-0.693097:.:1       1/1:24:11:0:66,24,0:44,16,0:2:45:9:0,2,9,0:17:-0.662043:.:1     1/1:57:19:0:110,57,0:78,44,0:2:60:18:1,0,10,8:0:-0.691153:.:1   ./.:.:.:.:.:.:.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.:.:.:.:.:.:.   1/1:69:20:0:125,69,0:99,60,0:2:60:20:0,0,9,11:0:-0.692067:.:1   ./.:.:.:.:.:.:.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.:.:.:.:.:.:.   1/1:123:40:0:124,133,0:92,120,0:2:58:40:0,0,12,28:0:-0.693145:.:1       ./.:.:.:.:.:.:.:.:.:.:.:.:.:.   1/1:44:15:0:85,44,0:56,33,0:2:59

ADD REPLY • link updated 5.6 years ago by Ram 44k • written 8.2 years ago by ms238 • 0

Ram · Accepted Answer · 2018-08-11

26

Entering edit mode

6.4 years ago

jerviedog ▴ 260

Another way of doing this by using bcftools v1.9:

echo "1 chr1" >> chr_name_conv.txt
echo "2 chr2" >> chr_name_conv.txt
bcftools annotate --rename-chrs chr_name_conv.txt original.vcf.gz | bgzip > rename.vcf.gz

ADD COMMENT • link updated 5.6 years ago by Ram 44k • written 6.4 years ago by jerviedog ▴ 260

Ram · Accepted Answer · 2014-04-22

12

Entering edit mode

10.7 years ago

vlaufer ▴ 290

Give a statistical geneticist an awk line, feed him for a day, teach a statistical geneticist how to awk, feed him for a lifetime... check out this page for find and replace using awk:

http://awk.info/?awk1line

ADD COMMENT • link updated 5.1 years ago by Ram 44k • written 10.7 years ago by vlaufer ▴ 290