Contig name difference due to reference genome
5.4 years ago
nuketbilgen ▴ 40

Hi everyone,

I have VCF files for 4 feline genomes, but the VCF headers show different contig names. I checked the reference genome line in each header; the two versions are shown below.

reference=file:///ifswh1/BC_COM_P1/F18FTSEUHT0898/CATsxlR/analysis/index/GCF_000181335.3_Felis_catus_9.0_genomic.fa
reference=file:///ifshk5/BC_AS/BC_COM_P0/F19FTSEUHT0354/CATbelR/2016/result/index/felCat9.fa

Two of my genomes were aligned to the first reference, and the other two were aligned to the second. I want to merge these VCFs and run an LD analysis, but I cannot.

How can I solve this? Thanks...

next-gen genome alignment • 1.6k views

Are they the same genome builds?


A quick Google search yielded: felCat9.fa (UCSC Genome Browser) and GCF_000181335.3_Felis_catus_9.0_genomic.fa (NCBI).


Yes, exactly. When I split the VCF files by chromosome with the SnpSift split command, I got 40 files for the felCat9.fa-aligned VCFs and 426 files for the NCBI-aligned ones. I am worried about losing important variants...


I think the Biostars community needs more information to help you, such as how the VCF files were produced. If the only difference is in naming, a quick regular expression or search-and-replace command can change the column 1 value from the old, undesired name to the new, desired name.

perl -pe "s/oldname/newname/g" input.vcf > output.vcf

Note that the above command assumes that oldname occurs only in column 1 of the VCF file (and in the matching contig line of the header).
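If oldname might also appear in other columns or in unrelated header lines, a more targeted rename is safer. The following is a minimal Python sketch, not a tested pipeline: the function name `rename_contigs` and the mapping are hypothetical, and it assumes uncompressed, tab-separated VCF text whose contig header lines start with `##contig=<ID=`.

```python
import re

# Matches the ID at the start of a ##contig header line.
# Assumption: ID is the first key inside the angle brackets.
CONTIG_HEADER = re.compile(r"^(##contig=<ID=)([^,>]+)")


def rename_contigs(lines, mapping):
    """Yield VCF lines with contig names replaced according to mapping.

    Only the CHROM column of records and the ID of ##contig header lines
    are changed; the same string anywhere else is left untouched.
    """
    for line in lines:
        if line.startswith("##contig="):
            # Rewrite the ID inside ##contig=<ID=...> if it is in the mapping.
            yield CONTIG_HEADER.sub(
                lambda m: m.group(1) + mapping.get(m.group(2), m.group(2)),
                line,
            )
        elif line.startswith("#"):
            # All other header lines pass through unchanged.
            yield line
        else:
            # Record line: replace only the first (CHROM) column.
            chrom, sep, rest = line.partition("\t")
            yield mapping.get(chrom, chrom) + sep + rest


def rename_vcf_file(in_path, out_path, mapping):
    """Apply rename_contigs to a file on disk (paths are placeholders)."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(rename_contigs(src, mapping))
```

For example, `rename_vcf_file("input.vcf", "output.vcf", {"oldname": "newname"})` would mirror the perl one-liner while leaving other columns alone. Renaming only makes sense when the two contigs really are the same sequence with the same coordinates.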


Hi again, the VCF files were generated with the GATK HaplotypeCaller walker.

Haplotype calling:

java -jar GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T HaplotypeCaller -R all.chrs.con.fa -L TEST_Chr01 -I aligned_reads.sorted.dedup.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o TEST_Chr01.gvcf

You can find examples of the contig lines below. These contigs also carry variants: where one file has a variant on "##contig=<ID=chrA1_NW_019365239v1_random,length=46965>", the same variant is located on "##contig=<ID=chrA1_random,length=415283>" in the other two files. So the chromosome names for SNPs at the same position differ as well...

Contig examples from the first two files:

##contig=<ID=chrA1,length=242100913>
##contig=<ID=chrA1_random,length=415283>
##contig=<ID=chrA2,length=171471747>
##contig=<ID=chrA2_random,length=1187422>

Contig examples from the other two files:

##contig=<ID=chrA1,length=242100913>
##contig=<ID=chrA1_NW_019365239v1_random,length=46965>
##contig=<ID=chrA1_NW_019365240v1_random,length=58068>
##contig=<ID=chrA1_NW_019365241v1_random,length=50743>
##contig=<ID=chrA1_NW_019365242v1_random,length=22574>
##contig=<ID=chrA1_NW_019365243v1_random,length=50951>
##contig=<ID=chrA1_NW_019365244v1_random,length=50765>
##contig=<ID=chrA1_NW_019365245v1_random,length=14920>
##contig=<ID=chrA1_NW_019365246v1_random,length=45003>
##contig=<ID=chrA1_NW_019365247v1_random,length=40320>
##contig=<ID=chrA1_NW_019365248v1_random,length=25974>
##contig=<ID=chrA2,length=171471747>
...
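To see exactly which contigs the two sets of files share, the header lines can be compared programmatically. This is only a sketch under assumptions: the helper names `contigs_from_header` and `compare_contigs` are made up for illustration, and the headers are assumed to follow the standard `##contig=<ID=...,length=...>` form. The sample lines are taken from the examples above.

```python
import re

# Assumption: ID comes first and length, if present, directly follows it.
CONTIG_RE = re.compile(r"^##contig=<ID=([^,>]+)(?:,length=(\d+))?")


def contigs_from_header(lines):
    """Return {contig_name: length or None} from VCF header lines."""
    contigs = {}
    for line in lines:
        if not line.startswith("#"):
            break  # the header ends at the first record line
        match = CONTIG_RE.match(line)
        if match:
            length = match.group(2)
            contigs[match.group(1)] = int(length) if length else None
    return contigs


def compare_contigs(a, b):
    """Report shared and file-specific contig names as sorted lists."""
    shared = sorted(set(a) & set(b))
    return {
        "shared": shared,
        "only_a": sorted(set(a) - set(b)),
        "only_b": sorted(set(b) - set(a)),
        # Shared names with different lengths hint at different assemblies.
        "length_mismatch": [name for name in shared if a[name] != b[name]],
    }


# Header lines copied from the two examples in this thread.
header_a = [
    "##contig=<ID=chrA1,length=242100913>",
    "##contig=<ID=chrA1_random,length=415283>",
]
header_b = [
    "##contig=<ID=chrA1,length=242100913>",
    "##contig=<ID=chrA1_NW_019365239v1_random,length=46965>",
]
result = compare_contigs(contigs_from_header(header_a),
                         contigs_from_header(header_b))
print(result)
```

Contigs that appear in only one set cannot be reconciled by renaming alone, because their lengths (and therefore coordinates) differ between the two reference files.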


I know it's a long shot, but would you suggest merging the per-contig files by chromosome, like this?

I=PasaHardFiltered.chrA1_NW_019365239v1_random.vcf I=PasaHardFiltered.chrA1_NW_019365240v1_random.vcf I=PasaHardFiltered.chrA1_NW_019365241v1_random.vcf I=PasaHardFiltered.chrA1_NW_019365243v1_random.vcf I=PasaHardFiltered.chrA1_NW_019365244v1_random.vcf I=PasaHardFiltered.chrA1_NW_019365246v1_random.vcf I=PasaHardFiltered.chrA1_NW_019365247v1_random.vcf I=PasaHardFiltered.chrA1_NW_019365248v1_random.vcf O=PasaHardFilteredchrA1random.vcf