Question

VCF file extraction

4.5 years ago

C4 ▴ 30

Hi, I have a VCF file from a single-cell DNA dataset and would like to extract the GT and PL tags, as well as the allele frequency, for each barcode/sample, in this order:

Sample_name, GT, PL, AF

I tried bcftools query -f ' %CHROM %POS[\t%GT\t%PL]\n', but it does not give me per-sample information. Any help would be appreciated. Thanks!

SNP next-gen • 1.4k views

ADD COMMENT • link updated 4.5 years ago by Biostar 20 • written 4.5 years ago by C4 ▴ 30

It works here. Are those fields defined in your header, and does your file contain any genotypes?

$ bcftools query -f ' %CHROM %POS[\t%GT\t%PL]\n' ~/src/jvarkit/src/test/resources/rotavirus_rf.vcf.gz | head
 RF01 970   0/0 0,9,47  0/0 0,18,73 0/0 0,18,73 0/0 0,33,116    1/1 95,24,0
 RF02 251   0/0 0,15,57 0/1 31,0,5  0/1 31,0,5  0/0 0,9,42  0/0 0,24,69
 RF02 578   0/0 0,33,122    0/0 0,39,135    0/0 0,39,135    1/1 100,30,0    0/0 0,27,109

ADD REPLY • link 4.5 years ago by Pierre Lindenbaum 166k

Thanks for your response.

It does give me a warning: Contig 'chrX' is not defined in the header. Should I bgzip the VCF file and index it with tabix?

It does run, though, and gives output like this:

chrX    .   .   .   .   .   .   .   .   .   .   .   .   .   .   0/0 0,3,42  .   .   .   .   .   .   .   .   .   .   .   .   .   0/0 0,3,50  .   .   .   .   .   .   .   .   .   .   .   .   .   0/0 0,3,50  .   .   .   .   .   .   .   .   .   .   .   .   .   0/0 0,3,42  .   .   .   .   .   .   0/0 0,3,30  .   .   .   .   .   0

I wanted the output to correspond to each sample_name.

ADD REPLY • link updated 4.5 years ago by Ram 45k • written 4.5 years ago by C4 ▴ 30

OK, I indexed it with tabix. I still get output like this:

chrX 251549 .   .   .   .   .   .   .   .   .   .   .   .   .   0/0 0,3,42  .   .   .   .   .   .   .   .   .   .   .   .   .   0/0 0,3,42  .   .   .   .   .   .   .   .   .   .   .   .   .   0/0 0,3,50  .   .   .   .   .   .   .   .   .   .   .   .   .   0/0 0,3,42  .   .   .   .   .   .   .   .   .   .   .   .   .   1/1 42,3,0  .   .   .   .   .   .   .   .   .   .   .   .   .   .

Don't see sample info anywhere.

ADD REPLY • link updated 4.5 years ago by Ram 45k • written 4.5 years ago by C4 ▴ 30

Don't see sample info anywhere.

It's here: " . . . . . . . . . . . . . 0/0 0,3,42 . . . . . . . . . . . . . 0/0 0,3,42 . . . . . . . . . . . . . 0/0 0,3,50 . . . . . . . . . . . . . 0/0 0,3,42 . . . . . . . . . . . . . 1/1 42,3,0 . . . . . . . . . . . . . ."

ADD REPLY • link 4.5 years ago by Pierre Lindenbaum 166k
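
In the wide output above, each GT/PL pair belongs to one sample, in the order the samples appear in the VCF header (bcftools query -l lists them). For one row per sample instead, the newline can go inside the square brackets, because bcftools query repeats the bracketed part once per sample and %SAMPLE expands to the sample name. A minimal sketch, assuming a hypothetical file named input.vcf.gz and assuming AF is stored as a per-sample FORMAT tag; if AF is a site-level INFO tag in this file, the expression would need adjusting (for example to %INFO/AF):

```shell
#!/bin/sh
# Hypothetical input path (not from the thread); replace with your own VCF.
vcf="${1:-input.vcf.gz}"

# Newline inside [...] gives one output row per sample per site.
# Columns: sample name, CHROM, POS, GT, PL, AF.
# ASSUMPTION: %AF refers to a FORMAT/AF tag that exists in this VCF.
fmt='[%SAMPLE\t%CHROM\t%POS\t%GT\t%PL\t%AF\n]'

# Only run if bcftools is installed and the file exists.
if command -v bcftools >/dev/null 2>&1 && [ -f "$vcf" ]; then
    bcftools query -f "$fmt" "$vcf"
else
    echo "bcftools or $vcf not found; nothing to do" >&2
fi
```

Redirecting the output to a file gives a long-format TSV with one line per sample and site, which is usually easier to load into R than the wide layout.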

I think I had it reversed: the columns correspond to the sample names, hence all the dots. How could I make this more readable, i.e. convert it into a TSV file? Thanks a lot!!

ADD REPLY • link 4.5 years ago by C4 ▴ 30

Your output is already TSV.

ADD REPLY • link 4.5 years ago by Pierre Lindenbaum 166k

Yes, but when I import it into R with read.csv(file, sep="\t"), it isn't in the format I need, with samples and tags. This is the output from my command with the PL tag and the samples in columns. How could I load it into R as a CSV or table for further analysis? Thank you for your help!!

[1]AAACAACGACAGTCTA:PL[2]AAACAACGATGATGAA:PL[3]AAACAACGATTCGCCT:PL[4]AAACATGGACCGTTAA:PL[5]AAACATGGACGTTAGT:PL[6>
...................0,3,42.........................................................................................>
..................................................................................................................>
..................................................................................................................>
............................................................0,3,42................................................>
............................................................0,3,42................................................>
............................................................42,3,0................................................>
..0,3,30.0,3,50...................................................................................................>
..0,3,42....0,3,50.............................................................................................

Also, do dots mean that there is missing data for those samples?

ADD REPLY • link 4.5 years ago by C4 ▴ 30
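
On the dots: in VCF, "." is the missing-value marker, so a "." in a sample's column means no value was recorded for that sample at that site. If the wide table has already been produced, it can also be reshaped into one row per sample after the fact. A rough sketch with awk, assuming tab-separated rows of the form CHROM, POS, then a GT and a PL column per sample, plus a file with one sample name per line in header order (for example from bcftools query -l). The helper name wide_to_long and the demo values are made up for illustration:

```shell
#!/bin/sh
# wide_to_long SAMPLES_FILE
# Reads wide rows on stdin (CHROM, POS, then GT and PL for each sample)
# and prints one row per sample: sample, CHROM, POS, GT, PL.
# Samples whose GT is missing (".", "./." or ".|.") are skipped.
wide_to_long() {
    awk -F '\t' -v OFS='\t' '
        NR == FNR { name[++n] = $1; next }   # first file: one sample name per line
        {
            for (i = 1; i <= n; i++) {
                gt = $(2 * i + 1)            # GT column of sample i
                pl = $(2 * i + 2)            # PL column of sample i
                if (gt == "." || gt == "./." || gt == ".|.") continue
                print name[i], $1, $2, gt, pl
            }
        }
    ' "$1" -
}

# Demo with illustrative values (three barcodes, the first one missing).
samples=$(mktemp)
printf '%s\n' AAACAACGACAGTCTA AAACAACGATGATGAA AAACAACGATTCGCCT > "$samples"
printf 'chrX\t251549\t.\t.\t0/0\t0,3,42\t1/1\t42,3,0\n' | wide_to_long "$samples"
# prints:
# AAACAACGATGATGAA	chrX	251549	0/0	0,3,42
# AAACAACGATTCGCCT	chrX	251549	1/1	42,3,0
rm -f "$samples"
```

Dropping the missing entries keeps the table small for sparse single-cell data; removing the `continue` line would keep them as explicit "." rows instead.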

I actually figured it out. If anyone is looking to do something similar: the R package vcfR (library(vcfR)) can make VCF files quite readable!

ADD REPLY • link 4.5 years ago by C4 ▴ 30