I have a list of genes, and I want to extract the SNPs and indels that fall within the gene coordinates from my VCF file (which I generated using the GATK pipeline). The list of gene coordinates:
Gene Name Accession_no. Start_Position End_Position Strand
Rv0194 NC_000962.3 226878 230462 +
I was looking at bedtools, but it asks for the genes in .bed format as well as a .bed of the BAM files. How can I do this? Or are there any other options/tools/scripts?
For example, I tried tabix:
bgzip ERR038736_UnifiedGenotyper_variants_raw_snp.vcf
tabix ERR038736_UnifiedGenotyper_variants_raw_snp.vcf.gz
tabix ERR038736_UnifiedGenotyper_variants_raw_snp.vcf.gz AL123456.3:226878-230462 > Rv0194
This gave me variants like these:
AL123456.3 227098 . T C 6730.77 . AC=2;AF=1.00;AN=2;DP=172;Dels=0.
AL123456.3 228069 . G A 7132.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=-
AL123456.3 228168 . G C 6682.77 . AC=2;AF=1.00;AN=2;DP=171;Dels=0.
But this is not a VCF file, and I can only extract one region at a time. I want to extract all variants for a list of coordinates and store them in a VCF output.
Can anyone help me with this?
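On the bedtools route mentioned in the question: bedtools can intersect a VCF directly with a BED file of gene intervals, so no BAM-derived BED should be needed. The following is a minimal sketch, not a tested pipeline. It assumes the gene table is tab-separated and saved as genes.txt (a hypothetical file name), and that the VCF uses the contig name AL123456.3, as in the tabix command above, rather than the NC_000962.3 accession from the table. BED starts are 0-based, hence the start minus one.

```shell
# Hypothetical tab-separated gene table (file name and layout are assumptions).
printf 'Gene Name\tAccession_no.\tStart_Position\tEnd_Position\tStrand\n' > genes.txt
printf 'Rv0194\tNC_000962.3\t226878\t230462\t+\n' >> genes.txt

# Build a BED file: contig, 0-based start, end, gene name.
# The contig is overridden because the VCF in the question uses AL123456.3,
# not the NC_000962.3 accession listed in the table.
awk -F'\t' -v OFS='\t' -v contig='AL123456.3' \
    'NR > 1 { print contig, $3 - 1, $4, $1 }' genes.txt > genes.bed

cat genes.bed

# -header keeps the VCF header; -u reports each variant once even if
# several gene intervals overlap it.
# Runs only when bedtools and the input VCF are actually present.
vcf=ERR038736_UnifiedGenotyper_variants_raw_snp.vcf.gz
if command -v bedtools >/dev/null 2>&1 && [ -f "$vcf" ]; then
    bedtools intersect -header -u -a "$vcf" -b genes.bed > genes_variants.vcf
fi
```

Adding more genes to genes.txt gives one BED line per gene, and the single intersect call then covers all of them.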
Can you please explain what -f2-4 is doing, and what -n+2 is doing?
Sure :)
cut -f2-4
selects columns 2 to 4, which contain the Accession_no., start and end position.
tail -n+2
prints everything from the second line onward. This is necessary to get rid of the header line.
fin swimmer
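To see both steps in action, here is a small sketch on a one-gene table. The file names genes.txt and regions.txt are hypothetical, and the table is assumed to be tab-separated, since cut splits on tabs by default.

```shell
# Hypothetical tab-separated gene table, mirroring the layout in the question.
printf 'Gene Name\tAccession_no.\tStart_Position\tEnd_Position\tStrand\n' > genes.txt
printf 'Rv0194\tNC_000962.3\t226878\t230462\t+\n' >> genes.txt

# tail -n+2 drops the header line; cut -f2-4 keeps columns 2 to 4
# (accession, start, end).
tail -n+2 genes.txt | cut -f2-4 > regions.txt

cat regions.txt
```

The resulting regions.txt holds one tab-separated line per gene with the accession, start and end.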
Oh great, that's superb, thanks!
Hello angelshiza,
If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work. This will aid others who have similar problems in the future.
It does not give the output in VCF format, and I want a VCF file in the end.
Is this a comment to my answer?
tabix will output a VCF. If you would like to output the header data as well, type:
fin swimmer
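A sketch of a command along those lines, assuming tabix's -h option (print the header lines) and -R option (read regions from a tab-separated file). The file names regions.txt and selected_genes.vcf are hypothetical, and the contig in the regions file is assumed to match the CHROM column of the VCF (AL123456.3 in this thread).

```shell
# Hypothetical regions file: contig, start, end (tab-separated, 1-based).
# The contig must match the CHROM column of the VCF.
printf 'AL123456.3\t226878\t230462\n' > regions.txt

vcf=ERR038736_UnifiedGenotyper_variants_raw_snp.vcf.gz
out=selected_genes.vcf

# -h prints the VCF header lines, -R restricts output to the listed regions.
# Runs only when tabix, the bgzipped VCF and its index are present.
if command -v tabix >/dev/null 2>&1 && [ -f "$vcf" ] && [ -f "$vcf.tbi" ]; then
    tabix -h -R regions.txt "$vcf" > "$out"
else
    echo "tabix or $vcf not available; skipping extraction" >&2
fi
```

With more lines in regions.txt, the same call extracts all regions at once into a single file with a header.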
Oh that is great thank you!