Question

SNP position in sample sequences

0

Entering edit mode

9.1 years ago

chandimafdo • 0

Hi,

I'm doing a SNP identification using samtools mpileup tool. I use one reference file and 3 sample files ( all s bam files are run together so I get one vcf file ). I was able to get the .vcf file as the output. The POS in

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT

gives me the SNP position in the reference genome. However I want get the SNP positions at each of the sample files. Does mpileup has a command in it ? If not, what can I do? ( I read about -D, but it gives the sample read depth ). I use Python as the programming language.

mpileup SNP samtools • 2.8k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.1 years ago by chandimafdo • 0

1

Entering edit mode

9.1 years ago

sunhanice ▴ 240

There should be several columns after FORMAT column, each containing one sample.

For example, in this line of vcf:

5       11956   snp_5_11956     G       C       100.00  PASS    AA=.;AC=56;AF=0.03;AFR_AF=0.11;AMR_AF=0.01;AN=2184;AVGPOST=0.9907;DAF_GLOBAL=.;ERATE=0.0005;GERP=.;LDAF=0.0273;RSQ=0.8888;SNPSOURCE=LOWCOV;THETA=0.0186;VT=SNP;ANNOTATION_CLASS=ACTIVE_CHROM;CELL=GM12878;CHROM_STATE=15        GT:GL:DS:PP:BD  0/0:.:.:1,0,0:0 0/1:.:.:0.031,0.969,0:0.969

There are two samples. One is 0/0:.:.:1,0,0:0 and the other is 0/1:.:.:0.031,0.969,0:0.969.

The values separated by colon are indicated by the FORMAT column, GT:GL:DS:PP:BD

ADD COMMENT • link updated 5.0 years ago by Ram 44k • written 9.1 years ago by sunhanice ▴ 240

0

Entering edit mode

Thank you for the reply, However I read the vcf specification for genotype data 1.4.2. But none of the fields contain the position of the sample sequence at each SNP.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.1 years ago by chandimafdo • 0

0

Entering edit mode

Each line has one position. Some samples at this position have SNP (0/1, 1/1), and some samples not (0/0).

ADD REPLY • link 9.1 years ago by sunhanice ▴ 240

0

Entering edit mode

9.1 years ago

cpad0112 21k

If you already have the 3 VCF files (from each sample), try merging VCFs. There are several tools to merge VCFs.

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.1 years ago by cpad0112 21k

Ram · Accepted Answer · 2015-10-07

2

Entering edit mode

9.1 years ago

Pierre Lindenbaum 164k

run mpileup with the 3 bams (!= run 3 mpileups with one bam)

ADD COMMENT • link 9.1 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thank you for the reply. I do run with the 3 bams.

bin/samtools mpileup -g -f "/reference.fasta" -d 8000 "S1.bam" "S2.bam" "S3.bam"

Assume,

reference : CTAAGTA ( position would be 4, this can be found under #CHROM POS in vcf output file )
---------------^
Sample 1: -CTACGTA ( what would be the position here ? is it same as reference or different? I need the position of C this case )
--------------^
Sample 2: -CTATGTA ( similar to the above, what would be the position of T in sample2 sequence )
--------------^

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.1 years ago by chandimafdo • 0

0

Entering edit mode

Could you please post 1 or 2 lines from your output? (without header lines (starting with ##), if it is VCF).

ADD REPLY • link 9.1 years ago by cpad0112 21k

0

Entering edit mode

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  /sorted_bam/A01.bam sorted_bam/A02.bam
FHUJM:00049:00085   89  .   T   G,<X>   0   .   DP=2;I16=1,0,1,0,30,900,23,529,37,1369,25,625,25,625,1,1;QS=1,1,0;SGB=-0.516033;RPB=1;MQB=1;BQB=1;MQ0F=0    PL:DP   23,3,0,23,3,23:1    0,3,30,3,30,30:1

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.1 years ago by chandimafdo • 0

0

Entering edit mode

If it's not enough, I can email you my whole vcc file. Please let me know.

ADD REPLY • link 9.1 years ago by chandimafdo • 0

1

Entering edit mode

Thanks for the record.

If I take your example in OP:

Reference: CTAAGTA
--------------^
Sample 1: CTACGTA
-------------^
Sample 2: CTATGTA
-------------^

VCF would look like this after alignment:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT /sorted_bam/A01.bam sorted_bam/A02.bam
FHUJM:00049:00085 4 . A C,T 0 . DP=2;I16=1,0,1,0,30,900,23,529,37,1369,25,625,25,625,1,1;QS=1,1,0;SGB=-0.516033;RPB=1;MQB=1;BQB=1;MQ0F=0 PL:DP 23,3,0,23,3,23:1 0,3,30,3,30,30:1

which means

you have A, at position 4, in reference (Ref column)
C and T are at position 4 (Alt column) as SNPs for two samples.

In general, Sample SNPs (bases) share the same position as Reference bases for each line. But do remember that this position is calculated only after alignment and further pruning steps. I hope I understand your question. If not, do let us know.

However let us say, you already have a single VCF with multiple samples, you would like to extract records for only one of the samples, then you can following tool and command:

 java -jar GenomeAnalysisTK.jar \
   -T SelectVariants \
   -R <your_reference_fasta> \
   -V <your_input_vcf> \
   -o <expected_output_vcf> \
   -sn <your_sample_name>

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.1 years ago by cpad0112 21k