SNP position in sample sequences
3
0
Entering edit mode
9.1 years ago

Hi,

I'm doing a SNP identification using samtools mpileup tool. I use one reference file and 3 sample files ( all s bam files are run together so I get one vcf file ). I was able to get the .vcf file as the output. The POS in

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT

gives me the SNP position in the reference genome. However I want get the SNP positions at each of the sample files. Does mpileup has a command in it ? If not, what can I do? ( I read about -D, but it gives the sample read depth ). I use Python as the programming language.

mpileup SNP samtools • 2.8k views
ADD COMMENT
2
Entering edit mode
9.1 years ago

run mpileup with the 3 bams (!= run 3 mpileups with one bam)

ADD COMMENT
0
Entering edit mode

Thank you for the reply. I do run with the 3 bams.

bin/samtools mpileup -g -f "/reference.fasta" -d 8000 "S1.bam" "S2.bam" "S3.bam"

Assume,

reference : CTAAGTA ( position would be 4, this can be found under #CHROM POS in vcf output file )
---------------^
Sample 1: -CTACGTA ( what would be the position here ? is it same as reference or different? I need the position of C this case )
--------------^
Sample 2: -CTATGTA ( similar to the above, what would be the position of T in sample2 sequence )
--------------^
ADD REPLY
0
Entering edit mode

Could you please post 1 or 2 lines from your output? (without header lines (starting with ##), if it is VCF).

ADD REPLY
0
Entering edit mode
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  /sorted_bam/A01.bam sorted_bam/A02.bam
FHUJM:00049:00085   89  .   T   G,<X>   0   .   DP=2;I16=1,0,1,0,30,900,23,529,37,1369,25,625,25,625,1,1;QS=1,1,0;SGB=-0.516033;RPB=1;MQB=1;BQB=1;MQ0F=0    PL:DP   23,3,0,23,3,23:1    0,3,30,3,30,30:1
ADD REPLY
0
Entering edit mode

If it's not enough, I can email you my whole vcc file. Please let me know.

ADD REPLY
1
Entering edit mode

Thanks for the record.

If I take your example in OP:

Reference: CTAAGTA
--------------^
Sample 1: CTACGTA
-------------^
Sample 2: CTATGTA
-------------^

VCF would look like this after alignment:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT /sorted_bam/A01.bam sorted_bam/A02.bam
FHUJM:00049:00085 4 . A C,T 0 . DP=2;I16=1,0,1,0,30,900,23,529,37,1369,25,625,25,625,1,1;QS=1,1,0;SGB=-0.516033;RPB=1;MQB=1;BQB=1;MQ0F=0 PL:DP 23,3,0,23,3,23:1 0,3,30,3,30,30:1

which means

  • you have A, at position 4, in reference (Ref column)
  • C and T are at position 4 (Alt column) as SNPs for two samples.

In general, Sample SNPs (bases) share the same position as Reference bases for each line. But do remember that this position is calculated only after alignment and further pruning steps. I hope I understand your question. If not, do let us know.

However let us say, you already have a single VCF with multiple samples, you would like to extract records for only one of the samples, then you can following tool and command:

 java -jar GenomeAnalysisTK.jar \
   -T SelectVariants \
   -R <your_reference_fasta> \
   -V <your_input_vcf> \
   -o <expected_output_vcf> \
   -sn <your_sample_name>
ADD REPLY
1
Entering edit mode
9.1 years ago
sunhanice ▴ 240

There should be several columns after FORMAT column, each containing one sample.

For example, in this line of vcf:

5       11956   snp_5_11956     G       C       100.00  PASS    AA=.;AC=56;AF=0.03;AFR_AF=0.11;AMR_AF=0.01;AN=2184;AVGPOST=0.9907;DAF_GLOBAL=.;ERATE=0.0005;GERP=.;LDAF=0.0273;RSQ=0.8888;SNPSOURCE=LOWCOV;THETA=0.0186;VT=SNP;ANNOTATION_CLASS=ACTIVE_CHROM;CELL=GM12878;CHROM_STATE=15        GT:GL:DS:PP:BD  0/0:.:.:1,0,0:0 0/1:.:.:0.031,0.969,0:0.969

There are two samples. One is 0/0:.:.:1,0,0:0 and the other is 0/1:.:.:0.031,0.969,0:0.969.

The values separated by colon are indicated by the FORMAT column, GT:GL:DS:PP:BD

ADD COMMENT
0
Entering edit mode

Thank you for the reply, However I read the vcf specification for genotype data 1.4.2. But none of the fields contain the position of the sample sequence at each SNP.

ADD REPLY
0
Entering edit mode

Each line has one position. Some samples at this position have SNP (0/1, 1/1), and some samples not (0/0).

ADD REPLY
0
Entering edit mode
9.1 years ago

If you already have the 3 VCF files (from each sample), try merging VCFs. There are several tools to merge VCFs.

ADD COMMENT

Login before adding your answer.

Traffic: 2046 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6