Reconstruct SNP haplotypes from Illumina short reads
2
0
Entering edit mode
3.1 years ago
rio.simon56 ▴ 10

I have whole genome sequencing data (Illumina paired-end 150bp) for a diverse set of highly polyploid accessions with various levels of ploidy (2n=6-16x). I would like to know if there is a method to get sequencing depths for combination of alleles that are shared on the same read.

For example, for three SNPs with two alleles A and G within 100bp, a given individual could have have the following SNP genotyping :

  • SNP1: depth_A=40, depth_G=60
  • SNP2: depth_A=10, depth_G=90
  • SNP3: depth_A=50, depth_G=50

I would like to obtain the following haplotype genotyping:

  • Haplotype: depth_AAA=5, depth_AAG=2,...,depth_GGG=40

I hope this was clear, thanks for your help.

SNP haplotype • 1.7k views
ADD COMMENT
1
Entering edit mode
3.1 years ago

Not easy ! I think you'll have to write your own script, for example in pysam, after first filtering BAMs for a region of interest.

You can have a look at these tools for read-backed phasing, which go somewhat in your direction - but AFAIK are not appropriate for your non 2n ploidy samples.

https://github.com/secastel/phaser

https://github.com/whatshap/whatshap

ADD COMMENT
1
Entering edit mode
3.1 years ago

I quickly wrote http://lindenb.github.io/jvarkit/BamToHaplotypes.html

the input is an indexed VCF (only diallelic SNP are considered) and a BAM file.

example:

$ java -jar dist/bam2haplotypes.jar -V src/test/resources/rotavirus_rf.vcf.gz src/test/resources/S5.bam

#CHROM  START   END COUNT   N-VARIANTS  (POS\tALT)+
RF03    1221    1242    11  2   1221    C   1242    C
RF03    1688    1708    5   2   1688    G   1708    T
RF04    1900    1920    4   2   1900    C   1920    A
RF06    517 543 9   2   517 C   543 G
RF06    668 695 4   2   668 G   695 T
RF08    926 992 2   2   926 C   992 G
RF09    294 317 6   2   294 T   317 A
RF10    139 175 1   2   139 T   175 G
RF10    139 175 3   2   139 T   175 C

in paired mode

samtools collate -O -u src/test/resources/S5.bam TMP | java -jar dist/bam2haplotypes.jar --paired -V src/test/resources/rotavirus_rf.vcf.gz

#CHROM  START   END COUNT   N-VARIANTS  (POS\tALT)+
RF02    251 578 1   2   251 A   578 G
RF03    1221    1688    1   2   1221    C   1688    G
RF03    1221    1242    7   2   1221    C   1242    C
RF03    1221    1688    1   2   1221    C   1688    G
RF03    1688    1708    1   2   1688    G   1708    T
RF03    1708    2150    1   2   1708    T   2150    T
RF03    1221    1708    1   3   1221    C   1688    G   1708    T
RF03    1221    1688    2   3   1221    C   1242    C   1688    G
RF03    1688    2150    1   3   1688    G   1708    T   2150    T
RF03    1221    1708    2   4   1221    C   1242    C   1688    G   1708    T
RF04    887 1241    1   2   887 A   1241    T
RF04    1900    1920    4   2   1900    C   1920    A
RF05    41  499 2   2   41  T   499 A
RF05    499 879 1   2   499 A   879 C
RF05    795 1297    2   2   795 A   1297    T
RF05    879 1297    2   2   879 C   1297    T
RF06    517 543 9   2   517 C   543 G
RF06    668 695 4   2   668 G   695 T
RF07    225 684 1   2   225 C   684 G
RF07    225 684 1   2   225 C   684 T
RF08    926 992 2   2   926 C   992 G
RF09    294 317 6   2   294 T   317 A
RF10    139 175 1   2   139 T   175 G
RF10    139 175 3   2   139 T   175 C

I didn't write any test, so tell me if you think something is wrong.

ADD COMMENT
0
Entering edit mode

Thanks a million for this! I think I would have needed several months to code something comparable.

I have tried to run it on a test VCF dataset of my own with one individual and two SNPs:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Test_Ind
Sh05    1738002 .   T   C   .   .   AC=5;AN=12  GT:AD:DP:GC 0/0/0/0/0/0/0/1/1/1/1/1:122,89:211:168.534
Sh05    1738019 .   G   A   .   .   AC=3;AN=12  GT:AD:DP:GC 0/0/0/0/0/0/0/0/0/1/1/1:159,61:220:8.92207

With the proposed method, I obtain:

#CHROM  START   END     COUNT   N-VARIANTS      (POS\tALT)+
Sh05    1738002 1738019 78      2       1738002 C       1738019 G
Sh05    1738002 1738019 1       2       1738002 T       1738019 G
Sh05    1738002 1738019 1       2       1738002 T       1738019 A
Sh05    1738002 1738019 6       2       1738002 T       1738019 G
...
[several dozen lines]
...
Sh05    1738002 1738019 2       2       1738002 T       1738019 G
Sh05    1738002 1738019 2       2       1738002 T       1738019 A

There seems to be an issue in that it aggregates results only for the first category "CG". I don't know to which extent this is complicated to solve?

Otherwise the results are very comprehensive, thanks again.

ADD REPLY
1
Entering edit mode

After some exchanges with Pierre, the program has been updated and the problem is now solved:

#CHROM  START   END     COUNT   N-VARIANTS      (POS\tALT)+
Sh05    1738002 1738019 78      2       1738002 C       1738019 G
Sh05    1738002 1738019 53      2       1738002 T       1738019 A
Sh05    1738002 1738019 59      2       1738002 T       1738019 G

Thanks a lot to him!

ADD REPLY

Login before adding your answer.

Traffic: 1736 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6