My group was able to retrieve the path matrix of haplotype ids for imputed lines using rPHG, but I was curious if there is a way to retrieve something similar for a single line directly from the database? In the paths table I see paths_data, but was unable to extract anything from the blob.
If I understand correctly, you have been able to pull the paths data from the db, but not decode it.
The "paths_blob" contains a snappy compressed list of int, where the first value is the number of hapids, followed by the hapids themselves. When this list is needed by the PHG code, we call the public function "decodePathArray".
To retrieve the values for a single line directly from the database, you'd need to do the following (you may have already done these steps):
1. Identify the genoid associated with your genome name: "select genoid from genotypes where line_name='<your line name>';" (replace the value inside the single quotes with your genome/line name).
2. Identify the method_id used when creating that path, from the methods table: "select method_id from methods where name='<method name used to create the path>';" (replace the value inside the single quotes with your method name).
3. Query the paths table for the paths_data blob: "select paths_data from paths where genoid=<value from 1 above> and method_id=<value from 2 above>;"
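If it helps, the three lookups can also be combined into a single query with joins (a sketch using only the tables and columns mentioned above; adjust names as needed):
SELECT p.paths_data FROM paths p INNER JOIN genotypes g ON p.genoid = g.genoid INNER JOIN methods m ON p.method_id = m.method_id WHERE g.line_name = '<your line name>' AND m.name = '<your method name>';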
You then need to decode the paths data. The Java method we use internally to decode this data is below. Perhaps you can integrate it into a script, translating to R or Python if needed:
import java.io.IOException;
import java.nio.ByteBuffer;
import org.xerial.snappy.Snappy;

public static int[] decodePathsArray(byte[] dataAsByteArray) {
    ByteBuffer bb;
    try {
        bb = ByteBuffer.wrap(Snappy.uncompress(dataAsByteArray));
    } catch (IOException e) {
        throw new IllegalStateException("decodePathsArray: could not uncompress dataAsByteArray");
    }
    bb.rewind();
    int bbSize = bb.getInt(); // number of hapids is the first value stored in the ByteBuffer
    int idx = 0;
    int[] paths = new int[bbSize];
    while (bb.hasRemaining()) {
        // each remaining entry is a hapid on the path
        paths[idx] = bb.getInt();
        idx++;
    }
    return paths;
}
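If it is useful, here is a minimal Python sketch of the same logic, assuming the python-snappy package is installed (Java's ByteBuffer defaults to big-endian, hence the ">" format codes):

import struct
import snappy  # pip install python-snappy

def decode_paths_array(data_as_bytes):
    raw = snappy.uncompress(data_as_bytes)
    # the first big-endian int is the number of hapids
    n = struct.unpack_from(">i", raw, 0)[0]
    # the remaining ints are the hapids on the path
    return list(struct.unpack_from(">%di" % n, raw, 4))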
This was very helpful, thanks again! I have a few follow-up questions:
I was able to decompress the paths blob in the database for each line, and they match the haplotype ids in the path matrix we retrieved using rPHG. However, the paths from the database were in numerical order, not the order they appear in the path matrix. The reason I wanted to check the paths blob was that I noticed that as I move through the path matrix for a line, the asm coordinates of the haplotypes are not always in increasing order.
For example, in the path matrix for this line, these three haplotype ids have asm_end_coordinates of 253,942,702; 243,675,190; and 243,675,191, respectively. In the path matrix, these are the last three chromosome 2 haplotype ids before moving on to chromosome 3 haplotypes. Is this expected? When SNPs are called using pathToVCF and a bed file with positions, are all of the haplotypes in a line's path considered?
The assembly coordinates are for the assembly segments that align to those particular ranges in the reference genome. The coordinates can be out of order for (at least) two reasons. First, they might be from different assemblies. The assembly coordinates for any reference range are generally different for every assembly. Second, they might reflect structural rearrangement between an assembly and the reference.
When SNPs are called using a bed file of positions, only the haplotypes containing those positions are used. Briefly, the identity of the haplotype's assembly is used to retrieve SNP values from that assembly's gVCF.
So if I understood correctly, the assembly coordinates basically tell us what part of the assembly aligns to a certain range on the reference. When a bed file with positions is used, as long as a haplotype has sequence aligned to a specific position in the reference, a SNP can be called regardless of its asm coordinates?
Also, for a consensus haplotype, I know we can find the parent of that haplotype using the database, but is there a way to see which genoids were similar enough to that haplotype to be a part of that consensus group?
If you have access to the database, you can query the taxa that make up a consensus haplotype using the gamete_grp_id for that haplotype. If you have the haplotype id, you can get the gamete_grp_id from the db via the query:
select gamete_grp_id from haplotypes where haplotypes_id=<your haplotypes_id>;
Then use the returned value in the query below.
For example, let's say your haplotypes_id is 123, you would query:
select gamete_grp_id from haplotypes where haplotypes_id=123;
Assume the response to the above query was 3862. You could use that value in the query below to see the line names of all taxa that were included in your consensus haplotype. Note that sometimes only a single line name is returned. This happens when the taxa sequences at this haplotype were too divergent to be grouped together.
The example below shows the query and the results returned from one of our maize databases:
SELECT gamete_haplotypes.gamete_grp_id, genotypes.line_name FROM gamete_haplotypes INNER JOIN gametes ON gamete_haplotypes.gameteid = gametes.gameteid INNER JOIN genotypes ON gametes.genoid = genotypes.genoid WHERE gamete_haplotypes.gamete_grp_id=3862;
gamete_grp_id | line_name
---------------+-------------------
3862 | CML103_anchorwave
3862 | W22_anchorwave
3862 | CML56_anchorwave
3862 | Tx779_anchorwave
(4 rows)
I would also like to know: is it possible to add fastq files and impute paths for just the added taxa after we have completed imputation for a set of lines?
Yes. The path for a line can only be added to the database once for any path method name. Any taxa in a key file that already have paths are skipped and only the new ones are imputed.
Thank you again. I discovered that one of the founders for our population was not included in our original PHG. I realize I will have to go back a few steps, create consensus haplotypes, and impute again. I saved a database before the consensus and imputation steps, and I was wondering if I can add a founder to this database without having to create haplotypes for the other founders again.
Also, we've been trying to do this on AWS. On the path-finding step, for some reason our costs are higher than expected. We're not exactly sure why, but we think that if we move the necessary components to scratch storage we can avoid some of these costs, which we assume come from reading and writing through our file system. I was wondering whether the PHG needs both the pangenome and the gvcf files for this step, so that we're not copying files over to scratch that aren't necessary.
Micah - I"m not sure of your question. You can certainly add haplotypes for a new founder to the db. If you need that founder to be part of your consensus, you would have to re-run consensus. Non-consensus haplotypes for the other founders do not need to be recreated.
Your second question related to AWS: I'm a little unclear on what you're doing. What is the command you are running?
For the second part, this is occurring when we're imputing from fastq files (homozygous). We used ImputePipelinePlugin -imputeTarget path -endPlugin. This command works, but we believe the program is reading and writing a lot of data in our file system even though the changes to the files themselves aren't huge.
We're thinking that if we copy the files being read and written during this step to scratch space, we may avoid some of these costs. If I understand correctly, minimap2 maps reads to the pangenome.fa and not the gvcf files?
Yes, minimap2 aligns reads to the pangenome fasta. I'm not familiar with how AWS computes cost, but it is certainly worth trying to move the files to scratch.
Hello again,
For an imputed line, is there a way to see mapping statistics at a specific reference range in its path? Would that be outputSecondaryStats=true? The reason we are wondering is that there are a few reference ranges in our path matrix where most lines have a haplotype from a single founder. Given that we expect this population to have ~1% of this founder, we think it's unlikely that our gbs data are all mapping to this haplotype.
I am also curious whether there is something I should adjust in the imputation config to avoid this issue, and whether that would be in the read-mapping part of the config or in the path-finding part.
Micah - setting "outputSecondaryStats" = true results in a file written with statistics on a per-haplotype basis. The columns of this file are:
hapId hapLength count totalNM totalAS totalDE listOfStartPos
The "count" is the number of reads that hit that a given set of haplotype ids. The NM, AS and DE values come from the SAM file.
I'll let pjb39 answer relating to your population question.
Micah, there are a couple of things to try. One is to set probCorrect to 0.95. That is the probability that a gbs read has been mapped correctly. If the problem is that something about one specific reference range (structural variation?) is causing reads to mismap, then that is not likely to help, though. Another thing to try would be to use the FilterGraphPlugin prior to path finding to remove the range in question so that it is not used for imputation. That would require running BestHaplotypePathPlugin outside of the ImputePipelinePlugin. We can suggest commands for doing that if you are interested.
Yes, I think I would like to try BestHaplotypePathPlugin if you have any suggested commands. Below is the config I used for ImputePipelinePlugin -imputeTarget path -endPlugin. I was wondering if I should try lowering maxRefRangeErr as well. Also, my goal is to have about 2000 maize lines imputed; is there a way to speed up this process? It looks like it will take about 10 days to run all of these lines. I didn't know if I should try increasing threads, or if I could copy the database, run imputation on different sets of lines, and concatenate the databases where they differ.
Also, should I be concerned if there is only one entry in my haplotype_list table when I have hundreds of lines in my read_mapping table? I realize this thread is getting long, so I'll make another post if I have another question.
--- Used for mapping reads
inputType=fastq
readMethod=gbs_data
keyFile=Current_PHG/phg/updatedReadMapping_key_file.txt
fastqDir=Current_PHG/phg/inputDir/imputation/fastq/
samDir=Current_PHG/phg/inputDir/imputation/sam/
lowMemMode=true
maxRefRangeErr=0.45
outputSecondaryStats=false
maxSecondary=20
fParameter=f5000,6000
minimapLocation=minimap2
--- Used for path finding
pathHaplotypeMethod=consensus
pathMethod=consensus
overwrite=true
maxNodes=100000
maxReads=1000000
minReads=0
minTaxa=1
minTransitionProb=0.001
numThreads=15
probCorrect=0.99
removeEqual=true
splitNodes=true
splitProb=0.99
usebf=false
maxParents=1000000
minCoverage=1.0
parentOutputFile=optional
--- Used by haploid path finding only
usebf=false
minP=0.8
--- Used by diploid path finding only
maxHap=11
maxReadsKB=100
algorithmType=efficient
Before I suggest commands for using BestHaplotypePathPlugin, I have a few additional questions. First, I notice that some of the method names are "consensus". At this time, we are recommending users not use consensus haplotypes unless they have a specific reason to do so. However, if the mxDiv value used to generate the consensus haplotypes was small enough (<=0.0001) then the difference would be minimal. Another item is that if the lines you are imputing are from a population with a limited number of founders, it can help to restrict the parents/founders used for imputation.
Also, if the founder haplotype in your earlier question is being grouped with other haplotypes, that might explain why you are seeing it more often than you expect.
With regard to speed, mapping the reads to the assembly haplotypes is the slow part and is affected by the size/coverage of the fastq files. However, it only needs to be done once. Running again with different path-finding parameters is much faster. Also, the overwrite parameter does not do anything; when you re-run path finding with different parameters you need to change the pathMethod (but NOT the pathHaplotypeMethod), as in the example below.
To recap, my specific questions are (1) do you want to restrict the assemblies used for imputing paths, and (2) are you imputing inbred lines (so that haploid rather than diploid path finding is best)?
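For example, a second path-finding run with different parameters might use (the new method name here is only an illustration; any name not already used for a path method works):
pathHaplotypeMethod=consensus
pathMethod=consensus_probCorrect95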
Thank you for the detailed response. These are doubled haploid maize lines, so haploid would be best. Was my config asking the PHG to do both? Our founders are the 26 NAM founders, Mo17, and 11 B73 x Teosinte BC1s, so 38 in total. I believe we do not want to restrict the assemblies used for imputing paths; rather, we were hoping to filter out reference ranges we were not as confident about (say, where an imputed line does not have gbs data mapping to a reference range). We are in the process of adding Mo17 as a founder because we did not realize it was a parent of the synthetic population that created our DH lines. I assumed we would have to map the lines we are imputing again because the pangenome file will change.
And just to be clear, you do not recommend that we create or use consensus haplotypes at this time? That would certainly take some of the guesswork out on our end when trying to determine founder contribution percentages.
I ask about using haploid paths only because I did not want to assume that was the correct choice. For doubled haploids it is certainly the right method to use. The only reason to restrict the number of founders is when you have a population generated from a limited number of founders, which sounds like it is not the case here.
You do not need to use consensus for imputation. One reason you might want to is that, if the BC1s have been phased, about 3/4 of the haplotypes would be from B73 and you might want to collapse those to a single haplotype. That might have some advantage for interpretation and would make the pangenome fasta a bit smaller. But I don't think those are compelling reasons, so I would recommend not using consensus. I plan to put together the commands for running BestHaplotypePathPlugin early next week. Before that, you can run the ImputePipelinePlugin with imputeTarget map, which will do the read mappings but not impute paths. Be sure to change the value for readMethod if you use raw haplotypes rather than consensus.
After talking to the PHG developers, we agreed the best way to proceed will be to use kmer read mapping, which is a relatively recent addition to the PHG. It is much faster and requires fewer compute resources than minimap2. However, it is not supported by the ImputePipelinePlugin. Roughly speaking, the steps involved are creating a kmer index file, using kmers to map reads, and then using the BestHaplotypePathPlugin to find paths. I have added instructions for creating the kmer index at https://bitbucket.org/bucklerlab/practicalhaplotypegraph/wiki/UserInstructions/KmerIndex. I will add instructions to the wiki for the subsequent steps over the next few days.
Ah, this is great. Thank you!