Hello,
Thank you for the previous answers to my questions here (https://www.biostars.org/p/135443/#136088). I have a couple of follow up questions.
Once SNPs are discovered, predictions are listed in the fasta files, and they are typically 60bp, with the SNP located in the middle of the prediction. This is true whether I use 100bp reads or 50bp reads. Now, coherence is established by splitting the 60bp prediction into k-mers (k=31) and assesing their k-mer coverage by all samples allowing one mismatch. And if at least one sample has a k-coverage less than 4, the SNP is considered uncoherent.
I was wondering if the length of the prediction can have an effect on coherency? Should I try to lower the k-mer size for shorter reads?
I was also wondering what is the effect of various coverage in different samples? Should we aim to subsample to the same coverage level?
Lastly, I tried genotype discovery with a -P of 6, are the genotypes the ones reported in the fasta file? In some regions I expect more than 2 genotypes.
Thank you,
Adrian
Thank you for the response! A bit hard to visualize all of this so I inspected the data. I notice right away that the prediction length is 2k-1 under default parameters, and all predictions are the same length. However, when
-P
or-b
are changed, we see that not all predictions have the same length, and we do not necessarily have 2k-1 as a length of prediction. With-P 6
I see the minimum prediction is 61 and maximum is 149. With-P 6 -b 1
, predictions range from 48 to 186bp. I have not yet started playing with-T
or-t
. I am running-b 2
now, but it seems to take a while, probably expected.My dataset consists of 8 populations sequenced at 100bp PE. However, I feed the reads to DiscoSNP as a single file and it treats them all as single end, should I split them up and feed them as PE?
Since the coverage of the genome of interest are variable, I mapped the reads to the genome (
bwa aln -n 0.1
), extracted them, and subsampled to the lowest coverage, 27x, now they all have even coverage which is easier to work with. I had to map to the genome also because aside from my genome of interest (a microorganism), there are also other genome in those populations, contaminants. Let me know if this was not a good idea.I started by analyzing variants of rank=1.0 under default parameters, i.e. simply running discoSNP with the reads as argument. I found 3 rank=1.0 variants. What I noticed, is that for one of these Variants, the reference sequence in the assembled genome is haplotype "T-C-C". The predicted high and low are "A-T-T" and "A-C-T" respectively. What is odd is the fact that the reference haplotype did not get assembled into a prediction, I checked all predictions. After all, this reference haplotype is the predominant form and has reads supporting it in all populations. I then went back to the other parameter predictions, and found that the reference haplotype was present when using
-P 6 -b 1
.I am also really interested in finding homozygous SNPs between populations, such that you only have allele X in some populations and allele Y in others. The allele X and Y should be homozygous, that is, no other alleles should be present in those populations. Any advice on how to go about finding such markers?
Hi Adrian, sorry for the delay.
If your dataset is composed of 8 (or 2*8) read files, the best is to feed disco with these separated files (take care to use option -m and to indicate paired files consecutively in the command line). You'll obtain a much richer result (coverage per read file, rank w.r.t ability of each variant to discriminate read sets).
I think your mapping approach is a nice clever idea. You may reduce
c
to-c 2
or-c 3
with a 27x coverage.I don't understand your point about the T-C-C haplotype. In particular I don't understand the A-T-T versus A-C-T haplotypes. Would you mind posting or sending me (pierre.peterlongo@inria.fr) these 3 variants ?
The homozygous variants (with distinct allele in at least 2 read sets) are those having a rank equal or close to one. By feeding discoSnp++ with you 8 individuals, you'll have access to this piece of information.
Pierre