No could not fit >15 [Assemblied by SOAPdenovo]
5
0
Entering edit mode
10.5 years ago

Hi folks. I am trying to use your software KmerGenie. I have some problems. Because it does not draw the histograms at >15 value. It is hilarious when i put without --diploid command. It could tell me the best k but without doing at >15. I tried from 15 to 55 because at more k (to121K) it is more slower (1h30min) at the result is the same "could not fit".

sudo ./kmergenie /home/soba/out_WGS_k25_quitado.fasta --diploid  -l 15 -k 55  -o prueba
running histogram estimation
Linear estimation: ~15 M distinct 38-mers are in the reads
K-mer sampling: 1/2
|

processing                                                                                         |
[going to estimate histograms for values of k: 55 45 35 25 15
---------------------------------------------------------------------------------Total time Wallclock  34.736 s
fitting model to histograms to estimate best k
could not fit prueba-k15.histo
could not fit prueba-k35.histo
could not fit prueba-k25.histo
could not fit prueba-k55.histo
could not fit prueba-k45.histo
could not predict a best k value
Execution of decide failed (return code 0)

sudo ./kmergenie /home/soba/Documentos/out_WGS_k25_quitado.fasta   -l 15 -k 55  -o prueba
running histogram estimation
Linear estimation: ~15 M distinct 38-mers are in the reads
K-mer sampling: 1/2
| processing                                                                                         |
[going to estimate histograms for values of k: 55 45 35 25 15
---------------------------------------------------------------------------------Total time Wallclock  35.6302 s
fitting model to histograms to estimate best k
could not fit prueba-k25.histo
could not fit prueba-k35.histo
could not fit prueba-k45.histo
could not fit prueba-k55.histo
estimation of the best k so far: 15
refining estimation around [15; 21], with a step of 2
running histogram estimation
Linear estimation: ~19 M distinct 21-mers are in the reads
K-mer sampling: 1/3
| processing                                                                                         |
[going to estimate histograms for values of k: 21 19 17 15
---------------------------------------------------------------------------------Total time Wallclock  22.5893 s
fitting model to histograms to estimate best k
could not fit prueba-k19.histo
could not fit prueba-k17.histo
could not fit prueba-k21.histo
could not fit prueba-k25.histo
could not fit prueba-k45.histo
could not fit prueba-k35.histo
could not fit prueba-k55.histo
table of predicted num. of genomic k-mers: prueba.dat
recommended coverage cut-off for best k: 18
best k: 15
KmerGenie • 4.9k views
ADD COMMENT
3
Entering edit mode

Just a note: try not to run programs as root (e.g. do not use sudo).

ADD REPLY
0
Entering edit mode

Ok. Now it showed the line was failed.

./kmergenie /home/soba/Documentos/out_WGS_k25_quitado.fasta   -l 15 -k 55  -o prueba
running histogram estimation
Linear estimation: ~15 M distinct 38-mers are in the reads
K-mer sampling: 1/2
| processing                                                                                         |
[going to estimate histograms for values of k: 55 45 35 25 15
---------------------------------------------------------------------------------Total time Wallclock  33.6264 s
fitting model to histograms to estimate best k
could not fit prueba-k25.histo
could not fit prueba-k35.histo
could not fit prueba-k45.histo
could not fit prueba-k55.histo
Traceback (most recent call last):
  File "/home/soba/Documentos/Ensambladores/kmergenie-1.6663/scripts/decide", line 115, in <module>
    generate_report(histograms_prefix, best_k, max_genomic_kmers, None, is_diploid)
  File "/home/soba/Documentos/Ensambladores/kmergenie-1.6663/scripts/generate_report.py", line 170, in generate_report
    with open(output,'w') as f:
IOError: [Errno 13] Permission denied: 'prueba_report.html'
Execution of decide failed (return code 1)

./kmergenie /home/soba/Documentos/out_WGS_k25_quitado.fasta --diploid  -l 15 -k 55  -o prueba
running histogram estimation
Linear estimation: ~15 M distinct 38-mers are in the reads
K-mer sampling: 1/2
| processing                                                                                         |
[going to estimate histograms for values of k: 55 45 35 25 15
---------------------------------------------------------------------------------Total time Wallclock  35.2915 s
fitting model to histograms to estimate best k
could not fit prueba-k25.histocould not fit prueba-k35.histo

could not fit prueba-k15.histo
could not fit prueba-k55.histo
could not fit prueba-k45.histo
cannot generate histogram for file prueba-k15.histo stdout:  ; stderr: Error en if (zp.copy <= 1 | shape.e <= 0 | scale.e <= 0 | sd.v <= 0 |  :
  valor ausente donde TRUE/FALSE es necesario
Calls: source -> withVisible -> eval -> eval -> model
EjecuciĆ³n interrumpida

Traceback (most recent call last):
  File "/home/soba/Documentos/Ensambladores/kmergenie-1.6663/scripts/decide", line 115, in <module>
    generate_report(histograms_prefix, best_k, max_genomic_kmers, None, is_diploid)
  File "/home/soba/Documentos/Ensambladores/kmergenie-1.6663/scripts/generate_report.py", line 170, in generate_report
    with open(output,'w') as f:
IOError: [Errno 13] Permission denied: 'prueba_report.html'
Execution of decide failed (return code 1)
ADD REPLY
0
Entering edit mode

this one is due to a failure to overwrite the (possibly root) file prueba_report.html

ADD REPLY
0
Entering edit mode

It is me, the same. So, May it be a problem with R program? What are the correct installed packages?

$ uname -a
Linux soba 3.12-1-amd64 #1 SMP Debian 3.12.9-1 (2014-02-01) x86_64 GNU/Linux

$ R
R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
ADD REPLY
0
Entering edit mode

Thanks for reporting this issue. Could you please send the prueba_report.html file? I would like to see what the histograms that didn't fit look like. It seems to be a tiny dataset.

ADD REPLY
0
Entering edit mode

Yes, that is. It is a small dataset 25,770,300pb. I have sent email. Regards.

ADD REPLY
1
Entering edit mode
10.5 years ago
Rayan Chikhi ★ 1.5k

Thanks much for sending me the histograms by email.

It appears that you data set is very low coverage. If you do not mind me posting one of the histograms here:

I wrote a paragraph below but I now realize that your problem might have a more simple explanation: did you give the reads, or the SOAPdenovo assembly, as input to Kmergenie? Only the former is correct, and I suspect that you did the latter.


Kmergenie expects to see a mixture of distributions in a k-mer histogram constructed from the reads, including one (or two in diploid mode) Gaussian distribution for genomic k-mers. In this histogram, we clearly see a large amount of low-coverage k-mers (with more or less an exponentially decreasing distribution), that Kmergenie assumes to be sequencing errors, but no clear Gaussian distribution that should have appeared at higher abundances. Hence Kmergenie was not able to fit its model to this histogram. The effect is even more significant for the larger k values you tested (25, etc..).

ADD COMMENT
1
Entering edit mode
10.5 years ago
5heikki 11k

I would try to address:

File "/home/soba/Documentos/Ensambladores/kmergenie-1.6663/scripts/generate_report.py", line 170, in generate_report
    with open(output,'w') as f:
IOError: [Errno 13] Permission denied: 'prueba_report.html'

By:

chown and/or chmod +x /home/soba/Documentos/Ensambladores/kmergenie-1.6663/scripts/generate_report.py
ADD COMMENT
0
Entering edit mode
10.5 years ago

Ok. I failed. Now I am running with one of the two fastaq without assembly which I deleted the bad quality reads<30. After I m going to talk you how the analysis worked.

ADD COMMENT
0
Entering edit mode
10.5 years ago

Hi. I guess that the k-best at 17 it is an artifact. Also there are two peaks one at 42 and other at 72. What do you think about?

Besides I wonder why don't work with the --diploid option.

Other question. I have the genome in two files. Have I to do the analysis with every file? Or I have to collapse both in a single file.

I appreciate much your advice for a newcomer.

./kmergenie /home/soba/Documentos/Ensambladores/mirdeep2/tutorial_dir/SRR653456_2_e.fastq --diploid  -l 15 -k 55  -o prueba2
running histogram estimation
Linear estimation: ~240 M distinct 38-mers are in the reads
K-mer sampling: 1/45
| processing                                                                                         |
[going to estimate histograms for values of k: 55 45 35 25 15
-----------------------------------------------------------------------------------------------------------------------Total time Wallclock  289.346 s
fitting model to histograms to estimate best k
could not fit prueba2-k15.histo
could not fit prueba2-k25.histo
could not fit prueba2-k35.histo
could not fit prueba2-k55.histo
could not fit prueba2-k45.histo
could not predict a best k value
Execution of decide failed (return code 0)

Here the success trial.

./kmergenie /home/soba/Documentos/Ensambladores/mirdeep2/tutorial_dir/SRR653456_2_e.fastq   -l 15   -o prueba2

running histogram estimation
Linear estimation: ~191 M distinct 51-mers are in the reads
K-mer sampling: 1/36
| processing                                                                                         |
[going to estimate histograms for values of k: 81 71 61 51 41 31 21
-----------------------------------------------------------------------------------------------------------------------Total time Wallclock  396.958 s
fitting model to histograms to estimate best k
estimation of the best k so far: 21
refining estimation around [15; 27], with a step of 2
running histogram estimation
Linear estimation: ~296 M distinct 24-mers are in the reads
K-mer sampling: 1/56
| processing                                                                                         |
[going to estimate histograms for values of k: 27 25 23 21 19 17 15
-----------------------------------------------------------------------------------------------------------------------Total time Wallclock  444.631 s
fitting model to histograms to estimate best k
table of predicted num. of genomic k-mers: prueba2.dat
recommended coverage cut-off for best k: 6
best k: 17

ADD COMMENT
0
Entering edit mode

Did you mean that you have two read files? You should input all the reads that you'd give to an assembler at once, by giving Kmergenie a list of files:

ls -1 *.fastq > list_files

./kmergenie list_files

From what you just posted, Kmergenie estimates a genome size of ~150 Kbp. I don't think it is this what you expect, right? If so, then I'd like to see the new plotted histograms, e.g. for k=31.

ADD REPLY
0
Entering edit mode
10.4 years ago

You are right. The size is around 25.000kpb. I run the test with -l 15 and -l 29 with a unique merged file. And the second trial is said that the best k is 35.

TRIAL 1

./kmergenie /home/soba/Documentos/Ensambladores/mirdeep2/tutorial_dir/list_filesTY.fastq  -l 15   -o trial_l15
running histogram estimation
Linear estimation: ~379 M distinct 51-mers are in the reads
K-mer sampling: 1/72
| processing                                                                                         |
[going to estimate histograms for values of k: 81 71 61 51 41 31 21
-----------------------------------------------------------------------------------------------------------------------Total time Wallclock  815.1 s
fitting model to histograms to estimate best k
estimation of the best k so far: 21
refining estimation around [15; 27], with a step of 2
running histogram estimation
Linear estimation: ~587 M distinct 24-mers are in the reads
K-mer sampling: 1/112
| processing                                                                                         |
[going to estimate histograms for values of k: 27 25 23 21 19 17 15
-----------------------------------------------------------------------------------------------------------------------Total time Wallclock  895.178 s
fitting model to histograms to estimate best k
table of predicted num. of genomic k-mers: trial_l15.dat
recommended coverage cut-off for best k: 14
best k: 15

-l 15

-l 29 TRIAL 2

./kmergenie /home/soba/Documentos/Ensambladores/mirdeep2/tutorial_dir/list_filesTY.fastq  -l 29   -o prueba5
running histogram estimation
Linear estimation: ~379 M distinct 51-mers are in the reads
K-mer sampling: 1/72
| processing                                                                                         |
[going to estimate histograms for values of k: 81 71 61 51 41 31
-----------------------------------------------------------------------------------------------------------------------Total time Wallclock  504.639 s
fitting model to histograms to estimate best k
estimation of the best k so far: 31
refining estimation around [29; 37], with a step of 2
running histogram estimation
Linear estimation: ~546 M distinct 29-mers are in the reads
K-mer sampling: 1/104
| processing                                                                                         |
[going to estimate histograms for values of k: 37 35 33 31 29
-----------------------------------------------------------------------------------------------------------------------Total time Wallclock  527.335 s
fitting model to histograms to estimate best k
table of predicted num. of genomic k-mers: prueba5.dat
recommended coverage cut-off for best k: 12
best k: 35
ADD COMMENT
0
Entering edit mode

Thanks for the update. What's the expected genome coverage of dataset list_filesTY.fastq? From the histograms, it seems to be low.

ADD REPLY
1
Entering edit mode

That is few. ND50 is around 250pb.

K      Contig      Total contig size     Maximum contig size     Minimum contig size     N50 contig size
15     3952639     104,552,854           330                     16                      28
25     1262498     94,115,378            2185                    26                      90
35     524980      50,968,604            2678                    36                      130
41     375,904     43,540,681            2,981                   42                      160
45     295977      38,110,854            2976                    46                      184
48     232480      33,078,324            3442                    50                      203
49     232480      33,078,324            3442                    50                      203
50     209809      30,945,542            2946                    52                      210
53     185046      28,493,028            3487                    54                      218
54     166,067     26,324,247            3,842                   56                      223
55     166067      26,324,247            3842                    56                      223
56     147,071     24,065,106            2,884                   58                      229
57     147,071     24,065,106            2,884                   58                      229
58     128370      21,715,485            2829                    60                      236
60     111855      19,418,668            2864                    62                      240
62     95520       17,110,692            3017                    64                      245
65     80295       14,833,407            3017                    66                      250
70     44929       8,831,428             2479                    72                      263
ADD REPLY
1
Entering edit mode

These are assembly stats (that also seem to indicate that the coverage is low).

You can compute the coverage of your reads dataset as follows: (number of reads * read length) / (genome size). It's independent of k.

ADD REPLY

Login before adding your answer.

Traffic: 2447 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6