9.4 years ago
Hans
Hello
I am trying for the first time to use discoSnp for SNP discovery and calling. I have de-multiplexed data of 100 samples from a GBS project. In this project, to reduce complexity, DNA was cut with a restriction enzyme and only reads starting with the right cut site are included. The read files in fq.gz format are listed in the gzlist file with one file per line. The pipeline is run on a Linux (Ubuntu 14.04) server with 16 cores and 128 GB of RAM. This is the stdout I get:
hanan@icci:~/stacks/samp$ ~/disco/run_discoSnp++.sh -r gzlist -T -u 12
use read set: gzlist
use at most 12 cores
Binaries in /storage/hanan/disco/build/
Running discoSnp++ 2.2.1, in directory /storage/hanan/stacks/samp with following parameters:
read_sets=gzlist
prefix=discoRes_k_31_c_auto
c=auto
C=2147483647
k=31
b=0
d=1
D=100
starting date=Wed Aug 26 12:51:36 IDT 2015
############################################################
#################### GRAPH CREATION #######################
############################################################
/storage/hanan/disco/build//ext/gatb-core/bin/dbgh5 -in gzlist_removemeplease -out discoRes_k_31_c_auto -kmer-size 31 -abundance-min auto -abundance-max 2147483647 -solidity-kind one -nb-cores 12
/storage/hanan/disco/run_discoSnp++.sh: line 326: 26144 Segmentation fault $DISCO_BUILD_PATH/ext/gatb-core/bin/dbgh5 -in ${read_sets}_removemeplease -out $h5prefix -kmer-size $k -abundance-min $c_dbgh5 -abundance-max $C -solidity-kind one $option_cores_gatb
there was a problem with graph construction
Please help
Thank you
Hanan
Hi Hanan,
As the program crashes early (before any message is displayed), one may suspect an input file format problem. Would you mind double-checking the gzlist file of files (and/or sending it to me) and verifying that each fastq/fasta file it lists really exists?
Hello Pierre
Thank you for the response. The problem was that some of the .fq.gz files were empty, so the program crashed. Deleting these files from the file of files list solved the problem. Please send me the new version if you think it is safe to use. If not, I will wait.
Thank you
Hanan
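The check Pierre suggests can be automated before launching the pipeline. Below is a minimal sketch in Python (not part of discoSnp) that reads a file of files such as gzlist and reports entries that are missing, zero-byte, or decompress to nothing. The function names are hypothetical, and the sketch assumes that the paths in the list resolve from the current working directory.

```python
import gzip
import os


def has_data(path):
    """Return True if the (optionally gzipped) file yields at least one byte."""
    opener = gzip.open if path.endswith(".gz") else open
    try:
        with opener(path, "rb") as handle:
            return handle.read(1) != b""
    except (OSError, EOFError):
        # Corrupt or truncated gzip archives are treated as unusable.
        return False


def find_problem_files(fof_path):
    """Map each problematic entry of a file of files to a short reason."""
    problems = {}
    with open(fof_path) as fof:
        for line in fof:
            path = line.strip()
            if not path:
                continue  # skip blank lines
            if not os.path.isfile(path):
                problems[path] = "missing"
            elif not has_data(path):
                problems[path] = "empty"
    return problems
```

Calling find_problem_files("gzlist") returns a dictionary of the entries to remove from the list; an empty dictionary means every listed file exists and contains data.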
Just tried the same command on a subset of 20 samples and so far so good.
Hi
When I used the branching option -b 0 with 20 samples, I got only ~4400 SNPs. When I used -b 1, memory usage climbed to the maximum of 128 GB in the KISSREADS MODULE, and after a while the program was killed. I should mention that the raw data comes from a bit more than one lane of HiSeq 2000.
Hi Hanan,
I'll start with the easiest question: this memory problem in kissreads is fixed in an upcoming version that will be released this week or next. I can send you this new version if you wish, but it is not 100% tested yet.
hello Pierre, is the new version available yet? thank you, Hanan
Hi.
It's ready. I'm waiting for a few test results. Depending on the outcome, the new version will be available this week.
Pierre
Hi,
Thanks for this feedback; we should improve the output messages in this case.
Pierre
Hi All,
DiscoSnp++ 2.2.1 was released.
You may find it from here.
Pierre
Hi, I am sorry, but running the new version did not solve the memory problem. It reached the maximum of my RAM and then crashed after an hour or so. I am using a 16-core Ubuntu 14.04 server with 128 GB RAM; the raw data is from 1.5 lanes of HiSeq 2000. I do not know if this should be a problem, but this is GBS data, where the genome is cut by a restriction enzyme and all the reads are restricted to the cut sites and do not cover the whole genome. I used this command:
And this is the final output I got
Thank you
Hanan
Hi Hanan
Could you indicate the number of predictions in the discoRes_k_31_c_auto_D_100_P_1_b_1.fa file?
Pierre
Hi Pierre
discoRes_k_31_c_auto_D_100_P_1_b_1.fa has 144696 sequences.
Thank you
Hanan
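For reference, a count like the one above can be obtained by counting header lines. This is a minimal sketch in Python, assuming the .fa output is standard FASTA with one header line per sequence beginning with ">"; count_fasta_records is a hypothetical helper, not a discoSnp tool.

```python
import gzip


def count_fasta_records(path):
    """Count FASTA header lines (those starting with '>') in a file."""
    # Transparently handle gzip-compressed input as well as plain text.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Calling count_fasta_records("discoRes_k_31_c_auto_D_100_P_1_b_1.fa") returns the number of sequences in that file.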
This memory consumption is crazy. Are you 100% sure that the kissreads version comes from the 2.2.1 release? (Your bug looks like one that was fixed a few weeks ago in kissreads.)
If this is the case, could you try to run the same command limiting the number of cores to 2 (-nb-cores 2)?
Hello Pierre
Now the software version 2.2.1 runs without memory problems
but I get this output at the end:
The size of the output files is like this:
and the vcf file contains only the headers:
Thank you for your help and hope for more peaceful times
Hanan
Hi,
Sorry for this bug and thank you for your message and its peaceful signature, we need this...
The bug is fixed in a new release (2.2.2). This release should be made publicly available soon. We still need to fix an issue when compiling on Mac.
If you can't wait a few days, just tell me and I'll send you the new version personally so you can use the new VCF_creator.sh
Best regards,
Pierre
Hi Again,
The new release 2.2.3, which corrects this issue, is online: http://colibread.inria.fr/software/discosnp/
Best regards,
Pierre
Hello Pierre
Running 2.2.3 I got up to here:
Thank you
Hanan
Arg...
Thanks. Indeed there is a problem with the size of the seed (5). I'll check this as soon as possible.
Pierre
Hi,
Indeed, there is a small bug in the run_disco script. You may either:
Add line 397 of run_discoSnp++.sh
Use version 2.2.4 which fixes this bug.
If you re-run discoSnp, don't hesitate to use the -g option to avoid recomputing the graph.
Thanks again for warning us about this issue.
Pierre
Using version 2.2.4 with u=2 b=1, everything is OK. I think it will also work with u=14. Thank you, Hanan