DiscoSnp Segmentation fault
0
0
Entering edit mode
9.3 years ago
Hans ▴ 140

Hello

I am trying for the first time to use discoSnp for snp discovery and calling. I have de-multiplexed data of 100 samples from a GBS project. In this project, to reduce complexity, DNA was cut with a restriction enzyme and only reads starting with the right cut site are included. The read files in fq.gz format are listed in the gzlist file with one file per line. The pipeline is run on a linux (Ubuntu 14.04) severer with 16 cores and 128 GB of RAM. This the the stdout I get:

hanan@icci:~/stacks/samp$ ~/disco/run_discoSnp++.sh -r gzlist -T -u 12
use read set: gzlist
use at most 12 cores
Binaries in /storage/hanan/disco/build/
    Running discoSnp++ 2.2.1, in directory /storage/hanan/stacks/samp with following parameters:
         read_sets=gzlist
         prefix=discoRes_k_31_c_auto
         c=auto
         C=2147483647
         k=31
         b=0
         d=1
         D=100
     starting date=Wed Aug 26 12:51:36 IDT 2015

    ############################################################
    #################### GRAPH CREATION  #######################
    ############################################################
/storage/hanan/disco/build//ext/gatb-core/bin/dbgh5 -in gzlist_removemeplease -out discoRes_k_31_c_auto -kmer-size 31 -abundance-min auto -abundance-max 2147483647 -solidity-kind one -nb-cores 12
/storage/hanan/disco/run_discoSnp++.sh: line 326: 26144 Segmentation fault      $DISCO_BUILD_PATH/ext/gatb-core/bin/dbgh5 -in ${read_sets}_removemeplease -out $h5prefix -kmer-size $k -abundance-min $c_dbgh5 -abundance-max $C -solidity-kind one $option_cores_gatb
there was a problem with graph construction

Please help

Thank you
Hanan

snp discosnp • 4.6k views
ADD COMMENT
2
Entering edit mode

Hi Hanan,

As the program crashes early (before any message is displayed), one may suspect an input file format problem. Would you mind double-checking the gzlist file of files (and/or send it to me) and verifying that each pointed fastq/fasta file really exists?

ADD REPLY
2
Entering edit mode

Hello Pierre

Thank you for the response. The problem was that some of the .fq.gz files were empty, so the program crashed. Deleting these files from the file of files list solved the problem. Please send me the new version if you think it is safe to use. If not, I will wait.

Thank you

Hanan

ADD REPLY
0
Entering edit mode

Just tried the same command on a subset of 20 samples and so far so good.

ADD REPLY
0
Entering edit mode

Hi

When I used the branching option -b 0 with 20 samples, I got only ~4400 SNP. When I used -b 1, the memory usage went up to maximum which is 128GB in the KISSREADS MODULE and after a while the program was killed. I should mention that the row data comes from a bit more than one lane of Hiseq200.

ADD REPLY
0
Entering edit mode

Hi Hanan,

I start to answer the easiest question: this memory problem in kissreads is fixed in an incoming new version that will be released this week or the next one. I may send this new version to you if you desire, but it is not 100% tested yet.

ADD REPLY
0
Entering edit mode

hello Pierre, is the new version available yet? thank you, Hanan

ADD REPLY
0
Entering edit mode

Hi.

It's ready. I'm waiting a few test results. Depending on their results, the new version will be available this week.

Pierre

ADD REPLY
0
Entering edit mode

Hi,

Thanks for this feed back, we should improve the output messages in this case.

Pierre

ADD REPLY
0
Entering edit mode

Hi All,

DiscoSnp++ 2.2.1 was released.

You may find it from here.

Pierre

ADD REPLY
0
Entering edit mode

Hi I am sorry but running the new version did not solve the memory problem. It have reached the maximum of my RAM and then crashed after an hour or so. I am using a 16 core Ubuntu 14.04 server with 128GB RAM the raw data is from 1.5 lanes of hiseq2000. I do no know if this should be a problem, but this is a GBS data where the genome is cut by restriction enzyme and all the reads are restricted to the cut sites and do not cover the whole genome. I have used this command:

run_discoSnp++.sh -r gzlist -u 14 -b 1

And this is the final output I got

/storage/hanan/disco/build//tools/kissreads2/kissreads2 -predictions discoRes_k_31_c_auto_D_100_P_1_b_1.fa -reads  gzlist -co discoRes_k_31_c_auto_D_100_P_1_b_1\_coherent -unco discoRes_k_31_c_auto_D_100_P_1_b_1\_uncoherent -k 31 -size_seeds 26 -index_stride 4 -hamming 1  -genotype -nb-cores 14
Indexing bank discoRes_k_31_c_auto_D_100_P_1_b_1.fa -- 95 read set(s) 924970473 reads
[Mapping read set 0                      ]  120  %   elapsed:  90 min 27 sec   remaining:   0 min 0  sec   cpu: 1082.6 %   mem: [106835, 106835, 106843] MB
[Mapping read set 1                      ]  110  %   elapsed: 147 min 44 sec   remaining:   0 min 0  sec   cpu: 894.1 %   mem: [126197, 126901, 126950] MB /storage/hanan/disco/run_discoSnp++.sh: line 374: 16003 Killed                  $DISCO_BUILD_PATH/tools/kissreads2/kissreads2 -predictions $kissprefix.fa -reads $read_sets -co $kissprefix\_coherent -unco $kissprefix\_uncoherent -k $k -size_seeds $smallk -index_stride $i -hamming $d $genotyping $option_cores_gatb
there was a problem with kissreads2:

Thank you
Hanan

ADD REPLY
0
Entering edit mode

Hi Hanan

Could you indicate the number of predictions in the discoRes_k_31_c_auto_D_100_P_1_b_1.fa file?

Pierre

ADD REPLY
0
Entering edit mode

Hi Pierre

discoRes_k_31_c_auto_D_100_P_1_b_1.fa have 144696 sequences.

Thank you
Hanan

ADD REPLY
0
Entering edit mode

This memory consumption is crazy. Are you 100% sure, the kissreads version comes from the 2.2.1 release? (your bug looks like a bug fixed a few weeks ago in kissreads).

If this is the case, could you try to run the same command limiting the number of cores to 2 (-nb-cores 2)?

ADD REPLY
0
Entering edit mode

Hello Pierre

Now the software version 2.2.1 runs without memory problems

but I get this output at the end:

/storage/hanan/disco/run_VCF_creator.sh -p discoRes_k_31_c_auto_D_100_P_1_b_1\_coherent.fa -o discoRes_k_31_c_auto_D_100_P_1_b_1\_coherent.vcf
        ##use disco SNPS : discoRes_k_31_c_auto_D_100_P_1_b_1_coherent.fa
        ##output : discoRes_k_31_c_auto_D_100_P_1_b_1_coherent.vcf
...Ghost mode...
...Creation of a vcf without alignment...
Traceback (most recent call last):
  File "/storage/hanan/disco/tools/VCF_creator.py", line 155, in <module>
    if variant_object.CheckCoupleVariantID()==1:
  File "/storage/hanan/disco/tools/ClassVCF_creator.py", line 335, in CheckCoupleVariantID
    bitwiseFlag=int(self.upper_path.listSam[1])
IndexError: list index out of range
... Creation of the vcf file : done ...==> discoRes_k_31_c_auto_D_100_P_1_b_1_coherent.vcf
Vcf creation time in seconds: 0
        ###############################################################
        #################### DISCOSNP++ FINISHED ######################
        ###############################################################
DiscoSnp++ total time in seconds: 6152
        ################################################################################################################
         fasta of predicted variant is "discoRes_k_31_c_auto_D_100_P_1_b_1_coherent.fa"
         Ghost VCF file (1-based) is "discoRes_k_31_c_auto_D_100_P_1_b_1_coherent.vcf"
         Thanks for using discoSnp++ - http://colibread.inria.fr/discoSnp/ - Forum: http://www.biostars.org/t/discoSnp/

The size of the output files is like this:

-rw-r--r-- 1 hanan hanan  363937072 Nov 17 10:28 discoRes_k_31_c_auto.h5
-rw-r--r-- 1 hanan hanan      36264 Nov 17 10:31 discoRes_k_31_c_auto_cov.h5
-rw-r--r-- 1 hanan hanan   84984535 Nov 17 11:40 discoRes_k_31_c_auto_D_100_P_1_b_1_coherent.fa
-rw-r--r-- 1 hanan hanan    1298719 Nov 17 11:40 discoRes_k_31_c_auto_D_100_P_1_b_1_uncoherent.fa
-rw-r--r-- 1 hanan hanan       1413 Nov 17 11:40 discoRes_k_31_c_auto_D_100_P_1_b_1_coherent.vcf

and the vcf file contains only the headers:

##fileformat=VCFv4.1
##filedate=20151117
##source=VCF_creator
##SAMPLE=file://discoRes_k_31_c_auto_D_100_P_1_b_1_coherent.fa
##REF=<ID=REF,Number=1,Type=String,Description="Allele of the path Disco aligned with the least mismatches">
##FILTER=<ID=MULTIPLE,Description="Mapping type : PASS or MULTIPLE or .">
##INFO=<ID=Ty,Number=1,Type=String,Description="SNP, INS, DEL or .">
##INFO=<ID=Rk,Number=1,Type=Float,Description="SNP rank">
##INFO=<ID=UL,Number=1,Type=Integer,Description="length of the unitig left">
##INFO=<ID=UR,Number=1,Type=Integer,Description="length of the unitig right">
##INFO=<ID=CL,Number=1,Type=Integer,Description="length of the contig left">
##INFO=<ID=CR,Number=1,Type=Integer,Description="length of the contig right">
##INFO=<ID=Genome,Number=1,Type=String,Description="Allele of the reference;for indel reference is "." ">
##INFO=<ID=Sd,Number=1,Type=Integer,Description="Reverse (-1) or Forward (1) Alignement">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Cumulated depth accross samples (sum)">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled Genotype Likelihoods">
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Depth of each allele by sample">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  G1      G2      G3      G4      G5      G6      G7      G8      G9      G10     G11     G12     G13     G14     G15     G16     G17
     G18     G19     G20     G21     G22     G23     G24     G25     G26     G27     G28

Thank you for your help and hope for more peaceful times

Hanan

ADD REPLY
0
Entering edit mode

Hi,

Sorry for this bug and thank you for your message and its peaceful signature, we need this...

The bug is fixed in a new release (2.2.2). This release should be quickly made publicly available. We still need to fix an issue when compiling with mac.

If you can't wait for a few days just tell me I'll send you the new version personally so you can use the new VCF_creator.sh

Best regards,
Pierre

ADD REPLY
0
Entering edit mode

Hi Again,

The new release 2.2.3 which correct this issue is online; http://colibread.inria.fr/software/discosnp/

Best regards,
Pierre

ADD REPLY
0
Entering edit mode

Hello Pierre

Running 2.2.3 I got up to here:

        #############################################################
        #################### KISSREADS MODULE #######################
        #############################################################
/storage/hanan/disco/run_discoSnp++.sh: line 398: ((: >31 : syntax error: operand expected (error token is ">31 ")
/storage/hanan/disco/build//tools/kissreads2/kissreads2 -predictions discoRes_k_31_c_auto_D_100_P_1_b_1.fa -reads gzlist -co discoRes_k_31_c_auto_D_100_P_1_b_1_coherent -unco discoRes_k_31_c_auto_D_100_P_1_b_1_uncoherent -k 31 -size_seeds -5 -index_stride 6 -hamming 1 -genotype -coverage_file discoRes_k_31_c_auto_cov.h5 -nb-cores 14
Indexing bank discoRes_k_31_c_auto_D_100_P_1_b_1.fa
Mapping of 657398279 reads
[Mapping read set 0                      ]  0    %   elapsed:   0 min 0  sec   remaining:   0 min 0  sec   cpu:  -1.0 %   mem: [ 210,  210,  210] MB

Thank you
Hanan

ADD REPLY
0
Entering edit mode

Arg...

Thanks. Indeed there is a problem with the size of the seed (5). I'll check this as soon as possible.

Pierre

ADD REPLY
0
Entering edit mode

Hi,

Indeed, there is a small bug in the run_disco script.

You may either:

  • Add line 397 of run_discoSnp++.sh

    smallk=$k
    
  • Use version 2.2.4 which fixes this bug.

If you re-run discoSnp, don't hesitate to use the -g option, in order to avoid to recompute the graph.

Thanks again for warning us about this issue.

Pierre

ADD REPLY
1
Entering edit mode

Using version 2.2.4 with u=2 b=1 , everything is OK . I think it will work also with u=14. Thank you, Hanan

ADD REPLY

Login before adding your answer.

Traffic: 2172 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6