Question

SOAPfuse: problem in constructing database

0

Entering edit mode

5.8 years ago

ChopinSZ • 0

Dear all, I am a green hand in bioinformatics but need to use SOAPfuse to detect gene fusions of my data. I tried my best to follow the author's instruction (https://sourceforge.net/p/soapfuse/wiki/Construct_SOAPfuse_database/) to construct the database. However, when I tried to run:

perl SOAPfuse-S00-Generate_SOAPfuse_database.pl  \
-wg raw/hg19.fa \ 
-gtf raw/Homo_sapiens.GRCh37.75.gtf.gz \ 
-cbd raw/cytoBand.txt.gz \ 
-gf raw/HGNC_Gene_Family_dataset \ 
-rft raw/HumanRef_refseg_symbols_relationship.list \ 
-sd /home/eagletang/SOAPfuse-v1.27 \
-dd ./dataset

##-wg from UCSC. I use cat *.fa > hg19.fa to combine all the chromosome files into one since I couldn't find the exact "hg19.fa" file.
##-gtf from ensembl
##-cbd http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz
##-gf http://www.genenames.org/cgi-bin/genefamilies/download-all/tsv
## -rft I wrote it in the textbook myself as instructed

It ran successfully at the beginning but soon I got the warning like this:

(ln -sf /home/eagletang/SOAPfuse-v1.27/source/dataset/transcript.fa /home/eagletang/SOAPfuse-v1.27/source/dataset/transcript_index_bwa/transcript.fa) && (/home/eagletang/SOAPfuse-v1.27/source/bin/aln_bin/bwa-0.5.9/bwa index /home/eagletang/SOAPfuse-v1.27/source/dataset/transcript_index_bwa/transcript.fa)
[bwa_index] Pack FASTA... [bns_fasta2bntseq] zero length sequence. Abort!
sh: line 1:  3053 Aborted                 (core dumped) ( /home/eagletang/SOAPfuse-v1.27/source/bin/aln_bin/bwa-0.5.9/bwa index /home/eagletang/SOAPfuse-v1.27/source/dataset/transcript_index_bwa/transcript.fa )
[bwa_index] Pack FASTA... [bns_fasta2bntseq] zero length sequence. Abort!
sh: line 1:  3059 Aborted                 (core dumped) ( /home/eagletang/SOAPfuse-v1.27/source/bin/aln_bin/bwa-0.5.9/bwa index /home/eagletang/SOAPfuse-v1.27/source/dataset/transcript_index_bwa/transcript.fa )
[bwa_index] Pack FASTA... [bns_fasta2bntseq] zero length sequence. Abort!
sh: line 1:  3065 Aborted                 (core dumped) ( /home/eagletang/SOAPfuse-v1.27/source/bin/aln_bin/bwa-0.5.9/bwa index /home/eagletang/SOAPfuse-v1.27/source/dataset/transcript_index_bwa/transcript.fa )
[bwa_index] Pack FASTA... [bns_fasta2bntseq] zero length sequence. Abort!
sh: line 1:  3071 Aborted                 (core dumped) ( /home/eagletang/SOAPfuse-v1.27/source/bin/aln_bin/bwa-0.5.9/bwa index /home/eagletang/SOAPfuse-v1.27/source/dataset/transcript_index_bwa/transcript.fa )

<Warning>: step transcript_seq_bwa_index failed.
check the command printed, and find what's wrong.

I have no idea what should I do.... Could somebody help please? Thank you!

RNA-Seq SOAPfuse • 3.0k views

ADD COMMENT • link updated 5.4 years ago by simplitia ▴ 130 • written 5.8 years ago by ChopinSZ • 0

1

Entering edit mode

Hello ChopinSZ,

Please use the formatting bar (especially the code option) to present your post better. I've tried to do it but you put command line in another command line, plus you added comments. All mixed up it is quite messy. Please write your comments outside your command line.
code_formatting

Thank you!

ADD REPLY • link 5.8 years ago by Bastien Hervé 5.9k

0

Entering edit mode

Thank you for your advice, Bastien. How about it after I edited the format? By the way, the transcript.fa created by SOAPfuse is empty.

ADD REPLY • link 5.7 years ago by ChopinSZ • 0

1

Entering edit mode

Please make sure that the gtf file used is corresponding to the version of whole_genome reference

Have you made certain of that? You are using genome file from UCSC so it will have the chr1 notation but the GTF file you obtained from Ensembl and it likely contains 1 designation. I suggest that you get the GTF file from UCSC as well. Use the table browser for that.

ADD REPLY • link 5.7 years ago by GenoMax 147k

0

Entering edit mode

Thanks genomax. I have tried the UCSC version, but still didn't work. In my understanding, the "HumanRef_refseg_symbols_relationship.list" is used to specify the 1 in GTF is related to chr1 in hg19.fa. I do worry about if my gtf file is corresponding to the version of whole=genome reference. That's why I first used the old genome version instead of the new one (hg38). However, I have no idea how to check my gtf is correct (feeling quite helpless....). Here is the difference between UCSC version and ensembl one: UCSC

chr1    hg19_knownGene  exon    11874   12227   0.000000    +   .   gene_id "uc001aaa.3"; transcript_id "uc001aaa.3"; 
chr1    hg19_knownGene  exon    12613   12721   0.000000    +   .   gene_id "uc001aaa.3"; transcript_id "uc001aaa.3"; 
chr1    hg19_knownGene  exon    13221   14409   0.000000    +   .   gene_id "uc001aaa.3"; transcript_id "uc001aaa.3"; 
chr1    hg19_knownGene  exon    11874   12227   0.000000    +   .   gene_id "uc010nxr.1"; transcript_id "uc010nxr.1";

ensembl

#!genome-build GRCh38.p12
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.27
#!genebuild-last-updated 2018-07
1   havana  gene    11869   14409   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
1   havana  transcript  11869   14409   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
1   havana  exon    11869   12227   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";
1   havana  exon    12613   12721   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00003582793"; exon_version "1"; tag "basic"; transcript_support_level "1";

ADD REPLY • link 5.7 years ago by ChopinSZ • 0

1

Entering edit mode

So the UCSC version does not produce any result?

ADD REPLY • link 5.7 years ago by GenoMax 147k

0

Entering edit mode

No, it got the same problem, though the number before aborted were different.

[bwa_index] Pack FASTA... [bns_fasta2bntseq] zero length sequence. Abort!
sh: line 1:  2998 Aborted                 (core dumped) ( /home/eagletang/SOAPfuse-v1.27/source/bin/aln_bin/bwa-0.5.9/bwa index /home/eagletang/SOAPfuse-v1.27/source/dataset/transcript_index_bwa/transcript.fa )
[bwa_index] Pack FASTA... [bns_fasta2bntseq] zero length sequence. Abort!
sh: line 1:  3004 Aborted                 (core dumped) ( /home/eagletang/SOAPfuse-v1.27/source/bin/aln_bin/bwa-0.5.9/bwa index /home/eagletang/SOAPfuse-v1.27/source/dataset/transcript_index_bwa/transcript.fa )
[bwa_index] Pack FASTA... [bns_fasta2bntseq] zero length sequence. Abort!
sh: line 1:  3021 Aborted                 (core dumped) ( /home/eagletang/SOAPfuse-v1.27/source/bin/aln_bin/bwa-0.5.9/bwa index /home/eagletang/SOAPfuse-v1.27/source/dataset/transcript_index_bwa/transcript.fa )
[bwa_index] Pack FASTA... [bns_fasta2bntseq] zero length sequence. Abort!
sh: line 1:  3028 Aborted                 (core dumped) ( /home/eagletang/SOAPfuse-v1.27/source/bin/aln_bin/bwa-0.5.9/bwa index /home/eagletang/SOAPfuse-v1.27/source/dataset/transcript_index_bwa/transcript.fa )

<Warning>: step transcript_seq_bwa_index failed.
check the command printed, and find what's wrong.

ADD REPLY • link 5.7 years ago by ChopinSZ • 0

1

Entering edit mode

Can you try to manually run this command and see what you get?

/home/eagletang/SOAPfuse-v1.27/source/bin/aln_bin/bwa-0.5.9/bwa index /home/eagletang/SOAPfuse-v1.27/source/dataset/transcript_index_bwa/transcript.fa

It looks like your transcript.fa file may have empty fasta sequences (i.e. header only). You may need to remove those.

ADD REPLY • link 5.7 years ago by GenoMax 147k

0

Entering edit mode

it shows:

 [bwa_index] Pack FASTA... [bns_fasta2bntseq] zero length sequence. Abort!
    Aborted (core dumped)

The transcript.fa is empty, even without the header, just empty. What should I remove? Do you mean the header?

ADD REPLY • link 5.7 years ago by ChopinSZ • 0

0

Entering edit mode

Do you think it was about the hardware I run the SOAPfuse? Because I ran it in the PC. Does it have to use a sever or something?

ADD REPLY • link 5.7 years ago by ChopinSZ • 0

1

Entering edit mode

I don't know this specific software you are trying to use but it looks like the creation of a transcript.fa file (using the genome sequence and the GTF?) appears to be failing. Is the -rft file required? Can you try running without it?

ADD REPLY • link 5.7 years ago by GenoMax 147k

0

Entering edit mode

It is required. The soapfuse won't run without it. According to author's instruction, -rft is a list of refseg symbols relationship. It specifies the corresponding relationship of refseg symbols between GTF file and fa file, meaning "1" refers to "chr1". So far, I think the problem is the failure of building the transcript.fa. However I still couldn't figure out its reason. The wrong format/order of hg19.fa? The version of GTF doesn't match hg19.fa? The -rft file not corresponding properly? I just can't find any instruction or examples of them.

P.S. The homepage of soapfuse is: https://sourceforge.net/p/soapfuse/wiki/Home/

ADD REPLY • link 5.7 years ago by ChopinSZ • 0

1

Entering edit mode

Could you give us the :

head -10 raw/hg19.fa

please

ADD REPLY • link 5.7 years ago by Bastien Hervé 5.9k

0

Entering edit mode

>chr10
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

ADD REPLY • link 5.7 years ago by ChopinSZ • 0

1

Entering edit mode

Your reference genome file start with chr10 ?

Result of :

grep ">" raw/hg19.fa

ADD REPLY • link 5.7 years ago by Bastien Hervé 5.9k

0

Entering edit mode

>chr10
>chr11
>chr11_gl000202_random
>chr12
>chr13
>chr14
>chr15
>chr16
>chr17_ctg5_hap1
>chr17
>chr17_gl000203_random
>chr17_gl000204_random
>chr17_gl000205_random
>chr17_gl000206_random
>chr18
>chr18_gl000207_random
>chr19
>chr19_gl000208_random
>chr19_gl000209_random
>chr1
>chr1_gl000191_random
>chr1_gl000192_random
>chr20
>chr21
>chr21_gl000210_random
>chr22
>chr2
>chr3
>chr4_ctg9_hap1
>chr4
>chr4_gl000193_random
>chr4_gl000194_random
>chr5
>chr6_apd_hap1
>chr6_cox_hap2
>chr6_dbb_hap3
>chr6
>chr6_mann_hap4
>chr6_mcf_hap5
>chr6_qbl_hap6
>chr6_ssto_hap7
>chr7
>chr7_gl000195_random
>chr8
>chr8_gl000196_random
>chr8_gl000197_random
>chr9
>chr9_gl000198_random
>chr9_gl000199_random
>chr9_gl000200_random
>chr9_gl000201_random
>chrM
>chrUn_gl000211
>chrUn_gl000212
>chrUn_gl000213
>chrUn_gl000214
>chrUn_gl000215
>chrUn_gl000216
>chrUn_gl000217
>chrUn_gl000218
>chrUn_gl000219
>chrUn_gl000220
>chrUn_gl000221
>chrUn_gl000222
>chrUn_gl000223
>chrUn_gl000224
>chrUn_gl000225
>chrUn_gl000226
>chrUn_gl000227
>chrUn_gl000228
>chrUn_gl000229
>chrUn_gl000230
>chrUn_gl000231
>chrUn_gl000232
>chrUn_gl000233
>chrUn_gl000234
>chrUn_gl000235
>chrUn_gl000236
>chrUn_gl000237
>chrUn_gl000238
>chrUn_gl000239
>chrUn_gl000240
>chrUn_gl000241
>chrUn_gl000242
>chrUn_gl000243
>chrUn_gl000244
>chrUn_gl000245
>chrUn_gl000246
>chrUn_gl000247
>chrUn_gl000248
>chrUn_gl000249
>chrX
>chrY

Like I said in the comment after the command, I couldn't find the exact "hg19.fa". Instead, I downloaded the chromFa.fa and combined all the chromosome files into one file using cat. I guess that is why the hg19.fa started with chr10. Does the order matter? I will combine them in the right order and try it again.

ADD REPLY • link 5.7 years ago by ChopinSZ • 0

0

Entering edit mode

I combined the chromosome fa file in ascending order and ran it again but got the same results with the different number before aborted. In my shallow understanding, it seems it cannot create the right transcript.fa (it was empty every time I tried).

ADD REPLY • link 5.7 years ago by ChopinSZ • 0

score 0 · Answer 1 · 2019-06-24

Hi so I had the same issue and here is what I did to fix it.
1. You need to make sure that the fasta is indeed prefix with chr which you did above by checking. I mentioned this in case some else downloaded from ensembl which does not. You can easily check this by grepping the fasta,

grep ">" your.fa

if it is not prefix with chr then do this:

sed -e 's/>/>chr/' your.fa > new.fa - also make sure the MT is nomenclature is changed as well.

ok for ucsc don't concatenate, but download the a single soft mask fa directly here. http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
Last and this is very important make sure that the havana donwload is the most updated. I made a mistake earlier by not pointing to the correct file.

put this together and it should work.

A