Dear all, I am a green hand in bioinformatics but need to use SOAPfuse to detect gene fusions of my data. I tried my best to follow the author's instruction (https://sourceforge.net/p/soapfuse/wiki/Construct_SOAPfuse_database/) to construct the database. However, when I tried to run:
perl SOAPfuse-S00-Generate_SOAPfuse_database.pl \
-wg raw/hg19.fa \
-gtf raw/Homo_sapiens.GRCh37.75.gtf.gz \
-cbd raw/cytoBand.txt.gz \
-gf raw/HGNC_Gene_Family_dataset \
-rft raw/HumanRef_refseg_symbols_relationship.list \
-sd /home/eagletang/SOAPfuse-v1.27 \
-dd ./dataset
##-wg from UCSC. I use cat *.fa > hg19.fa to combine all the chromosome files into one since I couldn't find the exact "hg19.fa" file.
##-gtf from ensembl
##-cbd http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz
##-gf http://www.genenames.org/cgi-bin/genefamilies/download-all/tsv
## -rft I wrote it in the textbook myself as instructed
It ran successfully at the beginning but soon I got the warning like this:
(ln -sf /home/eagletang/SOAPfuse-v1.27/source/dataset/transcript.fa /home/eagletang/SOAPfuse-v1.27/source/dataset/transcript_index_bwa/transcript.fa) && (/home/eagletang/SOAPfuse-v1.27/source/bin/aln_bin/bwa-0.5.9/bwa index /home/eagletang/SOAPfuse-v1.27/source/dataset/transcript_index_bwa/transcript.fa)
[bwa_index] Pack FASTA... [bns_fasta2bntseq] zero length sequence. Abort!
sh: line 1: 3053 Aborted (core dumped) ( /home/eagletang/SOAPfuse-v1.27/source/bin/aln_bin/bwa-0.5.9/bwa index /home/eagletang/SOAPfuse-v1.27/source/dataset/transcript_index_bwa/transcript.fa )
[bwa_index] Pack FASTA... [bns_fasta2bntseq] zero length sequence. Abort!
sh: line 1: 3059 Aborted (core dumped) ( /home/eagletang/SOAPfuse-v1.27/source/bin/aln_bin/bwa-0.5.9/bwa index /home/eagletang/SOAPfuse-v1.27/source/dataset/transcript_index_bwa/transcript.fa )
[bwa_index] Pack FASTA... [bns_fasta2bntseq] zero length sequence. Abort!
sh: line 1: 3065 Aborted (core dumped) ( /home/eagletang/SOAPfuse-v1.27/source/bin/aln_bin/bwa-0.5.9/bwa index /home/eagletang/SOAPfuse-v1.27/source/dataset/transcript_index_bwa/transcript.fa )
[bwa_index] Pack FASTA... [bns_fasta2bntseq] zero length sequence. Abort!
sh: line 1: 3071 Aborted (core dumped) ( /home/eagletang/SOAPfuse-v1.27/source/bin/aln_bin/bwa-0.5.9/bwa index /home/eagletang/SOAPfuse-v1.27/source/dataset/transcript_index_bwa/transcript.fa )
<Warning>: step transcript_seq_bwa_index failed.
check the command printed, and find what's wrong.
I have no idea what should I do.... Could somebody help please? Thank you!
Hello ChopinSZ,
Please use the formatting bar (especially the
code
option) to present your post better. I've tried to do it but you put command line in another command line, plus you added comments. All mixed up it is quite messy. Please write your comments outside your command line.Thank you!
Thank you for your advice, Bastien. How about it after I edited the format? By the way, the transcript.fa created by SOAPfuse is empty.
Have you made certain of that? You are using genome file from UCSC so it will have the
chr1
notation but the GTF file you obtained from Ensembl and it likely contains1
designation. I suggest that you get the GTF file from UCSC as well. Use the table browser for that.Thanks genomax. I have tried the UCSC version, but still didn't work. In my understanding, the "HumanRef_refseg_symbols_relationship.list" is used to specify the
1
in GTF is related tochr1
in hg19.fa. I do worry about if my gtf file is corresponding to the version of whole=genome reference. That's why I first used the old genome version instead of the new one (hg38). However, I have no idea how to check my gtf is correct (feeling quite helpless....). Here is the difference between UCSC version and ensembl one: UCSCensembl
So the UCSC version does not produce any result?
No, it got the same problem, though the number before aborted were different.
Can you try to manually run this command and see what you get?
It looks like your transcript.fa file may have empty fasta sequences (i.e. header only). You may need to remove those.
it shows:
The transcript.fa is empty, even without the header, just empty. What should I remove? Do you mean the header?
Do you think it was about the hardware I run the SOAPfuse? Because I ran it in the PC. Does it have to use a sever or something?
I don't know this specific software you are trying to use but it looks like the creation of a transcript.fa file (using the genome sequence and the GTF?) appears to be failing. Is the
-rft
file required? Can you try running without it?It is required. The soapfuse won't run without it. According to author's instruction, -rft is a list of refseg symbols relationship. It specifies the corresponding relationship of refseg symbols between GTF file and fa file, meaning "1" refers to "chr1". So far, I think the problem is the failure of building the transcript.fa. However I still couldn't figure out its reason. The wrong format/order of hg19.fa? The version of GTF doesn't match hg19.fa? The -rft file not corresponding properly? I just can't find any instruction or examples of them.
P.S. The homepage of soapfuse is: https://sourceforge.net/p/soapfuse/wiki/Home/
Could you give us the :
please
Your reference genome file start with chr10 ?
Result of :
Like I said in the comment after the command, I couldn't find the exact "hg19.fa". Instead, I downloaded the chromFa.fa and combined all the chromosome files into one file using
cat
. I guess that is why the hg19.fa started with chr10. Does the order matter? I will combine them in the right order and try it again.I combined the chromosome fa file in ascending order and ran it again but got the same results with the different number before aborted. In my shallow understanding, it seems it cannot create the right
transcript.fa
(it was empty every time I tried).