Question

Converting fasta to gtf using reference genome

6.3 years ago

KI • 0

Hello,

I am trying to convert fasta file of the transcripts to gtf using a reference genome, but I am having some trouble. I specifically need gtf file of mRNA to run a program. The genome and the transcript data were downloaded from (https://parasite.wormbase.org/ftp.html). Following are the approaches I have taken so far:

fasta-to-gtf script (https://github.com/willblev/fasta-to-gtf)

I found a script that converts fasta files to gtf using a reference genome file, but I keep getting the syntax error below. I checked the script and could not identify or fix the syntax problem in line 3.

./fasta-to-gtf.py: line 3: syntax error near unexpected token `('
./fasta-to-gtf.py: line 3: `def usage():'

Alignment using hisat2

I tried to align the transcript file to the genome using hisat2. However, I'm not sure how to run hisat2 with the transcript file. I tried to use the transcript.fasta and genome.fasta as mate 1 and mate 2 with the index generated from annotation file, but I received the error below. The only fix I found was on bowtie2 github that suggests redownloading the latest version of the program, but I am already using the latest version of hisat2 (https://github.com/BenLangmead/bowtie2/issues/149).

Segmentation fault (core dumped)
(ERR): hisat2-align exited with value 139

Please let me know if you have any other suggestions on this conversion process and if you need any further information.

Thank you!

fasta gtf conversion • 4.5k views

updated 6.3 years ago by h.mon • written 6.3 years ago by KI

If you are not sure how to run HISAT2, this Nature Protocols paper can be a good read.

6.3 years ago by sangram_keshari

score 0 · Answer 1 · 2018-08-27

First of all, why are the GFF3 / GTF files provided at WormBase Parasite not suitable? Why do you need to recreate information which is already available?

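As for the programs you tried: the fasta-to-gtf.py script doesn't have a shebang line, so you can't make it executable and run it directly. Without a shebang, the shell tries to interpret the file as a shell script, which is why it stops with a syntax error at def usage(): on line 3. You have to call it with python fasta-to-gtf.py:

python fasta-to-gtf.py -h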

Outputs:

~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~fasta-to-gtf~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~
fasta-to-gtf.py -g <path/reference_genome.fasta> -a <path/assembly.fasta> -o <path/output_name.gtf>
This is a python script which produces GTF annotations using a transcriptome assembly and reference genome (in fasta format).
The script requires python 2.7 (or later) and the following python modules:
getopt, CGAT
To run this script, you must specify a reference genome, an assembly, and an output file name.
Use -v or --verbose to print intermediate status updates
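
If you want to check in advance whether a script can be run directly, a minimal sketch like the one below may help. It only tests whether a file starts with the two bytes "#!". The function names has_shebang and launch_command are hypothetical helpers made up for this illustration; they are not part of fasta-to-gtf.

```python
def has_shebang(path):
    """Return True if the file at `path` starts with the two bytes '#!'."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"#!"


def launch_command(path, interpreter="python"):
    """Build the argument list needed to run a script (hypothetical helper).

    With a shebang, the script can be executed directly (once it is marked
    executable). Without one, the interpreter has to be named explicitly.
    """
    if has_shebang(path):
        return [path]
    return [interpreter, path]
```

For a script without a shebang, launch_command("fasta-to-gtf.py") gives ["python", "fasta-to-gtf.py"], which corresponds to the call shown above.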

As for HISAT2:

I tried to use the transcript.fasta and genome.fasta as mate 1 and mate 2 with the index generated from annotation file,

It is not clear what you did here, but it seems to me you ran HISAT2 with -1 transcript.fasta -2 genome.fasta. Is my interpretation correct? If so, you are running it incorrectly. You have to build the index from genome.fasta (with hisat2-build), then align transcript.fasta as single-end unpaired reads with -U transcript.fasta, adding -f because the input is FASTA rather than FASTQ. Also, HISAT2 outputs a SAM alignment file, not GTF / GFF3, so you would need additional steps to arrive at a GTF / GFF3 file.
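
To give an idea of what those additional steps could look like, here is a rough sketch in Python that turns SAM alignment lines into GTF exon records by walking the CIGAR string. It is not what fasta-to-gtf does internally and not a HISAT2 feature. It assumes each transcript has one primary alignment, that the FASTA sequences are in sense orientation (so a reverse-strand alignment means a minus-strand transcript), and it uses the read name as both gene_id and transcript_id. The function names and the "hisat2" source label are made up for this example.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def sam_line_to_gtf(line, source="hisat2"):
    """Convert one SAM alignment line into a list of GTF exon lines.

    Unmapped (0x4), secondary (0x100) and supplementary (0x800)
    alignments are skipped and yield an empty list.
    """
    fields = line.rstrip("\n").split("\t")
    name, flag, chrom = fields[0], int(fields[1]), fields[2]
    pos, cigar = int(fields[3]), fields[5]
    if flag & 0x904 or cigar == "*":
        return []
    # Assumption: transcripts are given in sense orientation.
    strand = "-" if flag & 0x10 else "+"

    exons = []
    start = pos      # SAM POS is 1-based, like GTF coordinates
    end = pos - 1    # nothing on the reference consumed yet
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "MD=X":      # these operations consume the reference
            end += length
        elif op == "N":       # skipped region (intron): close the exon
            exons.append((start, end))
            start = end + length + 1
            end = start - 1
        # I, S, H and P do not consume the reference
    exons.append((start, end))

    attrs = 'gene_id "{0}"; transcript_id "{0}";'.format(name)
    return [
        "\t".join([chrom, source, "exon", str(s), str(e),
                   ".", strand, ".", attrs])
        for s, e in exons
    ]


def sam_to_gtf(lines, source="hisat2"):
    """Convert an iterable of SAM lines to GTF lines, skipping headers."""
    gtf = []
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue
        gtf.extend(sam_line_to_gtf(line, source))
    return gtf
```

With an open SAM file you would call sam_to_gtf(handle) and write the returned lines to a .gtf file. The result only contains exon features; it has no CDS, no gene-level grouping and no quality filtering, which is one more reason to prefer an existing curated annotation.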

But anyway, why don't you use the GTF / GFF3 provided at the WormBase Parasite site?