Question

Problem trying to convert from .gtf to .fasta file

0

4.7 years ago

nattzy94 ▴ 60

I have a master.list.gtf (generated by Cufflinks from RNA-seq data) that I wish to convert to .fasta. So far I have tried the gffread utility that ships with Cufflinks and the getfasta command from bedtools:

# gffread command
./gffread path/to/master.list.gtf -g /path/to/GRCh38.p13.fna -w ./master.list.fasta

# bedtools command
bedtools getfasta -fi path/to/GRCh38.p13.fna -bed /path/to/master.list.gtf

However, when I run these commands I get this warning:

WARNING. chromosome (chr1) was not found in the FASTA file. Skipping.

Presumably this is because the chromosome IDs in the .fna file and .gtf don't match up. The .fasta reference file I am using begins like this:

>CM000663.2 Homo sapiens chromosome 1, GRCh38 reference primary assembly
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

whereas my gtf file is formatted like so:

#!genome-build GRCh38.p2
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.17
#!genebuild-last-updated 2015-01
chr1    havana  gene    11869   14409   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
chr1    havana  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";

How can I edit my gtf file so that its chromosome names match the ones in the reference?
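One way to do the renaming asked about here, as a minimal sketch rather than a tested pipeline: rewrite column 1 of the GTF from a two-column, tab-separated mapping file (old name, new name). The file names `chrom_map.tsv` and `demo.gtf` and the function name `rename_gtf_chroms` are made up for illustration; the only real mapping used is chr1 to CM000663.2, taken from the two headers shown above. A full table would have to come from somewhere else, for example the NCBI assembly report for the assembly in use.

```shell
set -eu

rename_gtf_chroms() {
    # $1 = mapping file (old_name<TAB>new_name), $2 = input GTF.
    # The renamed GTF is written to stdout.
    awk 'BEGIN { FS = OFS = "\t" }
         NR == FNR   { map[$1] = $2; next }  # first file: load the mapping
         /^#/        { print; next }         # leave header/comment lines alone
         ($1 in map) { $1 = map[$1] }        # rename only when a mapping exists
         { print }' "$1" "$2"
}

# Tiny demo with made-up files (hypothetical names and paths)
workdir=$(mktemp -d)
printf 'chr1\tCM000663.2\n' > "$workdir/chrom_map.tsv"
printf '#!genome-build GRCh38.p2\nchr1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972";\n' > "$workdir/demo.gtf"

rename_gtf_chroms "$workdir/chrom_map.tsv" "$workdir/demo.gtf" > "$workdir/demo.renamed.gtf"
cat "$workdir/demo.renamed.gtf"
```

Names without an entry in the mapping file pass through unchanged, so it is worth checking the output for leftover `chr` names before running gffread. The sketch also assumes the mapping file is not empty.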

RNA-Seq bash • 2.4k views

ADD COMMENT • link updated 4.7 years ago by saadbadday ▴ 10 • written 4.7 years ago by nattzy94 ▴ 60

1

Or you could get the reference from Ensembl so it matches the GTF?

ADD REPLY • link 4.7 years ago by GenoMax 147k

0

I managed to get the reference genome that was used to generate the gtf file and converted the gtf to fasta successfully. Thanks!

ADD REPLY • link 4.7 years ago by nattzy94 ▴ 60

0

Please paste the exact files that you have used (for other users) - thanks. I will then move this to an answer. Please also paste the commands that you used - again, thanks.

ADD REPLY • link 4.7 years ago by Kevin Blighe 88k

1

I managed to get the ref genome from the graduate student who generated the gtf file. However I believe you can also download the fasta reference (release 79) from here: ftp://ftp.ensembl.org/pub/release-79/fasta/homo_sapiens/dna. I had problems connecting to the ftp server due to issues with my school wifi.

The command I used was ./gffread path/to/master.list.gtf -g /path/to/GRCh38.p13.fna -w ./master.list.fasta

An index of the reference genome will be made if there isn't one already.

ADD REPLY • link 4.7 years ago by nattzy94 ▴ 60

0

Why don't you simply download the matching fasta file for your GTF, which should be this one from GENCODE:

ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_22/GRCh38.primary_assembly.genome.fa.gz
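Whichever reference gets downloaded, it may be worth confirming that it really uses the same sequence names as the GTF before running gffread or bedtools. A minimal sketch, assuming plain (uncompressed) files; the function name `missing_from_fasta` and the demo file names are hypothetical:

```shell
set -eu

missing_from_fasta() {
    # $1 = FASTA, $2 = GTF.
    # Prints each GTF sequence name (column 1) that has no FASTA record.
    awk 'NR == FNR && /^>/ { seen[substr($1, 2)] = 1 }  # name = text after ">" up to first space
         NR == FNR         { next }                     # skip sequence lines of the FASTA
         /^#/ || NF == 0   { next }                     # skip GTF comments and blank lines
         !($1 in seen) && !reported[$1]++ { print $1 }' "$1" "$2"
}

# Tiny demo reproducing the mismatch from the question (made-up files)
workdir=$(mktemp -d)
printf '>CM000663.2 Homo sapiens chromosome 1, GRCh38 reference primary assembly\nNNNNNNNN\n' > "$workdir/demo.fna"
printf '#!genome-build GRCh38.p2\nchr1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972";\n' > "$workdir/demo.gtf"

missing_from_fasta "$workdir/demo.fna" "$workdir/demo.gtf"
```

An empty result means every name in the GTF has a matching record in the FASTA. On a full genome this scans the whole FASTA, so pre-filtering the headers with `grep '^>'` into a small file first would likely be faster.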

ADD REPLY • link 4.7 years ago by ATpoint 85k

0

The .gtf file was previously generated by a PhD student running Cufflinks on fasta files of RNAseq data from my lab. He has told me that the .gtf contains novel transcripts (transcripts with no 'transcript biotype' annotation in Ensembl). I assume that the fasta sequences for these novel transcripts cannot be found in the matching fasta file on GENCODE?
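On the novel-transcript point: as far as the gffread usage text describes it, `-w` writes the spliced exon sequence of each transcript, cut out of the genome at the coordinates given in the GTF. If so, a novel transcript needs no entry of its own in the genome FASTA, only a matching chromosome. The toy sketch below imitates that idea in awk for plus-strand exons listed in ascending order. It holds the whole genome in memory, so it is for illustration on tiny inputs only and is not a substitute for gffread. The function name `spliced_plus_strand`, the file names and the records are made up.

```shell
set -eu

spliced_plus_strand() {
    # $1 = FASTA, $2 = GTF.
    # Prints one FASTA record per transcript_id, built from its "+" strand exons.
    awk 'NR == FNR && /^>/ { chrom = substr($1, 2); next }            # new FASTA record
         NR == FNR         { genome[chrom] = genome[chrom] $0; next } # append sequence line
         /^#/              { next }                                   # skip GTF comments
         { n = split($0, f, "\t") }                                   # GTF columns are tab-separated
         n >= 9 && f[3] == "exon" && f[7] == "+" && (f[1] in genome) &&
           match(f[9], /transcript_id "[^"]+"/) {
             id = substr(f[9], RSTART + 15, RLENGTH - 16)             # text between the quotes
             if (!(id in tx)) order[++count] = id                     # remember first-seen order
             tx[id] = tx[id] substr(genome[f[1]], f[4], f[5] - f[4] + 1)
         }
         END { for (i = 1; i <= count; i++) printf ">%s\n%s\n", order[i], tx[order[i]] }' "$1" "$2"
}

# Tiny demo: one made-up transcript with two exons (1-3 and 7-9)
workdir=$(mktemp -d)
printf '>chr1\nAAACCC\nGGGTTT\n' > "$workdir/toy.fa"
printf 'chr1\tCufflinks\texon\t1\t3\t.\t+\t.\tgene_id "G1"; transcript_id "TX1";\n' > "$workdir/toy.gtf"
printf 'chr1\tCufflinks\texon\t7\t9\t.\t+\t.\tgene_id "G1"; transcript_id "TX1";\n' >> "$workdir/toy.gtf"

spliced_plus_strand "$workdir/toy.fa" "$workdir/toy.gtf"
```

For the toy input this prints a record named `TX1` with the sequence `AAAGGG`, that is, the two exons joined with the intron left out. Minus-strand transcripts would also need reverse-complementing, which this sketch skips.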

ADD REPLY • link 4.7 years ago by nattzy94 ▴ 60

score 0 · Answer 1 · 2020-03-26

0

4.7 years ago

saadbadday ▴ 10

I have not seen a command written quite like yours before (./gffread path/to/master.list.gtf -g /path/to/GRCh38.p13.fna -w ./master.list.fasta). I suggest reading http://ccb.jhu.edu/software/stringtie/gff.shtml for an explanation of the GTF format and of gffread. You can take the transcripts.gtf output file from Cufflinks and run this command on each file:

gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf

Alternatively, you can merge the transcripts.gtf files from all samples with cuffmerge (included with Cufflinks) to produce cuffmerge-OUT/merged.gtf, then put merged.gtf in place of transcripts.gtf in the command above to get a transcripts.fa file. I hope this helps you solve your problem. Thanks.

ADD COMMENT • link 4.7 years ago by saadbadday ▴ 10

0

Hi, is this an answer to the original question by the user nattzy94? Thank you!