Hello
I am trying to convert a .gff file (version 3) downloaded from NCBI to a .gtf file, but I am having some trouble. I expected the 9th column of the outputted .gtf files to be "gene_id", but I can't get it to work.
After trying many times, I was recommended to use the .gtf files provided in the iGenomes website.
However, even though it shows me that "gene_id" is in the 9th column (see below), when I try using it in featureCounts, it shows me a warning message:
This is how the .gtf file from iGenomes looks like:
$ head genes.gtf
1 Gnomon exon 33312 35627 . + . gene_id "LOC104970773"; gene_name "LOC104970773"; p_id "P29020"; transcript_id "XM_010800890.1"; tss_id "TSS27855";
1 Gnomon CDS 34375 35558 . + 0 gene_id "LOC104970773"; gene_name "LOC104970773"; p_id "P29020"; transcript_id "XM_010800890.1"; tss_id "TSS27855";
1 Gnomon exon 122904 125014 . - . gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";
1 Gnomon CDS 124853 125014 . - 0 gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";
1 Gnomon CDS 135820 136001 . - 2 gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";
1 Gnomon exon 135820 136001 . - . gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";
1 Gnomon CDS 137158 137264 . - 1 gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";
1 Gnomon exon 137158 137264 . - . gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";
1 Gnomon CDS 137834 137959 . - 1 gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";
1 Gnomon exon 137834 137959 . - . gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";
Here is the warning outputted by featureCounts:
./featureCounts \
-T 1 \
-a ~/RNASeq/Bos_taurus/NCBI/UMD_3.1.1/Annotation/Genes/genes.gtf -s 2 -g gene_id \
-t exon \
-o charolais_${name}.featCounts.txt \
~/RNASeq/RNASeq_STAR/NCBI_UMD3.1.1/charolais_${name}_aligned_Aligned.sortedByCoord.out.bam
========== _____ _ _ ____ _____ ______ _____
===== / ____| | | | _ \| __ \| ____| /\ | __ \
===== | (___ | | | | |_) | |__) | |__ / \ | | | |
==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
v1.4.6-p5
//========================== featureCounts setting ===========================\\
|| ||
|| Input files : 1 BAM file ||
|| S /home/undergrad1/RNASeq/RNASeq_STAR/NCBI_U ... ||
|| ||
|| Output file : charolais_145A_TGACCA.featCounts.txt ||
|| Annotations : /home/undergrad1/RNASeq/Bos_taurus/NCBI/UMD_ ... ||
|| ||
|| Threads : 1 ||
|| Level : meta-feature level ||
|| Paired-end : no ||
|| Strand specific : inversed ||
|| Multimapping reads : not counted ||
|| Multi-overlapping reads : not counted ||
|| Read orientations : fr ||
|| ||
\\===================== http://subread.sourceforge.net/ ======================//
//================================= Running ==================================\\
|| ||
|| Load annotation file /home/undergrad1/RNASeq/Bos_taurus/NCBI/UMD_3.1.1 ... ||
Warning: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
The specified gene identifier attribute is 'gene_id'
The attributes included in your GTF annotation are 'gene_name "CFAP20"; p_id "P19047"; transcript_id "NM_001003905.1"; tss_id "TSS12705";'
It seems like the columns are off by one. Does anyone know a proper way to convert a .gff file to a .gtf file or a way I can fix this gtf file from iGenomes?
I've tried the gffread tool from cuffsuite, and some scripts I found online (such as in here http://blog.nextgenetics.net/?e=27) but none of them seemed to work properly. Any help is appreciated!
Your GTF file doesn't conform to the definition: here they write that gene_id and transcript_id must be at the start of the 9th column: "Any other attributes or comments must appear after these two and will be ignored." By the way, this is also written in the blog post you linked to.
Thank you!
I realized that, however I do not know how to change it.
I expected the file downloaded from iGenomes above to be right, but apparently it is not.
Any ideas on how to edit the gtf file without messing it up?
Thanks!
With a perl one-liner like this:
Note that this keeps only the gene_id and transcript_id attributes. Adapt to your needs.
Thank you so much, Jean! Your code was really helpful!