Entering edit mode
5.9 years ago
O.rka
▴
740
I am trying to run the TagReadWithGeneFunction
tool from https://github.com/broadinstitute/Drop-seq and it requires my gtf
file to have gene name
, gene id
, transcript name
, and transcript id
. My gff file from maker
did not have most of these fields.
Is there a tool I can use to fix either my gff file or my converted gtf file?
I used gffread
from http://ccb.jhu.edu/software/stringtie/gff.shtml to convert to gtf.
Here is the output of TagReadWithGeneFunction
:
-bash-4.1$ cat dropseq_output/Frozen_HJC7MBGX5/logging/star_gene_exon_tagged.bam.log
INFO 2019-01-24 02:24:17 TagReadWithGeneFunction
********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
********** TagReadWithGeneFunction -INPUT dropseq_output/Frozen_HJC7MBGX5/merged.bam -ANNOTATIONS_FILE /usr/local/projdata/0568/projects/PLANKTON/illumina_aallen/jespinoz/References/Cylindrotheca/annotations/cylindrotheca_rd3.all.maker.gtf -OUTPUT dropseq_output/Frozen_HJC7MBGX5/star_gene_exon_tagged.bam
**********
02:24:19.432 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/jespinoz/Algorithms/Mapping/dropseq_pipeline/Drop-seq_tools-2.1.0/jar/lib/picard-2.18.14.jar!/com/intel/gkl/native/libgkl_compression.so
[Thu Jan 24 02:24:19 EST 2019] TagReadWithGeneFunction INPUT=dropseq_output/Frozen_HJC7MBGX5/merged.bam OUTPUT=dropseq_output/Frozen_HJC7MBGX5/star_gene_exon_tagged.bam ANNOTATIONS_FILE=/usr/local/projdata/0568/projects/PLANKTON/illumina_aallen/jespinoz/References/Cylindrotheca/annotations/cylindrotheca_rd3.all.maker.gtf GENE_NAME_TAG=gn GENE_STRAND_TAG=gs GENE_FUNCTION_TAG=gf READ_FUNCTION_TAG=XF USE_STRAND_INFO=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Thu Jan 24 02:24:19 EST 2019] Executing as jespinoz@dell-15.jcvi.org on Linux 2.6.32-696.18.7.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.1.0(e9a342e_1547567906)
[Thu Jan 24 02:24:21 EST 2019] org.broadinstitute.dropseqrna.metrics.TagReadWithGeneFunction done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=2058354688
Exception in thread "main" picard.annotation.AnnotationException: Invalid GTF line:
000000F|arrow maker exon 266 1689 . + . transcript_id "CylT000001-RA|arrow-processed-gene-0.141-mRNA-1";
Problems:
Missing gene_id
Missing gene_name
Missing transcript_name
at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:97)
at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:39)
at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:71)
at htsjdk.samtools.util.PeekableIterator.<init>(PeekableIterator.java:38)
at org.broadinstitute.dropseqrna.utils.FilteredIterator.<init>(FilteredIterator.java:37)
at org.broadinstitute.dropseqrna.annotation.GTFReader$FilteringGTFParser.<init>(GTFReader.java:109)
at org.broadinstitute.dropseqrna.annotation.GTFReader$FilteringGTFParser.<init>(GTFReader.java:107)
at org.broadinstitute.dropseqrna.annotation.GTFReader.load(GTFReader.java:74)
at org.broadinstitute.dropseqrna.annotation.GTFReader.load(GTFReader.java:70)
at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadGTFFile(GeneAnnotationReader.java:67)
at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadAnnotationsFile(GeneAnnotationReader.java:51)
at org.broadinstitute.dropseqrna.metrics.TagReadWithGeneFunction.doWork(TagReadWithGeneFunction.java:113)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)
Here is the top few lines of my gff file:
-bash-4.1$ head /usr/local/projdata/0568/projects/PLANKTON/illumina_aallen/jespinoz/References/Cylindrotheca/annotations/cylindrotheca_rd3.all.maker.gff
##gff-version 3
000161F|arrow . contig 1 17082 . . . ID=000161F|arrow;Name=000161F|arrow
000161F|arrow maker gene 35 1654 . - . ID=CylG030441|arrow-snap-gene-0.5;Name=CylG030441|arrow-snap-gene-0.5;Alias=maker-000161F|arrow-snap-gene-0.5;
000161F|arrow maker mRNA 35 1654 . - . ID=CylG030441|arrow-snap-gene-0.5-mRNA-1;Parent=CylG030441|arrow-snap-gene-0.5;Name=CylG030441|arrow-snap-gene-0.5-mRNA-1;Alias=maker-000161F|arrow-snap-gene-0.5-mRNA-1;_AED=0.13;_QI=0|0|0|1|0|0|2|0|510;_eAED=0.07;
000161F|arrow maker exon 1406 1654 . - . ID=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1:exon:28;Parent=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1;
000161F|arrow maker exon 35 1315 . - . ID=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1:exon:27;Parent=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1;
000161F|arrow maker CDS 1406 1654 . - 0 ID=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1:cds;Parent=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1;
000161F|arrow maker CDS 35 1315 . - 0 ID=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1:cds;Parent=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1;
000161F|arrow maker gene 15768 16170 . + . ID=CylG030449|arrow-snap-gene-0.4;Name=CylG030449|arrow-snap-gene-0.4;Alias=maker-000161F|arrow-snap-gene-0.4;
000161F|arrow maker mRNA 15768 16170 . + . ID=CylG030449|arrow-snap-gene-0.4-mRNA-1;Parent=CylG030449|arrow-snap-gene-0.4;Name=CylG030449|arrow-snap-gene-0.4-mRNA-1;Alias=maker-000161F|arrow-snap-gene-0.4-mRNA-1;_AED=0.41;_QI=0|0|0|1|0|0|3|0|94;_eAED=0.41;
Lastly, here is my converted gtf file:
-bash-4.1$ head /usr/local/projdata/0568/projects/PLANKTON/illumina_aallen/jespinoz/References/Cylindrotheca/annotations/cylindrotheca_rd3.all.maker.gtf
000000F|arrow maker exon 266 1689 . + . transcript_id "CylT000001-RA|arrow-processed-gene-0.141-mRNA-1";
000000F|arrow maker exon 1781 2574 . + . transcript_id "CylT000001-RA|arrow-processed-gene-0.141-mRNA-1";
000000F|arrow maker CDS 266 634 . + 0 transcript_id "CylT000001-RA|arrow-processed-gene-0.141-mRNA-1";
000000F|arrow maker exon 326 473 . + . transcript_id "CylT000002-RA|arrow-noncoding-gene-0.302-snoRNA-1";
000000F|arrow maker exon 2798 3631 . - . transcript_id "CylT000003-RA|arrow-processed-gene-0.179-mRNA-1";
000000F|arrow maker CDS 2798 3631 . - 0 transcript_id "CylT000003-RA|arrow-processed-gene-0.179-mRNA-1";
000000F|arrow maker exon 6084 6181 . + . transcript_id "CylT000004-RA|arrow-noncoding-gene-0.303-snoRNA-1";
000000F|arrow maker exon 7306 9415 . + . transcript_id "CylT000005-RA|arrow-snap-gene-0.2-mRNA-1";
000000F|arrow maker exon 9971 10218 . + . transcript_id "CylT000005-RA|arrow-snap-gene-0.2-mRNA-1";
000000F|arrow maker CDS 7306 9415 . + 0 transcript_id "CylT000005-RA|arrow-snap-gene-0.2-mRNA-1";
If your function only needs the missing
gene_id
,gene_name
, andtranscript_name
in the attributes column, without a real value, you can try to add them to every line with no value (hencegene_id ""; gene_name ""; transcript_name ""
) after thetranscript_id
. You can try AWK for such manipulation. No guarantees, I don't know what the function exactly needs to prevent the error.It looks like for this record:
000161F|arrow maker mRNA 35 1654 . - . ID=CylG030441|arrow-snap-gene-0.5-mRNA-1;Parent=CylG030441|arrow-snap-gene-0.5;Name=CylG030441|arrow-snap-gene-0.5-mRNA-1;Alias=maker-000161F|arrow-snap-gene-0.5-mRNA-1;_AED=0.13;_QI=0|0|0|1|0|0|2|0|510;_eAED=0.07;
The gene id/name is isCylG030441|arrow-snap-gene-0.5
while transcript id/nameCylG030441|arrow-snap-gene-0.5-mRNA-1
.