How to avoid mRNA feature in output and get only transcript when using AGAT?

Question

GTF file

0

9 months ago

IrK ▴ 100

Dear Biostars community,

I have to build a reference for cellranger-arc-2.0.2. However, my organism's chromosomes are too large for Cell Ranger, so I had to split the FASTA file into smaller chromosome segments and modify the GTF file accordingly:

I changed annotation in the 1st column to align with names from Fasta,
I used AGAT tool to remove any redundancies in 3rd column (gene, transcript, exon),
I left 'gene_id', 'transcript_id', 'gene_name' in 9th column,

but I still run into errors from cellranger-arc mkref. The latest error:

['7H-0-328847192', 'IPK', 'exon', '92282224', '92282629', '.', '+', '.', 'gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.1; gene_name HvNIP2;']
on line 488171 specifies an 'exon' annotation for a transcript HORVU.MOREX.r3.7HG0666460.1, but there is no 'transcript' row in the GTF for HORVU.MOREX.r3.7HG0666460.1 that immediately precedes it.
    Please fix your GTF and start again.

when I check my GTF file in this row:

7H-0-328847192  IPK exon    92282224    92282629    .   +   .   gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.1; gene_name HvNIP2;
7H-0-328847192  IPK transcript  92282224    92282629    .   +   .   gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.1; gene_name HvNIP2;

Any help or advice on how to make GTF suitable for cell ranger would be appreciated.

Thank you

cellranger-arc GTF cellranger-arc-2.0.2 • 4.2k views

ADD COMMENT • link 8 months ago by IrK ▴ 100

0

The error mentions "... immediately precedes it."

In the GTF, the transcript row does not actually precede the exon row; it comes after it. Just bringing this to your notice.

ADD REPLY • link 9 months ago by Jeffin Rockey ★ 1.3k

0

Sorting the GFF/GTF with one of the available tools may fix this. I think the 'normal' sort order of these tools is position, then gene > transcript > exon > CDS
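
If no sorting tool is at hand, the ordering described above can be sketched in a few lines of Python. This is only a rough sketch under assumptions: the file is tab-separated with 9 columns, every row carries a gene_id, and the function names (`get_attr`, `sort_gtf_rows`) are made up for illustration. It is not what AGAT or cellranger do internally.

```python
import re

# Rank of each feature type within a transcript; anything else sorts last.
FEATURE_RANK = {"gene": 0, "transcript": 1, "exon": 2, "CDS": 3}


def get_attr(attributes, key):
    """Return the value of `key` from a GTF attribute string, or ''."""
    match = re.search(rf'\b{key} "?([^";]+)"?', attributes)
    return match.group(1) if match else ""


def sort_gtf_rows(lines):
    """Order rows as gene > transcript > exon > CDS > other, per gene."""
    rows = [l.rstrip("\n").split("\t") for l in lines
            if l.strip() and not l.startswith("#")]
    # Anchor every gene at its smallest start so its rows stay together.
    gene_start = {}
    for row in rows:
        gid = get_attr(row[8], "gene_id")
        gene_start[gid] = min(gene_start.get(gid, int(row[3])), int(row[3]))

    def key(row):
        gid = get_attr(row[8], "gene_id")
        # Gene rows sort before any transcript of the same gene.
        tid = "" if row[2] == "gene" else get_attr(row[8], "transcript_id")
        return (row[0], gene_start[gid], gid, tid,
                FEATURE_RANK.get(row[2], 4), int(row[3]))

    return ["\t".join(row) for row in sorted(rows, key=key)]
```

Header lines starting with # are dropped by this sketch, so they would have to be written back separately.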

ADD REPLY • link 9 months ago by Michael 55k

0

Thank you so much for your responses. I had no idea there is a set order 'gene > transcript > exon > CDS'; I will take this into consideration. @Juke34, I did modify the GTF file: I added a missing transcript row, not sure why AGAT missed it.

Let me start from the beginning in more detail:

Modify 1st column in GTF to align to annotations in fasta
AGAT tool:

Populate for missing features: agat_convert_sp_gxf2gxf.pl --gtf ${INGTF} --out ${GFFOUT}.gff

Convert to GTF format: agat_convert_sp_gff2gtf.pl --gff ${GFFOUT}.gff --out ${GTFOUT}.gtf

Remove inner quotes (related to the error: "Parsed attribute had a quote in the middle of a value. Please ensure quotes are only used to encapsulate attribute values. Bad Attribute Value = gene_source")

Got this error:

*"Supplied GTF is invalid. This row
['7H-0-328847192', 'IPK', 'exon', '92282224', '92282629', '.', '+', '.', 'gene_id "HORVU.MOREX.r3.7HG0666460"; transcript_id "HORVU.MOREX.r3.7HG0666460.1"; ID "agat-exon-12385"; Parent "HORVU.MOREX.r3.7HG0666460.1"; exon_id "HORVU.MOREX.r3.7HG0666460.1-E1"; exon_number "1"; gene_biotype "protein_coding"; gene_name "HvNIP2"; gene_source "IPK"; tag "Ensembl_canonical"; transcript_biotype "protein_coding"; transcript_source "IPK";']
on line 488173 specifies an 'exon' annotation for a transcript HORVU.MOREX.r3.7HG0666460.1, but there is no 'transcript' row in the GTF for HORVU.MOREX.r3.7HG0666460.1 that immediately precedes it. Please fix your GTF and start again."*

Insert the missing transcript row with sed -i

Trying to sort with AGAT (not yet successful)

Basically, is a missing 'transcript' row normal behaviour for the AGAT tool?

ADD REPLY • link updated 9 months ago by Juke34 9.2k • written 9 months ago by IrK ▴ 100

0

Could you post the complete GTF file on bitbucket or pastebin?

ADD REPLY • link 9 months ago by Michael 55k

0

https://bitbucket.org/irkost/gft_file/src/main/ 12_GTF_AGAT_formatted_quotes_rm.gtf (file size 196M)

this file is the output from AGAT (agat_convert_sp_gxf2gxf.pl and agat_convert_sp_gff2gtf.pl) + inner quotes removed

ADD REPLY • link 9 months ago by IrK ▴ 100

0

Could you provide more lines before and after the record? Did you modify your file after using AGAT? In your case the transcript line appears after the exon line instead of before it. AGAT is not supposed to produce something like that.

ADD REPLY • link 9 months ago by Juke34 9.2k

1

Here are a few lines from the GTF file, where one row with the missing 'transcript' info was added using sed -i [7H-0-328847192  IPK transcript  92282224    92282629 ...]

7H-0-328847192  IPK     five_prime_utr  91912842        91913046        .       +       .       gene_id HORVU.MOREX.r3.7HG0666440; transcript_id HORVU.MOREX.r3.7HG0666440.1;
7H-0-328847192  IPK     start_codon     91913047        91913049        .       +       0       gene_id HORVU.MOREX.r3.7HG0666440; transcript_id HORVU.MOREX.r3.7HG0666440.1;
7H-0-328847192  IPK     stop_codon      91916431        91916433        .       +       0       gene_id HORVU.MOREX.r3.7HG0666440; transcript_id HORVU.MOREX.r3.7HG0666440.1;
7H-0-328847192  IPK     three_prime_utr 91916434        91916918        .       +       .       gene_id HORVU.MOREX.r3.7HG0666440; transcript_id HORVU.MOREX.r3.7HG0666440.1;
7H-0-328847192  IPK     gene    92282224        92285885        .       +       .       gene_id HORVU.MOREX.r3.7HG0666460; gene_name HvNIP2;
7H-0-328847192  AGAT    mRNA    92282224        92285885        .       +       .       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.1; gene_name HvNIP2;
7H-0-328847192  IPK     exon    92282224        92282629        .       +       .       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.1; gene_name HvNIP2;
7H-0-328847192  IPK     transcrip    92282224        92282629        .       +       .       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.1; gene_name HvNIP2;
7H-0-328847192  IPK     exon    92282721        92282945        .       +       .       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.1; gene_name HvNIP2;
7H-0-328847192  IPK     exon    92284311        92284505        .       +       .       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.1; gene_name HvNIP2;
7H-0-328847192  IPK     exon    92284635        92284696        .       +       .       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.1; gene_name HvNIP2;
7H-0-328847192  IPK     exon    92285235        92285885        .       +       .       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.1; gene_name HvNIP2;
7H-0-328847192  IPK     CDS     92282462        92282629        .       +       0       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.1; gene_name HvNIP2;
7H-0-328847192  IPK     CDS     92282721        92282945        .       +       0       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.1; gene_name HvNIP2;
7H-0-328847192  IPK     CDS     92284311        92284505        .       +       0       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.1; gene_name HvNIP2;
7H-0-328847192  IPK     CDS     92284635        92284696        .       +       0       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.1; gene_name HvNIP2;
7H-0-328847192  IPK     CDS     92285235        92285487        .       +       1       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.1; gene_name HvNIP2;
7H-0-328847192  IPK     five_prime_utr  92282224        92282461        .       +       .       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.1; gene_name HvNIP2;
7H-0-328847192  IPK     start_codon     92282462        92282464        .       +       0       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.1; gene_name HvNIP2;
7H-0-328847192  IPK     stop_codon      92285485        92285487        .       +       0       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.1; gene_name HvNIP2;
7H-0-328847192  IPK     three_prime_utr 92285488        92285885        .       +       .       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.1; gene_name HvNIP2;
7H-0-328847192  AGAT    mRNA    92282224        92285885        .       +       .       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.2; gene_name HvNIP2;
7H-0-328847192  IPK     exon    92282224        92282629        .       +       .       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.2; gene_name HvNIP2;
7H-0-328847192  IPK     exon    92282721        92282945        .       +       .       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.2; gene_name HvNIP2;
7H-0-328847192  IPK     exon    92284311        92284505        .       +       .       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.2; gene_name HvNIP2;
7H-0-328847192  IPK     exon    92284635        92285885        .       +       .       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.2; gene_name HvNIP2;
7H-0-328847192  IPK     CDS     92282462        92282629        .       +       0       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.2; gene_name HvNIP2;
7H-0-328847192  IPK     CDS     92282721        92282945        .       +       0       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.2; gene_name HvNIP2;
7H-0-328847192  IPK     CDS     92284311        92284505        .       +       0       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.2; gene_name HvNIP2;
7H-0-328847192  IPK     CDS     92284635        92284700        .       +       0       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.2; gene_name HvNIP2;
7H-0-328847192  IPK     five_prime_utr  92282224        92282461        .       +       .       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.2; gene_name HvNIP2;
7H-0-328847192  IPK     start_codon     92282462        92282464        .       +       0       gene_id HORVU.MOREX.r3.7HG0666460; transcript_id HORVU.MOREX.r3.7HG0666460.2; gene_name HvNIP2;

ADD REPLY • link 9 months ago by IrK ▴ 100

1

9 months ago

Juke34 9.2k

How to avoid mRNA feature in output and get only transcript when using AGAT?

To avoid getting an mRNA feature instead of transcript in the 3rd column of your output GTF file, you must avoid performing the conversion in relax mode. Relax mode allows any feature type in column 3; using GTF version 2.5 or 3 will give you only transcript features.

To avoid this problem, you can modify the config before running AGAT using:

agat config --expose --output_format GTF --gtf_output_version 3

or, when running agat_convert_sp_gff2gtf.pl, please use the --gtf_version 3 parameter!

ADD COMMENT • link 9 months ago by Juke34 9.2k

0

9 months ago

Michael 55k

I assume the problem is the feature type in the third column, which is either transcript or mRNA (only 7 occurrences). Likely, your file is composite and only one feature type is supported by the pipeline. Try replacing all occurrences of mRNA in the 3rd column with transcript and it may work.
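
A minimal sketch of that replacement, restricted to the 3rd column so that attribute values containing "mRNA" are left alone. It assumes a tab-separated file with 9 columns; `rename_feature` is a hypothetical helper name, not part of any tool mentioned here.

```python
def rename_feature(line, old="mRNA", new="transcript"):
    """Rename the feature type in column 3 of one GTF line.

    Comment lines and rows that do not have 9 tab-separated columns
    are returned unchanged (minus the trailing newline).
    """
    line = line.rstrip("\n")
    if line.startswith("#"):
        return line
    fields = line.split("\t")
    if len(fields) == 9 and fields[2] == old:
        fields[2] = new
    return "\t".join(fields)


def rename_features(lines, old="mRNA", new="transcript"):
    """Apply rename_feature to every line of an iterable of GTF lines."""
    return [rename_feature(line, old, new) for line in lines]
```

Usage would be along the lines of reading the file, calling `rename_features(handle)`, and writing the result to a new file rather than editing in place.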

ADD COMMENT • link 9 months ago by Michael 55k

0

I replaced mRNA with transcript and kept gene_id, transcript_id, gene_name in the 9th column for cellranger. Now I am getting:

error: Duplicate Gene ID found in GTF: HORVU.MOREX.r3.2HG0189670

grep HORVU.MOREX.r3.2HG0189670:

  2H-301293086-665585731  AGAT    gene    69411162        69412347        .       +       .       gene_id HORVU.MOREX.r3.2HG0189670; transcript_id HORVU.MOREX.r3.2HG0189670.1; gene_name HvTIP2;
    2H-301293086-665585731  IPK     transcript      69411162        69412347        .       +       .       gene_id HORVU.MOREX.r3.2HG0189670; transcript_id HORVU.MOREX.r3.2HG0189670.1; gene_name HvTIP2;
    2H-301293086-665585731  IPK     gene    69411831        69413452        .       +       .       gene_id HORVU.MOREX.r3.2HG0189670; gene_name HvTIP2;
    2H-301293086-665585731  AGAT    transcript      69411831        69413452        .       +       .       gene_id HORVU.MOREX.r3.2HG0189670; transcript_id HORVU.MOREX.r3.2HG0189670.1; gene_name HvTIP2;
    2H-301293086-665585731  IPK     exon    69411831        69412495        .       +       .       gene_id HORVU.MOREX.r3.2HG0189670; transcript_id HORVU.MOREX.r3.2HG0189670.1; gene_name HvTIP2;
    2H-301293086-665585731  IPK     exon    69412738        69412988        .       +       .       gene_id HORVU.MOREX.r3.2HG0189670; transcript_id HORVU.MOREX.r3.2HG0189670.1; gene_name HvTIP2;
    2H-301293086-665585731  IPK     exon    69413118        69413325        .       +       .       gene_id HORVU.MOREX.r3.2HG0189670; transcript_id HORVU.MOREX.r3.2HG0189670.1; gene_name HvTIP2;
    2H-301293086-665585731  AGAT    exon    69413372        69413452        .       +       .       gene_id HORVU.MOREX.r3.2HG0189670; transcript_id HORVU.MOREX.r3.2HG0189670.1; gene_name HvTIP2;
    2H-301293086-665585731  IPK     CDS     69412127        69412495        .       +       0       gene_id HORVU.MOREX.r3.2HG0189670; transcript_id HORVU.MOREX.r3.2HG0189670.1; gene_name HvTIP2;
    2H-301293086-665585731  IPK     CDS     69412738        69412988        .       +       2       gene_id HORVU.MOREX.r3.2HG0189670; transcript_id HORVU.MOREX.r3.2HG0189670.1; gene_name HvTIP2;
    2H-301293086-665585731  IPK     CDS     69413199        69413325        .       +       0       gene_id HORVU.MOREX.r3.2HG0189670; transcript_id HORVU.MOREX.r3.2HG0189670.1; gene_name HvTIP2;
    2H-301293086-665585731  AGAT    five_prime_UTR  69411831        69412126        .       +       .       gene_id HORVU.MOREX.r3.2HG0189670; transcript_id HORVU.MOREX.r3.2HG0189670.1; gene_name HvTIP2;
    2H-301293086-665585731  IPK     five_prime_utr  69413372        69413452        .       +       .       gene_id HORVU.MOREX.r3.2HG0189670; transcript_id HORVU.MOREX.r3.2HG0189670.1; gene_name HvTIP2;
    2H-301293086-665585731  IPK     start_codon     69413447        69413449        .       +       0       gene_id HORVU.MOREX.r3.2HG0189670; transcript_id HORVU.MOREX.r3.2HG0189670.1; gene_name HvTIP2;
    2H-301293086-665585731  IPK     stop_codon      69412490        69412492        .       +       0       gene_id HORVU.MOREX.r3.2HG0189670; transcript_id HORVU.MOREX.r3.2HG0189670.1; gene_name HvTIP2;
    2H-301293086-665585731  IPK     three_prime_utr 69412203        69412347        .       +       .       gene_id HORVU.MOREX.r3.2HG0189670; transcript_id HORVU.MOREX.r3.2HG0189670.1; gene_name HvTIP2;

ADD REPLY • link 9 months ago by IrK ▴ 100

0

Different error == solution working :)

The rest of the problem is that you have conflicting gene and transcript entries with the same id but different coordinates. You need to resolve this somehow, most likely the first two lines of your grep output are invalid, at least they do not correspond to exon coordinates. The question is why the file is like this, possibly you are paying the price for manually editing the file. I think it's a bad idea to edit genome annotations manually.

  2H-301293086-665585731  AGAT    gene    69411162        69412347        .       +       .       gene_id HORVU.MOREX.r3.2HG0189670; transcript_id HORVU.MOREX.r3.2HG0189670.1; gene_name HvTIP2;
  2H-301293086-665585731  IPK     gene    69411831        69413452        .       +       .       gene_id HORVU.MOREX.r3.2HG0189670; gene_name HvTIP2;

and

 2H-301293086-665585731  IPK     transcript      69411162        69412347        .       +       .       gene_id HORVU.MOREX.r3.2HG0189670; transcript_id HORVU.MOREX.r3.2HG0189670.1; gene_name HvTIP2;
 2H-301293086-665585731  AGAT    transcript      69411831        69413452        .       +       .       gene_id HORVU.MOREX.r3.2HG0189670; transcript_id HORVU.MOREX.r3.2HG0189670.1; gene_name HvTIP2;
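
To find every such case rather than waiting for cellranger to report them one by one, something like the following sketch could list gene IDs that occur on more than one gene row with different coordinates. It assumes tab-separated input; `conflicting_genes` is an illustrative name.

```python
import re
from collections import defaultdict


def conflicting_genes(lines):
    """Map gene_id -> sorted list of (seqid, start, end) for every gene
    that has gene rows with more than one distinct set of coordinates."""
    seen = defaultdict(set)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue
        match = re.search(r'\bgene_id "?([^";]+)"?', fields[8])
        if match:
            seen[match.group(1)].add((fields[0], int(fields[3]), int(fields[4])))
    return {gid: sorted(coords) for gid, coords in seen.items()
            if len(coords) > 1}
```

The same idea works for transcript rows by checking for "transcript" in column 3 and keying on transcript_id.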

ADD REPLY • link 9 months ago by Michael 55k

0

Thank you for your suggestion. That's interesting, because I haven't edited the GTF file manually at any stage. I used R, bash commands, and AGAT tools, but never opened the file and edited it by hand. I need to investigate what's going on.

ADD REPLY • link 9 months ago by IrK ▴ 100

0

Check if you or some process appended something to the file or the file was modified with some unix utilities (sed,grep,awk).

I used R, bash commands, AGAT tools,

This is almost what I meant by "manually" :)

It may be best to start again from a clean input, e.g. something downloaded from a repository or revert to the original output of your gene prediction. If there is one error there could be more and not necessarily all will throw an error message.

ADD REPLY • link 9 months ago by Michael 55k

0

Also you write:

I used AGAT tool to remove any redundancies in 3rd column (gene, transcript, exon),

So there have been some redundant entries from the start (?), even though I don't quite understand what that means.

ADD REPLY • link 9 months ago by Michael 55k

0

Hi Michael, thank you for your suggestions; it helps a lot to have this support! This is my first time working with a GTF file and modifying it in a custom way for cellranger-arc, so it helps a lot to brainstorm this with someone rather than all by myself. I think the issue comes from the fact that I had to split the chromosomes in the FASTA into smaller pieces, e.g. an initial chromosome (just made-up coordinates for simplicity) chr1 1:100 becomes chr1vo 1:40 and chr1v2 41:100 in the FASTA, and integrate this information into the GTF file, which unfortunately requires R, bash, AGAT, etc.
As for redundancy, I think cellranger raised a confusing error at the beginning of the GTF modification journey (please see below), which I interpreted as "there is no such feature in the GTF"; thus, I used the AGAT tool to add this information to the GTF file. Now I think I just had to sort the GTF file to obtain the proper order gene > transcript > exon > CDS, as suggested by Jeffin Rockey and you. I have to start from the beginning...

on line 6120 specifies an 'exon' annotation for a transcript HORVU.MOREX.r3.1HG0024880.1, 
but there is no 'transcript' row in the GTF for HORVU.MOREX.r3.1HG0024880.1 that immediately precedes it. 
Please fix your GTF and start again.
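
For reference, the coordinate shift involved in splitting a chromosome can be sketched as below. This is only an illustration under assumptions: each segment is described by a hypothetical (name, offset, end) tuple, where offset is the number of bases before the segment on the original chromosome, GTF coordinates are 1-based and inclusive, and a feature crossing a split point is reported as None so it can be handled by hand.

```python
def remap(start, end, segments):
    """Map a 1-based inclusive interval on the original chromosome onto
    the segment that fully contains it.

    `segments` is a list of (name, offset, seg_end) tuples, where the
    segment covers original positions offset+1 .. seg_end.
    Returns (segment_name, new_start, new_end), or None if the interval
    is not fully inside any single segment.
    """
    for name, offset, seg_end in segments:
        if offset < start and end <= seg_end:
            return name, start - offset, end - offset
    return None


# Made-up example mirroring the one above: a 100 bp chromosome cut at 40.
SEGMENTS = [("chr1-0-40", 0, 40), ("chr1-40-100", 40, 100)]
print(remap(45, 60, SEGMENTS))
print(remap(35, 45, SEGMENTS))
```

If the split points fall inside a gene, its gene, transcript and exon rows may end up on different segments, so choosing split points in intergenic regions avoids that class of problem.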

ADD REPLY • link 9 months ago by IrK ▴ 100

1

Have you looked at this issue in GitHub https://github.com/10XGenomics/cellranger/issues/133

If we can catch the sorting issue I could solve the problem within AGAT

ADD REPLY • link 9 months ago by Juke34 9.2k

0

That might be an error from AGAT. I think I have already seen that in a previous version of AGAT. What version of AGAT did you use? If it is not the latest, you might be luckier trying that.

ADD REPLY • link 9 months ago by Juke34 9.2k

0

Another GFF Analysis Toolkit (AGAT) - Version: v1.4.0

ADD REPLY • link 9 months ago by IrK ▴ 100

0

I checked number of rows I have in each step.

Initial GTF from Ensembl: Hordeum_vulgare.MorexV3_pseudomolecules_assembly.59.gtf - 523,556 (5 lines of header #!)
Modify based on Fasta: 523,551 (5 lines of header are missing)
Apply AGAT tool (add missing features, output in ##gff-version 3 format): agat_convert_sp_gxf2gxf.pl --gtf ${INGTF} --out ${OUTGFF} - 547,202 (1 line is header) AGAT tool added 23,651 extra lines
Apply AGAT tool (convert to GTF format): agat_convert_sp_gff2gtf.pl --gff ${GFFIN}.gff --out ${OUT} - 553,611 (2 lines of header); additional 6409 lines are added for some reason during conversion from GFF to GTF

 gtf-version X

  GFF-like GTF i.e. not checked against any GTF specification. Conversion based on GFF input, standardised by AGAT.

I wonder if it is normal that the AGAT tool added so many extra lines in step 2. And I don't think it's normal to have an extra 6k lines after file conversion.
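
One way to see what kind of rows were added between two steps is to compare the feature types in column 3 before and after. A rough sketch, assuming tab-separated files; `feature_counts` and `added_features` are hypothetical helper names.

```python
from collections import Counter


def feature_counts(lines):
    """Count how often each feature type (column 3) occurs."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


def added_features(before, after):
    """Feature types that are more frequent in `after` than in `before`.

    Counter subtraction keeps only positive differences, so removed
    rows do not show up here.
    """
    return feature_counts(after) - feature_counts(before)
```

Calling `added_features(open(step1), open(step2))` on two consecutive outputs would show, per feature type, how many rows the step introduced.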

ADD REPLY • link 9 months ago by IrK ▴ 100

2

Now you are seemingly in for a lengthy debugging session. I'd recommend generating a short version of the initial file consisting of 2-3 gene records only and then checking whether the outcome is consistent at every step of your processing. And yes, there could be a bug in any of the tools you are using, including AGAT and cellranger (except the basic unix utils, where this is very unlikely, though they might work differently from what you expect), as well as in your processing steps.

Then get the final tool (cellranger) to accept the outcome. If that works, scale up.
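
A small test file like that could be cut out roughly as follows. This sketch assumes every row has a gene_id attribute and keeps all rows belonging to the first n gene IDs encountered; `first_genes` is an illustrative name.

```python
import re


def first_genes(lines, n=3):
    """Return all rows belonging to the first `n` distinct gene_ids.

    The whole input is scanned, so rows of a kept gene that appear
    later in the file (e.g. in an unsorted GTF) are still included.
    """
    keep_ids = []
    kept = []
    for line in lines:
        if line.startswith("#"):
            continue
        match = re.search(r'\bgene_id "?([^";]+)"?', line)
        if not match:
            continue
        gid = match.group(1)
        if gid not in keep_ids:
            if len(keep_ids) == n:
                continue  # a new gene beyond the first n: skip its rows
            keep_ids.append(gid)
        kept.append(line)
    return kept
```

The FASTA would need the matching sequence as well, so picking genes from one small segment keeps the test reference quick to build.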

ADD REPLY • link 9 months ago by Michael 55k

0

Step 3 added a weird record

2H-301293086-665585731  AGAT    gene    69411162    69412347    .   +   .   gene_id "HORVU.MOREX.r3.2HG0189670"; transcript_id "HORVU.MOREX.r3.2HG0189670.1"; ID "HORVU.MOREX.r3.2HG0189670"; gene_biotype "protein_coding"; gene_name "HvTIP2"; gene_source "IPK"; tag "Ensembl_canonical"; transcript_biotype "protein_coding"; transcript_source "IPK";
2H-301293086-665585731  IPK transcript  69411162    69412347    .   +   .   gene_id "HORVU.MOREX.r3.2HG0189670"; transcript_id "HORVU.MOREX.r3.2HG0189670.1"; ID "agat-transcript-3"; Parent "HORVU.MOREX.r3.2HG0189670"; gene_biotype "protein_coding"; gene_name "HvTIP2"; gene_source "IPK"; tag "Ensembl_canonical"; transcript_biotype "protein_coding"; transcript_source "IPK";
2H-301293086-665585731  IPK gene    69411831    69413452    .   +   .   gene_id "HORVU.MOREX.r3.2HG0189670"; ID "agat-gene-1"; gene_biotype "protein_coding"; gene_name "HvTIP2"; gene_source "IPK";
2H-301293086-665585731  AGAT    mRNA    69411831    69413452    .   +   .   gene_id "HORVU.MOREX.r3.2HG0189670"; transcript_id "HORVU.MOREX.r3.2HG0189670.1"; ID "HORVU.MOREX.r3.2HG0189670.1"; Parent "agat-gene-1"; exon_number "1"; gene_biotype "protein_coding"; gene_name "HvTIP2"; gene_source "IPK"; protein_id "HORVU.MOREX.r3.2HG0189670.1"; tag "Ensembl_canonical"; transcript_biotype "protein_coding"; transcript_source "IPK";
2H-301293086-665585731  IPK exon    69411831    69412495    .   +   .   gene_id "HORVU.MOREX.r3.2HG0189670"; transcript_id "HORVU.MOREX.r3.2HG0189670.1"; ID "agat-exon-1741"; Parent "HORVU.MOREX.r3.2HG0189670.1"; exon_id "HORVU.MOREX.r3.2HG0189670.1-E3"; exon_number "3"; gene_biotype "protein_coding"; gene_name "HvTIP2"; gene_source "IPK"; tag "Ensembl_canonical"; transcript_biotype "protein_coding"; transcript_source "IPK";
2H-301293086-665585731  IPK exon    69412738    69412988    .   +   .   gene_id "HORVU.MOREX.r3.2HG0189670"; transcript_id "HORVU.MOREX.r3.2HG0189670.1"; ID "agat-exon-1740"; Parent "HORVU.MOREX.r3.2HG0189670.1"; exon_id "HORVU.MOREX.r3.2HG0189670.1-E2"; exon_number "2"; gene_biotype "protein_coding"; gene_name "HvTIP2"; gene_source "IPK"; tag "Ensembl_canonical"; transcript_biotype "protein_coding"; transcript_source "IPK";
2H-301293086-665585731  IPK exon    69413118    69413325    .   +   .   gene_id "HORVU.MOREX.r3.2HG0189670"; transcript_id "HORVU.MOREX.r3.2HG0189670.1"; ID "agat-exon-1739"; Parent "HORVU.MOREX.r3.2HG0189670.1"; exon_id "HORVU.MOREX.r3.2HG0189670.1-E1"; exon_number "1"; gene_biotype "protein_coding"; gene_name "HvTIP2"; gene_source "IPK"; tag "Ensembl_canonical"; transcript_biotype "protein_coding"; transcript_source "IPK";
2H-301293086-665585731  AGAT    exon    69413372    69413452    .   +   .   gene_id "HORVU.MOREX.r3.2HG0189670"; transcript_id "HORVU.MOREX.r3.2HG0189670.1"; ID "agat-exon-17319"; Parent "HORVU.MOREX.r3.2HG0189670.1"; exon_number "1"; gene_biotype "protein_coding"; gene_name "HvTIP2"; gene_source "IPK"; protein_id "HORVU.MOREX.r3.2HG0189670.1"; tag "Ensembl_canonical"; transcript_biotype "protein_coding"; transcript_source "IPK";
2H-301293086-665585731  IPK CDS 69412127    69412495    .   +   0   gene_id "HORVU.MOREX.r3.2HG0189670"; transcript_id "HORVU.MOREX.r3.2HG0189670.1"; ID "agat-cds-27433"; Parent "HORVU.MOREX.r3.2HG0189670.1"; exon_number "3"; gene_biotype "protein_coding"; gene_name "HvTIP2"; gene_source "IPK"; protein_id "HORVU.MOREX.r3.2HG0189670.1"; tag "Ensembl_canonical"; transcript_biotype "protein_coding"; transcript_source "IPK";
2H-301293086-665585731  IPK CDS 69412738    69412988    .   +   2   gene_id "HORVU.MOREX.r3.2HG0189670"; transcript_id "HORVU.MOREX.r3.2HG0189670.1"; ID "agat-cds-27432"; Parent "HORVU.MOREX.r3.2HG0189670.1"; exon_number "2"; gene_biotype "protein_coding"; gene_name "HvTIP2"; gene_source "IPK"; protein_id "HORVU.MOREX.r3.2HG0189670.1"; tag "Ensembl_canonical"; transcript_biotype "protein_coding"; transcript_source "IPK";
2H-301293086-665585731  IPK CDS 69413199    69413325    .   +   0   gene_id "HORVU.MOREX.r3.2HG0189670"; transcript_id "HORVU.MOREX.r3.2HG0189670.1"; ID "agat-cds-27431"; Parent "HORVU.MOREX.r3.2HG0189670.1"; exon_number "1"; gene_biotype "protein_coding"; gene_name "HvTIP2"; gene_source "IPK"; protein_id "HORVU.MOREX.r3.2HG0189670.1"; tag "Ensembl_canonical"; transcript_biotype "protein_coding"; transcript_source "IPK";
2H-301293086-665585731  AGAT    five_prime_UTR  69411831    69412126    .   +   .   gene_id "HORVU.MOREX.r3.2HG0189670"; transcript_id "HORVU.MOREX.r3.2HG0189670.1"; ID "agat-five_prime_utr-23204"; Parent "HORVU.MOREX.r3.2HG0189670.1"; exon_id "HORVU.MOREX.r3.2HG0189670.1-E1"; exon_number "1"; gene_biotype "protein_coding"; gene_name "HvTIP2"; gene_source "IPK"; tag "Ensembl_canonical"; transcript_biotype "protein_coding"; transcript_source "IPK";
2H-301293086-665585731  IPK five_prime_utr  69413372    69413452    .   +   .   gene_id "HORVU.MOREX.r3.2HG0189670"; transcript_id "HORVU.MOREX.r3.2HG0189670.1"; ID "agat-five_prime_utr-3550"; Parent "HORVU.MOREX.r3.2HG0189670.1"; gene_biotype "protein_coding"; gene_name "HvTIP2"; gene_source "IPK"; tag "Ensembl_canonical"; transcript_biotype "protein_coding"; transcript_source "IPK";
2H-301293086-665585731  IPK start_codon 69413447    69413449    .   +   0   gene_id "HORVU.MOREX.r3.2HG0189670"; transcript_id "HORVU.MOREX.r3.2HG0189670.1"; ID "agat-start_codon-6049"; Parent "HORVU.MOREX.r3.2HG0189670.1"; exon_number "1"; gene_biotype "protein_coding"; gene_name "HvTIP2"; gene_source "IPK"; tag "Ensembl_canonical"; transcript_biotype "protein_coding"; transcript_source "IPK";
2H-301293086-665585731  IPK stop_codon  69412490    69412492    .   +   0   gene_id "HORVU.MOREX.r3.2HG0189670"; transcript_id "HORVU.MOREX.r3.2HG0189670.1"; ID "agat-stop_codon-6042"; Parent "HORVU.MOREX.r3.2HG0189670.1"; exon_number "3"; gene_biotype "protein_coding"; gene_name "HvTIP2"; gene_source "IPK"; tag "Ensembl_canonical"; transcript_biotype "protein_coding"; transcript_source "IPK";
2H-301293086-665585731  IPK three_prime_utr 69412203    69412347    .   +   .   gene_id "HORVU.MOREX.r3.2HG0189670"; transcript_id "HORVU.MOREX.r3.2HG0189670.1"; ID "agat-three_prime_utr-3529"; Parent "HORVU.MOREX.r3.2HG0189670.1"; gene_biotype "protein_coding"; gene_name "HvTIP2"; gene_source "IPK"; tag "Ensembl_canonical"; transcript_biotype "protein_coding"; transcript_source "IPK";

I would need to see the original file (at least for 2H-301293086-665585731) to be able to tell whether the problem comes from AGAT or from the original file.

ADD REPLY • link 9 months ago by Juke34 9.2k

score 1 · Accepted Answer · 2024-10-08

1

8 months ago

IrK ▴ 100

I managed to generate a reference with cellranger-arc mkref, so I wanted to conclude this question. cellranger-arc was giving me misleading error messages saying that my GTF file had duplicates or that some gene features were missing, which led me in completely the wrong direction.

The workflow that worked for me:

Modify GTF file based on Fasta
Make sure you have the right order (hierarchy) in the 3rd "feature" column: gene > transcript > exon ... etc.
Retain only gene_id, transcript_id, and gene_name in the 9th "attributes" column
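
The last step can be sketched roughly like this. It is not the exact code I ran: `trim_attributes` is an illustrative name, and it writes values in double quotes as in the usual GTF convention, which is an assumption about what cellranger-arc prefers.

```python
import re

# Attribute keys to keep, in the order they should be written.
KEEP = ("gene_id", "transcript_id", "gene_name")


def trim_attributes(attributes, keep=KEEP):
    """Keep only the wanted key/value pairs of a GTF attribute string.

    Accepts quoted or unquoted values and always writes them quoted.
    """
    pairs = dict(re.findall(r'(\S+)\s+"?([^";]+)"?\s*;', attributes))
    return " ".join(f'{key} "{pairs[key]}";' for key in keep if key in pairs)


def trim_line(line, keep=KEEP):
    """Apply trim_attributes to column 9 of one tab-separated GTF line."""
    line = line.rstrip("\n")
    fields = line.split("\t")
    if line.startswith("#") or len(fields) != 9:
        return line
    fields[8] = trim_attributes(fields[8], keep)
    return "\t".join(fields)
```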

So, in the end I did not use the AGAT tool, although it's good to know this tool exists. I hope this helps someone who, like me, is modifying a GTF file for building a reference for the first time.

And also wanted to thank Juke34 and Michael for your comments and suggestions - that was tremendous support

ADD COMMENT • link 8 months ago by IrK ▴ 100

0

Great it worked out. Did you in the end also replace mRNA with transcript or did it accept mRNA features?

ADD REPLY • link 8 months ago by Michael 55k

1

The original gtf file has 'transcript' annotation, so I didn't have to change it.

Earlier, when I was using AGAT, that is when 'transcript' was replaced with 'mRNA', and I hadn't realised that.

ADD REPLY • link 8 months ago by IrK ▴ 100