Question

nf-core/rnaseq, problem with ensembl reference

1


17 months ago

Filip ▴ 10

Dear community, I am running the nf-core/rnaseq pipeline:

sudo nextflow run nf-core/rnaseq \
--input microsheet.csv \
--outdir rnaseq \
--skip_alignment \
--pseudo_aligner salmon \
--fasta references/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
--gtf references/ensembl/Homo_sapiens.GRCh38.111.gtf \
--transcript_fasta references/ensembl/Homo_sapiens.GRCh38.cdna.all.fa \
--max_memory 50GB \
--max_cpus 18 \
-profile docker

And I got an error:

ERROR ~ Error executing process > 'NFCORE_RNASEQ:RNASEQ:QUANTIFY_PSEUDO_ALIGNMENT:TX2GENE
Caused by:
Missing output file(s) *.tsv expected by process NFCORE_RNASEQ:RNASEQ:QUANTIFY_PSEUDO_ALIGNMENT:TX2GENE (Homo_sapiens.GRCh38.dna.primary_assembly.filtered.gtf)

Command error:
__main__ - 2024-03-18 11:10:51,695 WARNING: No attribute in GTF matching transcripts
__main__ - 2024-03-18 11:10:51,695 ERROR: Failed to map transcripts to genes.

My reference comes from Ensembl. On checking the files, I found that the .gtf file contains unversioned transcript IDs such as transcript_id "ENST00000511072", while the entries in my counts, which come from the transcriptome reference, carry a version suffix, for example ENST00000390469.2.
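One way to run this kind of check is sketched below. It is only an illustration: the function names gtf_transcript_ids and fasta_transcript_ids are made up for this example, and it assumes that the FASTA header starts with the transcript ID followed by a space, as in the Ensembl cDNA file.

```shell
#!/bin/bash
# Hypothetical helpers (names invented for this sketch) for comparing ID formats.

# List the unique values of the transcript_id attribute in a GTF file.
gtf_transcript_ids() {
    grep -o 'transcript_id "[^"]*"' "$1" | cut -d '"' -f 2 | sort -u
}

# List the unique sequence IDs in a FASTA file: the first space-delimited
# token of each header line, without the leading '>'.
fasta_transcript_ids() {
    grep '^>' "$1" | cut -d ' ' -f 1 | sed 's/^>//' | sort -u
}

# Example usage (paths taken from the command above):
#   gtf_transcript_ids references/ensembl/Homo_sapiens.GRCh38.111.gtf | head -n 3
#   fasta_transcript_ids references/ensembl/Homo_sapiens.GRCh38.cdna.all.fa | head -n 3
```

If the first list shows IDs without a version suffix and the second shows IDs with one, the two files disagree on naming in the way described here.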

I can't find a GTF file from Ensembl that includes the version suffix (.1, .2, etc.) in the transcript ID. Could the version be causing the error? It is surprising that the pipeline doesn't check for this.

Any advice is much appreciated. Thank you.

nfcore nextflow ensembl rnaseq • 2.3k views

ADD COMMENT • link updated 10 months ago by Kai • 0 • written 17 months ago by Filip ▴ 10

0


Since you are using salmon you should not need the GTF file. Can you try taking that out?

ADD REPLY • link 17 months ago by GenoMax 153k

0


It is specified in the nfcore docs that I need it:

However, you can provide the --skip_alignment parameter if you would like to run Salmon or Kallisto in isolation. By default, the pipeline will use the genome fasta and gtf file to generate the transcripts fasta file, and then to build the Salmon index.

I tried running it to confirm and got:

No GTF or GFF3 annotation specified! The pipeline requires at least one of these files.

Having said that, I did obtain the quant.sf files from Salmon; it is the TX2GENE step that fails.

ADD REPLY • link 17 months ago by Filip ▴ 10

0


Did you ever manage to solve this? I'm running into the exact same problem...

ADD REPLY • link 13 months ago by christiantd • 0

score 0 · Answer 1 · 2024-11-04

Hello,

This is something that I came across fairly recently. The error is somewhat misleading, as the issue appears to be with the transcript FASTA as opposed to the GTF directly.

In my case, the problem was that the transcript_fasta contained Ensembl IDs with the version appended to the end, e.g. "ENSG0000013961.1". Modifying the transcript FASTA to remove these suffixes solved the error for me (I also removed the gene versions, but I do not know whether that was necessary to solve the error or whether it is best practice).

A script such as the following might do the job:

#!/bin/bash

# Define the input and output files
input_file="input fasta"
output_file="output fasta"

# Use sed to perform both replacements and save the result to a new file.
# I'm fairly sure ENST and ENSG are the transcript and gene prefixes for human;
# change them if your species is different (open the file and check the FASTA headers).
sed -e 's/^\(>ENST[0-9]*\)\.[0-9]*/\1/' -e 's/\(gene:ENSG[0-9]*\)\.[0-9]*/\1/' "$input_file" > "$output_file"

echo "Replacement completed. Modified file saved as $output_file."
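To see what this kind of substitution does before running it on a whole file, here is a small sketch applied to a single header. The header is invented for illustration (its gene ID and fields are not copied from a real file), and the function names strip_versions and count_versioned_headers are made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: strip version suffixes from ENST/ENSG IDs read on stdin.
# \( ... \) captures the ID without its version and \1 writes it back.
strip_versions() {
    sed -e 's/^\(>ENST[0-9]*\)\.[0-9]*/\1/' \
        -e 's/\(gene:ENSG[0-9]*\)\.[0-9]*/\1/'
}

# Hypothetical helper: count header lines on stdin that still start with a
# versioned transcript ID. grep -c exits non-zero on zero matches, hence '|| true'.
count_versioned_headers() {
    grep -c '^>ENST[0-9]*\.[0-9]' || true
}

# Made-up example record, for illustration only.
sample='>ENST00000390469.2 cdna gene:ENSG00000211821.2 gene_biotype:TR_D_gene
ACGT'

printf '%s\n' "$sample" | strip_versions
printf '%s\n' "$sample" | strip_versions | count_versioned_headers
```

On this sample the header comes out as ">ENST00000390469 cdna gene:ENSG00000211821 gene_biotype:TR_D_gene", the sequence line is untouched, and the count of versioned headers is 0. Running the same count on the real output file is a cheap sanity check before rerunning the pipeline.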

Hope that helps anyone who comes across this in the future! :)