Question

RNA-seq using rasflow - issues with annotation file

0

Entering edit mode

2.8 years ago

cdeantoneo31 ▴ 20

Hi everyone!

I have used rasflow in the past to process and analyze RNA-sequencing data in human and mouse cells, but now I am trying to do the same with a viral genome and seem to be running into issues with the annotation file.

the genome and annotation I'm using is available here, and the gff looks as below

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM629838v1
#!genome-build-accession NCBI_Assembly:GCA_006298385.1
##sequence-region HM159613.1 1 3860
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10886
HM159613.1  Genbank region  1   3860    .   +   .   ID=HM159613.1:1..3860;Dbxref=taxon:10886;Name=L1;collected-by=Bernard N. Fields;collection-date=Jan-2002;country=USA;gbkey=Src;isolate=Dearing;mol_type=genomic RNA;nat-host=Homo sapiens;old-name=Mammalian Orthoreovirus 3 Strain Dearing;segment=L1;serotype=3;strain=T3D
HM159613.1  Genbank gene    25  3828    .   +   .   ID=gene-L1;Name=L1;gbkey=Gene;gene=L1;gene_biotype=protein_coding
HM159613.1  Genbank CDS 25  3828    .   +   0   ID=cds-ADJ00316.1;Parent=gene-L1;Dbxref=NCBI_GP:ADJ00316.1;Name=ADJ00316.1;gbkey=CDS;gene=L1;product=lambda-3 protein;protein_id=ADJ00316.1
##sequence-region HM159614.1 1 3916
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10886
HM159614.1  Genbank region  1   3916    .   +   .   ID=HM159614.1:1..3916;Dbxref=taxon:10886;Name=L2;collected-by=Bernard N. Fields;collection-date=Jan-2002;country=USA;gbkey=Src;isolate=Dearing;mol_type=genomic RNA;nat-host=Homo sapiens;old-name=Mammalian Orthoreovirus 3 Strain Dearing;segment=L2;serotype=3;strain=T3D
HM159614.1  Genbank gene    14  3883    .   +   .   ID=gene-L2;Name=L2;gbkey=Gene;gene=L2;gene_biotype=protein_coding
HM159614.1  Genbank CDS 14  3883    .   +   0   ID=cds-ADJ00317.1;Parent=gene-L2;Dbxref=NCBI_GP:ADJ00317.1;Name=ADJ00317.1;gbkey=CDS;gene=L2;product=lambda-2 protein;protein_id=ADJ00317.1
##sequence-region HM159615.1 1 3901
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10886
HM159615.1  Genbank region  1   3901    .   +   .   ID=HM159615.1:1..3901;Dbxref=taxon:10886;Name=L3;collected-by=Bernard N. Fields;collection-date=Jan-2002;country=USA;gbkey=Src;isolate=Dearing;mol_type=genomic RNA;nat-host=Homo sapiens;old-name=Mammalian Orthoreovirus 3 Strain Dearing;segment=L3;serotype=3;strain=T3D
HM159615.1  Genbank gene    14  3841    .   +   .   ID=gene-L3;Name=L3;gbkey=Gene;gene=L3;gene_biotype=protein_coding
HM159615.1  Genbank CDS 14  3841    .   +   0   ID=cds-ADJ00318.1;Parent=gene-L3;Dbxref=NCBI_GP:ADJ00318.1;Name=ADJ00318.1;gbkey=CDS;gene=L3;product=lambda-1 protein;protein_id=ADJ00318.1

When I run rasflow, I get the following error

       ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
      v1.6.4

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           P D2.sort.bam                                    ||
||                                                                            ||
||             Output file : D2_count.tsv                                     ||
||                 Summary : D2_count.tsv.summary                             ||
||              Annotation : GCA_006298385.1_ASM629838v1_genomic.gff (GTF)    ||
||      Dir for temp files : output/input_v_IP/genome/countFile               ||
||                                                                            ||
||                 Threads : 24                                               ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : counted                                          ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file GCA_006298385.1_ASM629838v1_genomic.gff ...           ||
ERROR: no features were loaded in format GTF. The annotation format can be specified by the '-F' option, and the required feature type can be specified by the '-t' option..
The porgram has to terminate.

[Thu Oct  6 17:30:51 2022]
Error in rule featureCount:
    jobid: 0
    output: output/input_v_IP/genome/countFile/D2_count.tsv, output/input_v_IP/genome/countFile/D2_count.tsv.summary

RuleException:
CalledProcessError in line 109 of /home/carly/Desktop/RASflow/workflow/align_count_genome.rules:
Command ' set -euo pipefail;  featureCounts -p -T 24 -t exon -g region -a ref/annotation/GCA_006298385.1_ASM629838v1_genomic.gff -o output/input_v_IP/genome/countFile/D2_count.tsv data/output/input_v_IP/genome/bamFileSort/D2.sort.bam && tail -n +3 output/input_v_IP/genome/countFile/D2_count.tsv | cut -f1,7 > temp.D2 && mv temp.D2 output/input_v_IP/genome/countFile/D2_count.tsv ' returned non-zero exit status 255.
  File "/home/carly/Desktop/RASflow/workflow/align_count_genome.rules", line 109, in __rule_featureCount
  File "/home/carly/miniconda3/envs/rasflow/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Exiting because a job execution failed. Look above for error message

I've tried using gff read to convert it to a gtf, and get the same error. This is what the converted file looks like

HM159613.1  Genbank transcript  25  3828    .   +   .   transcript_id "gene-L1"; gene_id "gene-L1"; gene_name "L1"
HM159613.1  Genbank CDS 25  3828    .   +   0   transcript_id "gene-L1"; gene_name "L1";
HM159614.1  Genbank transcript  14  3883    .   +   .   transcript_id "gene-L2"; gene_id "gene-L2"; gene_name "L2"
HM159614.1  Genbank CDS 14  3883    .   +   0   transcript_id "gene-L2"; gene_name "L2";
HM159615.1  Genbank transcript  14  3841    .   +   .   transcript_id "gene-L3"; gene_id "gene-L3"; gene_name "L3"
HM159615.1  Genbank CDS 14  3841    .   +   0   transcript_id "gene-L3"; gene_name "L3";
HM159616.1  Genbank transcript  14  2224    .   +   .   transcript_id "gene-M1"; gene_id "gene-M1"; gene_name "M1"
HM159616.1  Genbank CDS 14  2224    .   +   0   transcript_id "gene-M1"; gene_name "M1";
HM159617.1  Genbank transcript  31  2160    .   +   .   transcript_id "gene-M2"; gene_id "gene-M2"; gene_name "M2"
HM159617.1  Genbank CDS 31  2160    .   +   0   transcript_id "gene-M2"; gene_name "M2";
HM159618.1  Genbank transcript  19  2184    .   +   .   transcript_id "gene-M3"; gene_id "gene-M3"; gene_name "M3"
HM159618.1  Genbank CDS 19  2184    .   +   0   transcript_id "gene-M3"; gene_name "M3";
HM159619.1  Genbank transcript  13  1380    .   +   .   transcript_id "gene-S1"; gene_id "gene-S1"; gene_name "S1"
HM159619.1  Genbank CDS 13  1380    .   +   0   transcript_id "gene-S1"; gene_name "S1";
HM159620.1  Genbank transcript  19  1275    .   +   .   transcript_id "gene-S2"; gene_id "gene-S2"; gene_name "S2"
HM159620.1  Genbank CDS 19  1275    .   +   0   transcript_id "gene-S2"; gene_name "S2";
HM159621.1  Genbank transcript  28  1128    .   +   .   transcript_id "gene-S3"; gene_id "gene-S3"; gene_name "S3"
HM159621.1  Genbank CDS 28  1128    .   +   0   transcript_id "gene-S3"; gene_name "S3";
HM159622.1  Genbank transcript  33  1130    .   +   .   transcript_id "gene-S4"; gene_id "gene-S4"; gene_name "S4"
HM159622.1  Genbank CDS 33  1130    .   +   0   transcript_id "gene-S4"; gene_name "S4";

does anyone have a solution to this?

gtf rasflow gff RNA-seq • 722 views

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 2.8 years ago by cdeantoneo31 ▴ 20

0

Entering edit mode

I've figured out how to fix this, and I want to put the solution down here for future readers who may have a similar issue.

I've gone into the align_count_genomes.rules file in workflow, and changed "exon" to CDS in the featurecounts area, so it now reads

rule featureCount:
    input:
        sort = intermediate_path + "/bamFileSort/{sample}.sort.bam",
        annotation = config["ANNOTATION"]
    output:
        count = final_path + "/countFile/{sample}_count.tsv",
        count_summary = final_path + "/countFile/{sample}_count.tsv.summary"
    run:
        if config["COUNTER"]=="featureCounts":
            if config["END"]=="pair":
                shell("featureCounts -p -T {config[NCORE]} -t CDS -g {config[ATTRIBUTE]} -a {input.annotation} -o {output.count} {input.sort} && tail -n +3 {output.count} | cut -f1,7 > temp.{wildcards.sample} && mv temp.{wildcards.sample} {output.count}")
            else:
                shell("featureCounts  -T {config[NCORE]} -t CDS -g {config[ATTRIBUTE]} -a {input.annotation} -o {output.count} {input.sort} && tail -n +3 {output.count} | cut -f1,7 > temp.{wildcards.sample} && mv temp.{wildcards.sample} {output.count}")
        elif config["COUNTER"]=="htseq-count":
            shell("htseq-count -f bam -i {config[ATTRIBUTE]} -s no -t CDS {input.sort} {input.annotation} | sed '/^__/ d' > {output.count}")

Based on the mouse gtf file ive used in the past, I am guessing the original was giving an error because the third column of the mouse gtf has "exon" while this one has only CDS or transcript. I think changing the code in the rules file fixed this. I was also able to just use the original gff file

ADD REPLY • link 2.8 years ago by cdeantoneo31 ▴ 20