I am doing some RNA analysis and am having issues trying to generate count data. I mapped my reads to a reference genome fasta file (genbank fasta file from ncbi) using bbmap and .sam files as the output.
I am now trying to use featurecounts to generate count data but am running into issues with my gtf file. The gtf file is also from ncbi and corresponds to the alignment fasta file used for mapping. This is it's structure:
#gtf-version 2.2
#!genome-build ASM14559v1
#!genome-build-accession NCBI_Assembly:GCA_000145595.1
CP002120.1 Genbank gene 1 1362 . + . gene_id "SAA6008_00001"; transcript_id ""; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "SAA6008_00001";
CP002120.1 Genbank CDS 1 1359 . + 0 gene_id "SAA6008_00001"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; gene "dnaA"; locus_tag "SAA6008_00001"; product "chromosomal replication initiator protein DnaA"; protein_id "ADL64089.1"; transl_table "11"; exon_number "1";
CP002120.1 Genbank start_codon 1 3 . + 0 gene_id "SAA6008_00001"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; gene "dnaA"; locus_tag "SAA6008_00001"; product "chromosomal replication initiator protein DnaA"; protein_id "ADL64089.1"; transl_table "11"; exon_number "1";
CP002120.1 Genbank stop_codon 1360 1362 . + 0 gene_id "SAA6008_00001"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; gene "dnaA"; locus_tag "SAA6008_00001"; product "chromosomal replication initiator protein DnaA"; protein_id "ADL64089.1"; transl_table "11"; exon_number "1";
CP002120.1 Genbank gene 1640 2773 . + . gene_id "SAA6008_00002"; transcript_id ""; gbkey "Gene"; gene "dnaN"; gene_biotype "protein_coding"; locus_tag "SAA6008_00002";
CP002120.1 Genbank CDS 1640 2770 . + 0 gene_id "SAA6008_00002"; transcript_id "unassigned_transcript_2"; gbkey "CDS"; gene "dnaN"; locus_tag "SAA6008_00002"; product "DNA-directed DNA polymerase beta subunit"; protein_id "ADL64090.1"; transl_table "11"; exon_number "1";
CP002120.1 Genbank start_codon 1640 1642 . + 0 gene_id "SAA6008_00002"; transcript_id "unassigned_transcript_2"; gbkey "CDS"; gene "dnaN"; locus_tag "SAA6008_00002"; product "DNA-directed DNA polymerase beta subunit"; protein_id "ADL64090.1"; transl_table "11"; exon_number "1";
CP002120.1 Genbank stop_codon 2771 2773 . + 0 gene_id "SAA6008_00002"; transcript_id "unassigned_transcript_2"; gbkey "CDS"; gene "dnaN"; locus_tag "SAA6008_00002"; product "DNA-directed DNA polymerase beta subunit"; protein_id "ADL64090.1"; transl_table "11"; exon_number "1";
CP002120.1 Genbank gene 3163 3399 . + . gene_id "SAA6008_00003"; transcript_id ""; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "SAA6008_00003";
CP002120.1 Genbank CDS 3163 3396 . + 0 gene_id "SAA6008_00003"; transcript_id "unassigned_transcript_3"; gbkey "CDS"; locus_tag "SAA6008_00003"; product "conserved hypothetical protein"; protein_id "ADL64091.1"; transl_table "11"; exon_number "1";
CP002120.1 Genbank start_codon 3163 3165 . + 0 gene_id "SAA6008_00003"; transcript_id "unassigned_transcript_3"; gbkey "CDS"; locus_tag "SAA6008_00003"; product "conserved hypothetical protein"; protein_id "ADL64091.1"; transl_table "11"; exon_number "1";
CP002120.1 Genbank stop_codon 3397 3399 . + 0 gene_id "SAA6008_00003"; transcript_id "unassigned_transcript_3"; gbkey "CDS"; locus_tag "SAA6008_00003"; product "conserved hypothetical protein"; protein_id "ADL64091.1"; transl_table "11"; exon_number "1";
CP002120.1 Genbank gene 3396 4508 . + . gene_id "SAA6008_00004"; transcript_id ""; gbkey "Gene"; gene "recF"; gene_biotype "protein_coding"; locus_tag "SAA6008_00004";
CP002120.1 Genbank CDS 3396 4505 . + 0 gene_id "SAA6008_00004"; transcript_id "unassigned_transcript_4"; gbkey "CDS"; gene "recF"; locus_tag "SAA6008_00004"; note "Required for DNA replication; binds preferentially to single-stranded, linear DNA"; product "DNA repair and genetic recombination protein"; protein_id "ADL64092.1"; t$
CP002120.1 Genbank start_codon 3396 3398 . + 0 gene_id "SAA6008_00004"; transcript_id "unassigned_transcript_4"; gbkey "CDS"; gene "recF"; locus_tag "SAA6008_00004"; note "Required for DNA replication; binds preferentially to single-stranded, linear DNA"; product "DNA repair and genetic recombination protein"; protein_id "ADL640$
CP002120.1 Genbank stop_codon 4506 4508 . + 0 gene_id "SAA6008_00004"; transcript_id "unassigned_transcript_4"; gbkey "CDS"; gene "recF"; locus_tag "SAA6008_00004"; note "Required for DNA replication; binds preferentially to single-stranded, linear DNA"; product "DNA repair and genetic recombination protein"; protein_id "ADL640$
CP002120.1 Genbank gene 4518 6452 . + . gene_id "SAA6008_00005"; transcript_id ""; gbkey "Gene"; gene "gyrB"; gene_biotype "protein_coding"; locus_tag "SAA6008_00005";
CP002120.1 Genbank CDS 4518 6449 . + 0 gene_id "SAA6008_00005"; transcript_id "unassigned_transcript_5"; gbkey "CDS"; gene "gyrB"; locus_tag "SAA6008_00005"; note "TIGRFAM: DNA gyrase, B subunit PFAM: DNA gyrase, subunit B domain protein; ATP-binding region, ATPase domain protein domain protein; TOPRIM
When I try to run featurecounts in R using this code:
featureCounts("s1_mapped.sam", annot.ext = "GCA_000145595.1_ASM14559v1_genomic.gtf", isGTFAnnotationFile = T)
I get the following error:
The chromosome name contains unexpected characters: "CP002120.1 Staphylococcus aureus subsp. aureus str. JKD6008, complete genome" (77 chars)
featureCounts has to stop running
FATAL Error: The program has to terminate and no counting file is generated.
Error in featureCounts("s1_mapped.sam", annot.ext = "GCA_000145595.1_ASM14559v1_genomic.gtf", :
No counts were generated.
Do I need to format my gtf file in some way? I have also tried the code with GTF.featureType = "gene" but I get the same error.
You are correct the headers in the fasta file are:
I will try the reformat.sh option, if that doesn't work how would I go about doing option 1?
What aligner did you use? Looks like it passed the entire fasta header from your reference file along (not many aligners do that, they generally truncate the headers after first white space).
You will need to edit the header in reference leaving just the accession. Save that file and re-build a new index, followed by re-alignment of data.
I used bbmap. Ok I will continue with option 2, if that doesn't work I will delete the unwanted characters and re-do the mapping.
Aha, if you used BBMap then option 2 will work for sure since you have BBMap installed.
In future remember to use option
trd=t
when you align the data usingbbmap.sh
. This will truncate reference names after first space. BBMap is one of the rare aligners that passes along the entire fasta header to index and then on to the alignments, unless you truncate the names using the optiontrd=t
.Awesome thank you for your help!
Ok so remapping after removing the extra string worked and featurecounts now works so thank you. The only thing I don't like is it is using the gene id (example SAA6008_00001) rather than the gene name (dnaA). Is there a way to run featurecounts so that it uses the gene name instead? I tried using GTF.attrType = "gene" but I get this error:
Which I don't understand, it can find all the other columns but not gene
Are you adding
GTF.featureType = "CDS"
? That would be needed forgene
to work.