annotation of bin to get gene ID
1
1
Entering edit mode
2.8 years ago
salmon ▴ 10

Hello,

I have my DNA sequences for pure culture bacteria. I did sickle, megahit and metabat to obtain the bin. The completeness of abtained bin is 98.04% (through CheckM). I annotated the bin using Prokka and I got the output as gff file.

The gff file has following format:

k141_1006   Prodigal:002006 CDS 352 882 .   +   0   ID=bin1_00001;eC_number=3.4.25.2;Name=hslV;db_xref=COG:COG5405;gene=hslV;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A7B8;locus_tag=bin1_00001;product=ATP-dependent protease subunit HslV
k141_1006   Prodigal:002006 CDS 920 2257    .   +   0   ID=bin1_00002;Name=hslU;db_xref=COG:COG1220;gene=hslU;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A6H5;locus_tag=bin1_00002;product=ATP-dependent protease ATPase subunit HslU
k141_5512   Prodigal:002006 CDS 154 1296    .   +   0   ID=bin1_00003;eC_number=1.1.1.290;Name=pdxB_1;db_xref=COG:COG0111;gene=pdxB_1;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:Q9I3W9;locus_tag=bin1_00003;product=Erythronate-4-phosphate dehydrogenase
k141_5512   Prodigal:002006 CDS 1350    1523    .   -   0   ID=bin1_00004;inference=ab initio prediction:Prodigal:002006;locus_tag=bin1_00004;product=hypothetical protein
k141_5512   Prodigal:002006 CDS 1769    3526    .   +   0   ID=bin1_00005;eC_number=7.5.2.6;Name=msbA_1;db_xref=COG:COG1132;gene=msbA_1;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:Q9KQW9;locus_tag=bin1_00005;product=Lipid A export ATP-binding/permease protein MsbA
k141_5512   Prodigal:002006 CDS 3590    4270    .   -   0   ID=bin1_00006;inference=ab initio prediction:Prodigal:002006;locus_tag=bin1_00006;product=hypothetical protein

I am unsure how to get different gene ID so that I can use them for feature count. I can see for some contig it given the eC number but for other it didnt give any. Can someone please help.

Thank you in advance

geneID Prokka • 1.8k views
ADD COMMENT
0
Entering edit mode
2.8 years ago

The EC number you mention is a 'functional annotation' referring to the EC classification for enzymes.

To use in things as FeatureCounts you don't need those, you need the IDs that are denoted as 'ID=' in that GFF file (eg. ID=bin1_00001 ; ID=bin1_00002 ; ID=bin1_00003 ) , I would assume those are unique in your GFF file and thus those are the ones you can use for featureCount.

I don't know PROKKA enough to tell if there is some kind of parameter you can set to have a different ID naming schema.

ADD COMMENT
0
Entering edit mode

Thank you for your reply I tried to used feature count after using ID = bin1_00001, bin1_00002...so on but its not generating any counts The output is

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           S J2.bam.sorted.bam                              ||
||                                                                            ||
||              Annotation : R data.frame                                     ||
||      Dir for temp files : .                                                ||
||                 Threads : 1                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : no                                               ||
||      Multimapping reads : counted                                          ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file .Rsubread_UserProvidedAnnotation_pid29252 ...         ||

Error: Line 5332 contains a format error. The expected annotation format is SAF.
Error in featureCounts(bamFiles, annot.ext = smgff) :
  No counts were generated.

I can see that my sorted Bam file also has same ID. I am able to pull some mid lines of bam file.

   @SQ     SN:bin1_05318   LN:47
    @SQ     SN:bin1_05319   LN:180
    @SQ     SN:bin1_05320   LN:293
    @SQ     SN:bin1_05321   LN:217
    @SQ     SN:bin1_05322   LN:62
    @SQ     SN:bin1_05323   LN:213
    @SQ     SN:bin1_05324   LN:77
    @SQ     SN:bin1_05325   LN:465
    @SQ     SN:bin1_05326   LN:573

The head of my gff file is

        Chr Start  End Strand     GeneID
1 k141_1006   920 2257      + bin1_00002
2 k141_5512   154 1296      + bin1_00003
3 k141_5512  1350 1523      - bin1_00004
4 k141_5512  1769 3526      + bin1_00005
5 k141_5512  3590 4270      - bin1_00006
6 k141_5512  4360 5346      + bin1_00007

I am not sure where I am going wrong. I tried many times. Any help highly appreciated.

ADD REPLY
0
Entering edit mode

looking at this it seems that your input fasta file and input gff file are not 'in sync'. it looks like the sequence IDs used in the GFF file do not correspond to the one present in the fasta file.

Those @SQ SN:**bin1_05320** LN:293 should be the same ID as in **k141_1006** 920 2257 + bin1_00002 , which seems to be clearly different. SO you need to make sure you're working with the correct files.

Moreover looking at the length of the sequences in your BAM file , thye seem very short to be genomic sequences. Can it be you are using the transcriptome fasta file to map against? if so, you light be better of to directly use something like for instance Salmon on your transcriptome file.

Alternatively you can also use samtools stat option to simply count the number of reads aligning to a given transcript.

The FeatureCount approach requires you map against the genome and then use the GFF file to link those mappings to genes (as they are indicated in the GFF file)

ADD REPLY
0
Entering edit mode

In bam it starts as "@SQ SN:bin1_00001 LN:125" and in gff it starts as "1 k141_1006 920 2257 + bin1_00001". Do you mean I should make 2nd column of bam file as bin1_00001 instead of SN:bin1_00001 for all the rows. In this way it will match to geneID column of gff file. Please correct me If i am wrong.

Bam File

@HD VN:1.5  SO:coordinate   GO:query
@SQ SN:bin1_00001   LN:125
@SQ SN:bin1_00002   LN:274
@SQ SN:bin1_00003   LN:248
@SQ SN:bin1_00004   LN:30
@SQ SN:bin1_00005   LN:358
@SQ SN:bin1_00006   LN:142

Yes the bam file I got it from my transcriptomes. I removed rRNA and contaminants from my RNA fastq files and then mapped my non_rRNA fastq files against the fasta file (obtained from my privately annotated genome).

I want to quantify the expression of different genes from transcriptomics data. Will salmon and samtools stat (instead of using feature count) help me to get number of reads assigned to each gene. As I studied salmon requires target transcriptome to build index. I have only my transcriptomes fastq files (after trimming and removing rRNA reads) to map to. Do I need to convert my sorted bam file back to fastq file and gtf file to fasta file (transcript.fa) and then run salmon. Please help me to understand If I am wrong.

By this "Feature count requires you to map against the genome": Do you mean mapping the transcriptome file with fasta file (my annotated genome) using feature count. I actually mapped using bowtie2. I am using feature count directly to count reads (after bowtie mapping process). Am I doing something wrong?

ADD REPLY
1
Entering edit mode

Do you mean I should make 2nd column of bam file as bin1_00001 instead of SN:bin1_00001 for all the rows.

no, no. You need to make sure that the IDs from the fasta file/GFF file are the same. so the part behind SN: in the bam file should be the first thing in the fasta file after each > in the fasta header.

You never map to a fastq file (those are the reads, which you want to count), but always to a fasta file (either genome or transcriptome).

To work with the FeatureCount approach you will need to map your fastq files to the genome fasta file (that you assembled/annotated). the resulting bam file you can then use as input, together with the correct GFF file, for FeatureCount.

Is the first column in your GFF file really 1 2 3 4 ... or is that a copy/paste issue? That first column in GFF file should match the SN: ID. If that is not the case, don't start editing the BAM file but rerun the analysis with the correct files.

But still, looking at the length of those sequences in the BAM file (last column is the length) they do not look, length-wise, as being genomic sequences (or your assembly is very poor and thus very fragmented). They look more like lengths of genes/CDSs/transcripts ...

ADD REPLY

Login before adding your answer.

Traffic: 3009 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6