Question

featureCounts - annotation file issue

0

Entering edit mode

5.5 years ago

gokberk ▴ 90

Hi all,

I've been using featureCounts to generate count tables out of my bam files. Previously, it worked fine with bam files which I generated with Subread.

Now, I'm using featureCounts with the bam files I generated with HiSAT2. However, non of the alignments were assigned to any genes, since the chromosome names in my gtf file and bam files were different. I changed the chromosome names in my bam files following the instructions in this post. However, the bam file I generate following this method turns out to be corrupted somehow. samtools view mybam.bam | head command does not give any output and when I run featureCounts, I receive "GZIP ERROR: -5" and still non of the alignments gets assigned to a gene. So I wonder how I can fix this discrepancy between my bam files and gtf file.

I used awk to format the header file and changed all chromosome names accordingly, but it didn't fix the issue. However, when I change chromosome names, blanks between columns change as well for some reason, meaning if there was a tab, it turns into a single space. So, I wonder if there is another way of solving this issue. I would be more than happy if you could help me out.

Here is how my gtf, header and old bam files look right now:

gtf:

NC_005943.1 RefSeq  CDS 3259    4213    .   +   0   transcript_id "gene34201"; gene_id "gene34201"; gene_name "ND1";
NC_005943.1 RefSeq  CDS 4421    5462    .   +   0   transcript_id "gene34202"; gene_id "gene34202"; gene_name "ND2";
NC_005943.1 RefSeq  CDS 5850    7418    .   +   0   transcript_id "gene34203"; gene_id "gene34203"; gene_name "COX1";
NC_005943.1 RefSeq  CDS 7532    8215    .   +   0   transcript_id "gene34204"; gene_id "gene34204"; gene_name "COX2";
NC_005943.1 RefSeq  CDS 8357    8563    .   +   0   transcript_id "gene34205"; gene_id "gene34205"; gene_name "ATP8";
NC_005943.1 RefSeq  CDS 8518    9198    .   +   0   transcript_id "gene34206"; gene_id "gene34206"; gene_name "ATP6";

old header file:

@HD VN:1.0  SO:coordinate
@SQ SN:chr1 LN:225584828
@SQ SN:chr10    LN:92844088
@SQ SN:chr11    LN:133663169
@SQ SN:chr12    LN:125506784
@SQ SN:chr13    LN:108979918
@SQ SN:chr14    LN:127894412
@SQ SN:chr15    LN:111343173
@SQ SN:chr16    LN:77216781
@SQ SN:chr17    LN:95684472
@SQ SN:chr18    LN:70235451
@SQ SN:chr19    LN:53671032
@SQ SN:chr2 LN:204787373
@SQ SN:chr20    LN:74971481
@SQ SN:chr3 LN:185818997
@SQ SN:chr4 LN:172585720
@SQ SN:chr5 LN:190429646
@SQ SN:chr6 LN:180051392
@SQ SN:chr7 LN:169600520
@SQ SN:chr8 LN:144306982
@SQ SN:chr9 LN:129882849
@SQ SN:chrM LN:16564
@SQ SN:chrUn_NW_014806053v1 LN:1653

fixed header file:

@HD    VN:1.0  SO:coordinate
@SQ SN:NC_041754.1  LN:225584828
@SQ SN:NC_041763.1  LN:92844088
@SQ SN:NC_041764.1  LN:133663169
@SQ SN:NC_041765.1  LN:125506784
@SQ SN:NC_041766.1  LN:108979918
@SQ SN:NC_041767.1  LN:127894412
@SQ SN:NC_041768.1  LN:111343173
@SQ SN:NC_041769.1  LN:77216781
@SQ SN:NC_041770.1  LN:95684472
@SQ SN:NC_041771.1  LN:70235451
@SQ SN:NC_041772.1  LN:53671032
@SQ SN:NC_041755.1  LN:204787373
@SQ SN:NC_041773.1  LN:74971481
@SQ SN:NC_041756.1  LN:185818997
@SQ SN:NC_041757.1  LN:172585720
@SQ SN:NC_041758.1  LN:190429646
@SQ SN:NC_041759.1  LN:180051392
@SQ SN:NC_041760.1  LN:169600520
@SQ SN:NC_041761.1  LN:144306982
@SQ SN:NC_041762.1  LN:129882849
@SQ SN:NC_005943.1  LN:16564
@SQ SN:NW_014806053.1   LN:1653
@SQ SN:NW_014806054.1   LN:997
@SQ SN:NW_014806055.1   LN:843

old bam:

SRR5991098.18949202 0   chr1    42294   60  40M *   0   0   GCCCTGCCTCCCACAGCTTTATTTCTTTTGGTTTTGGATG    HHHJJJJJJJJJJJJJJJJJIJJJIJJJJJIGIJJJIIJJ    AS:i:0  XN:i:0  XM:i:0XO:i:0    XG:i:0  NM:i:0  MD:Z:40 YT:Z:UU NH:i:1
SRR5991098.2109594  0   chr1    42295   60  40M *   0   0   CCCTGCCTCCCACAGCTTTATTTCTTTTGGTTTTGGATGC    HHGJJJJJJJJIJCHIIJJGIIJIJJJJIIFHHIGICHFH    AS:i:0  XN:i:0  XM:i:0XO:i:0    XG:i:0  NM:i:0  MD:Z:40 YT:Z:UU NH:i:1
SRR5991098.16345197 0   chr1    42301   60  40M *   0   0   CTCCCACAGCTTTATTTCTTTTGGTTTTGGATGCAAAACA    D:ADEDA;;=+AEEFFEECBEDI<D;>1B<>DD9DDEDAD    AS:i:0  XN:i:0  XM:i:0XO:i:0    XG:i:0  NM:i:0  MD:Z:40 YT:Z:UU NH:i:1
SRR5991098.31884339 0   chr1    42303   60  40M *   0   0   CCCACAGCTTTATTTCTTTTGGTTTTGGATGCAAAACAAA    HHHJJJJJJJJJJJJIJJJJJJHIIJIJIIJJJJJJJJJJ    AS:i:0  XN:i:0  XM:i:

Cheers!

RNA-Seq • 2.5k views

ADD COMMENT • link 5.5 years ago by gokberk ▴ 90

score 4 · Accepted Answer · 2019-05-15

4

Entering edit mode

5.5 years ago

GouthamAtla 12k

I would change chromosome names in GTF which is also computationally efficient.

ADD COMMENT • link 5.5 years ago by GouthamAtla 12k

0

Entering edit mode

Thanks for the advice geek_y! While I was trying to do what you suggested, I realized that the chromosome names in my gtf file and the chromosome names that are given at NCBI's website that I downloaded this gtf file do not match. So, I found the correct chromosome name from the gft file itself and it fixed my problem. Thanks again!

ADD REPLY • link 5.5 years ago by gokberk ▴ 90

0

Entering edit mode

Instead of closing the question, please mark the answer as accepted to indicate that it solved your problem. That will help others in the future.

enter image description here

ADD REPLY • link 5.5 years ago by ATpoint 85k