featureCounts log: "WARNING contig is not found in the provided genome file!"
2
0
Entering edit mode
4.9 years ago

Hi, I'm working with sunflower (Helianthus annuus) RNAseq data. I mapped my read to the XRQ genome, and then tried to count reads using featureCounts.

gff=/path/to/file/HanXRQr1.0-20151230-EGN-r1.2.gff3
genome=/path/to/file/HanXRQr1.0-20151230.fa
bamdir=/path/to/bams/

featureCounts -T 40 -g ID -t gene -a $gff -p --primary --byReadGroup -J -G $genome-o all.pe.gene $bamdir/*.bam > all.pe.gene.log 2>&1 ;

The software runs without problem and produces the result files, but at the end of the log file, the penultimate line exactly, I got the warning:

WARNING contig 'HanXRQCP' is not found in the provided genome file!
  • The HanXRQCP chromosome (chloroplastic chromosome) is in the genome fasta file and in the gff file.
  • The results file show the chloroplastic genes normally, with counts and all, so the program is correctly reading the HanXRQCP in the fasta, in the bams, and in the gff.

The only thing I noted is that the order of the chromosomes in the .fasta, and in the .gff is different

Contig order in genome.fasta:

  • Han 1 - 17 croms
  • Han 1512 fragmentos
  • MT
  • CP

Contig order in .gff:

  • CP
  • Han 1512 fragmentos
  • Han 1 - 17 croms
  • MT

¿Any clue what is going on?¿Should I trust the results?

Thanks in advance

RNA-Seq featureCounts SubRead • 1.9k views
ADD COMMENT
1
Entering edit mode

The WARNING message indicates that featureCounts didn't find the HanXRQCP sequence in the fasta file but it is mentioned in the BAM file.

Can you try this:

cat /path/to/file/HanXRQr1.0-20151230.fa | grep HanXRQCP

Maybe the format is a little tricky so featureCounts misinterpreted the sequence name.

ADD REPLY
0
Entering edit mode

Is the chromosome/config name identical in both files? Does it contain spaces like HanXRQCP 123456bp further information? Are the names from the alignment altered?

ADD REPLY
1
Entering edit mode
4.9 years ago
dbpzdbpz ▴ 220

I downloaded HanXRQr1.0-20151230.fa (I assume that it is the scaffold file on www.heliagene.org). Indeed, it has spaces in the contig names like

>HaXRQ-20151104-s00077 len=1683791

FeatureCounts treated the whole string, including "len=....", as the contig name, hence it cannot match the contig names with the contig names in the BAM file or in the annotation file.

I have amended the code to remove the part of the contig names after the space characters. The next release will contain this change. Before that, you can manually remove the extra information from the contig names:

cat HaXRQ-20151104.fa |sed 's/ len=[0-9]*$//g' > HaXRQ-20151104-fixed.fa

and use HaXRQ-20151104-fixed.fa for running featureCounts.

ADD COMMENT
0
Entering edit mode
4.9 years ago
Ankit ▴ 510

Did you sort your bam by readname or by coordinates? You can try sorting as per the required input to feature counts.

If your contig is not too large. You can upload the bam in igv and manually count the reads and correlate with featurecounts output.

If it is too large run bedtools coverage to see how many reads overlap within your contig of interest.

ADD COMMENT

Login before adding your answer.

Traffic: 2194 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6