Question

featureCounts log: "WARNING contig is not found in the provided genome file!"

0

Entering edit mode

4.8 years ago

Andrés Ribone ▴ 60

Hi, I'm working with sunflower (Helianthus annuus) RNAseq data. I mapped my read to the XRQ genome, and then tried to count reads using featureCounts.

gff=/path/to/file/HanXRQr1.0-20151230-EGN-r1.2.gff3
genome=/path/to/file/HanXRQr1.0-20151230.fa
bamdir=/path/to/bams/

featureCounts -T 40 -g ID -t gene -a $gff -p --primary --byReadGroup -J -G $genome-o all.pe.gene $bamdir/*.bam > all.pe.gene.log 2>&1 ;

The software runs without problem and produces the result files, but at the end of the log file, the penultimate line exactly, I got the warning:

WARNING contig 'HanXRQCP' is not found in the provided genome file!

The HanXRQCP chromosome (chloroplastic chromosome) is in the genome fasta file and in the gff file.
The results file show the chloroplastic genes normally, with counts and all, so the program is correctly reading the HanXRQCP in the fasta, in the bams, and in the gff.

The only thing I noted is that the order of the chromosomes in the .fasta, and in the .gff is different

Contig order in genome.fasta:

Han 1 - 17 croms
Han 1512 fragmentos
MT
CP

Contig order in .gff:

CP
Han 1512 fragmentos
Han 1 - 17 croms
MT

¿Any clue what is going on?¿Should I trust the results?

Thanks in advance

RNA-Seq featureCounts SubRead • 1.9k views

ADD COMMENT • link updated 4.8 years ago by dbpzdbpz ▴ 210 • written 4.8 years ago by Andrés Ribone ▴ 60

1

Entering edit mode

The WARNING message indicates that featureCounts didn't find the HanXRQCP sequence in the fasta file but it is mentioned in the BAM file.

Can you try this:

cat /path/to/file/HanXRQr1.0-20151230.fa | grep HanXRQCP

Maybe the format is a little tricky so featureCounts misinterpreted the sequence name.

ADD REPLY • link 4.8 years ago by dbpzdbpz ▴ 210

0

Entering edit mode

Is the chromosome/config name identical in both files? Does it contain spaces like HanXRQCP 123456bp further information? Are the names from the alignment altered?

ADD REPLY • link 4.8 years ago by michael.ante ★ 3.9k

score 1 · Answer 1 · 2020-02-10

I downloaded HanXRQr1.0-20151230.fa (I assume that it is the scaffold file on www.heliagene.org). Indeed, it has spaces in the contig names like

>HaXRQ-20151104-s00077 len=1683791

FeatureCounts treated the whole string, including "len=....", as the contig name, hence it cannot match the contig names with the contig names in the BAM file or in the annotation file.

I have amended the code to remove the part of the contig names after the space characters. The next release will contain this change. Before that, you can manually remove the extra information from the contig names:

cat HaXRQ-20151104.fa |sed 's/ len=[0-9]*$//g' > HaXRQ-20151104-fixed.fa

and use HaXRQ-20151104-fixed.fa for running featureCounts.

score 0 · Answer 2 · 2020-01-30

Did you sort your bam by readname or by coordinates? You can try sorting as per the required input to feature counts.

If your contig is not too large. You can upload the bam in igv and manually count the reads and correlate with featurecounts output.

If it is too large run bedtools coverage to see how many reads overlap within your contig of interest.