Entering edit mode
14 months ago
Bjorn
•
0
Hello. I ran this HTseq command
htseq-count -r name -t gene -i gene -s yes -f bam /Volumes/cachannel/ZebraFinchBrain/CB-4a_genomemapping/sorted_alignmentcb4a.bam /Volumes/cachannel/ZebraFinchBrain/GCF_003957565.2/ncbi_dataset/data/GCF_003957565.2/genomic.gff > /Volumes/cachannel/ZebraFinchBrain/HTSEQ_withautomate/output_counts.txt
and got the error
Warning: 72583723 reads with missing mate encountered.
80015507 alignment record pairs processed.
Is there a setting I am missing to process this differently?
Did you name re-sort your BAM file (as discussed in other thread)? You are using
-r name
in this command line. The error above otherwise makes no sense.Did you by chance trim/scan the paired-end reads that went into the alignment independently? If so your paired-end files were likely out of sync.
Hey, I did try this changed command.
And I still got this
I believe that I scanned these together. I used this code for that process
Would this cause them to be out of sync somehow?
The reads should be in sync since you did trim them as paired files.
Have you looked at these alignments in a genome viewer? Do the alignments look reasonable i.e. reads piling up under exons. Is every gene in your file 0 counts? Are the identifiers for chromosomes matching in your reference and annotation file?
Yes, every gene in my file has 0 counts.
I believe you have helped my identify the problem. I created a reference genome of .ht2 files from a .fna file that has the identifiers
But the .gff file I have has NC_044211.2 as the identifiers.
Is there a way to account for this? Or will I have to find a .fna file that has the same identifiers, make a new genome, rerun the alignment, then run the read counting?
This is not going to work with any program (featureCounts or htseq-count). Your chromosome ID's need to match in your reference and the annotation.
I see a good reference genome for the finch here: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_003957565.2/
Click on the
Download
button and getfasta
sequence and theGTF
file. Everything should match in this case. You will need to re-index the genome and re-do the alignments.Thank you. I will focus on making a new genome index. However, I am running into a problem with the hisat2-build funtion where it is not recognizing my A nucleotides?
This is my input hisat2-build -p 16 /Volumes/cachannel/ZebraFinchBrain/GCF_003957565.2/ncbi_dataset/data/GCF_003957565.2/GCF_003957565.2_bTaeGut1.4.pri_genomic.fna /Volumes/cachannel/ZebraFinchBrain/reftwo/referencegenome2
This is my output after a while of building
Exited GFM loop fchr[A]: 0 fchr[C]: 306050462 fchr[G]: 526435244 fchr[T]: 746725372 fchr[$]: 1052636474 Exiting GFM::buildToDisk() Returning from initFromVector Wrote 355107337 bytes to primary GFM file: /Volumes/cachannel/ZebraFinchBrain/reftwo/referencegenome2.1.ht2 Wrote 263159124 bytes to secondary GFM file: /Volumes/cachannel/ZebraFinchBrain/reftwo/referencegenome2.2.ht2 Re-opening _in1 and _in2 as input streams Returning from GFM constructor Returning from initFromVector Wrote 462917771 bytes to primary GFM file: /Volumes/cachannel/ZebraFinchBrain/reftwo/referencegenome2.5.ht2 Wrote 267955020 bytes to secondary GFM file: /Volumes/cachannel/ZebraFinchBrain/reftwo/referencegenome2.6.ht2 Re-opening _in5 and _in5 as input streams Returning from HGFM constructor Headers: len: 1052636474 gbwtLen: 1052636475 nodes: 1052636475 sz: 263159119 gbwtSz: 263159119 lineRate: 6 offRate: 4 offMask: 0xfffffff0 ftabChars: 10 eftabLen: 0 eftabSz: 0 ftabLen: 1048577 ftabSz: 4194308 offsLen: 65789780 offsSz: 263159120 lineSz: 64 sideSz: 64 sideGbwtSz: 48 sideGbwtLen: 192 numSides: 5482482 numLines: 5482482 gbwtTotLen: 350878848 gbwtTotSz: 350878848 reverse: 0 linearFM: Yes
This did eliminate the missing mate error though.
Do yourself a favor and use featureCounts, it is notably faster, more flexible and can sort reads automatically if paired-end is set and sort order is position.
Thank you. I am looking into this now.