I am using HTseq to perform counts and I'm encountering the following error, and my output .txt file is empty.
100000 GFF lines processed.
200000 GFF lines processed.
300000 GFF lines processed.
400000 GFF lines processed.
500000 GFF lines processed.
600000 GFF lines processed.
700000 GFF lines processed.
800000 GFF lines processed.
900000 GFF lines processed.
1000000 GFF lines processed.
1100000 GFF lines processed.
1200000 GFF lines processed.
1300000 GFF lines processed.
1400000 GFF lines processed.
1500000 GFF lines processed.
1600000 GFF lines processed.
1700000 GFF lines processed.
1800000 GFF lines processed.
1900000 GFF lines processed.
2000000 GFF lines processed.
2100000 GFF lines processed.
2200000 GFF lines processed.
2300000 GFF lines processed.
2400000 GFF lines processed.
2500000 GFF lines processed.
2600000 GFF lines processed.
2612843 GFF lines processed.
Error occured when processing SAM input (record #129 in file /extstor/sudhirlab/Reety/Bhavana/Set4_lastAgrigenome/Sorted_KKUGCTM5_+VE_12-5_aligned.bam):
Sequence of paired-end alignments expected, but got single-end alignment.
[Exception type: ValueError, raised in __init__.py:761]
My HTseq command was:
python2.7 -m HTSeq.scripts.count -f bam -r pos -s no /extstor/sudhirlab/Reety/Bhavana/Set4_lastAgrigenome/Sorted_KKUGCTM5_+VE_12-5_aligned.bam /extstor/sudhirlab/Reety/Bhavana/Indexed_Genomes/GRCh38/ChromosomeNamed_Gtf_GRCh38.gtf > /extstor/sudhirlab/Reety/Bhavana/Set4_lastAgrigenome/Counts_Sorted_KKUGCTM5_+VE_12-5_aligned.txt
The input file in this HTseq command is a coordinate sorted BAM file which I generated using the following code:
/extstor/softwares/samtools-1.9/samtools sort -@ 20 -o /extstor/sudhirlab/Reety/Bhavana/Set4_lastAgrigenome/Sorted_KKUGCTM5_+VE_12-5_aligned.bam /extstor/sudhirlab/Reety/Bhavana/Set4_lastAgrigenome/KKUGCTM5_+VE_12-5_aligned.bam
At the very beginning, I had used Trimmomatic to perform adapter trimming, followed by alignment by Hisat2 and then I used samtools view command to convert SAM to BAM. Then I performed the above mentioned sorting and counting. Neither my trimming, alignment nor SAM to BAM conversion have given any error in the log files. I wonder why HTseq giving this error.
I found these two threads on biostars but still cannot understand HT-seq error and empty result ----here it says something about processing of clean_data
Error at the end of HT-seq Analysis and output.txt with 0 bytes in size ---- here it has been recommended to try the following code:
samtools view -bf 1 foo.bam > foo.paired-end.bam
samtools view -bF 1 foo.bam > foo.single-end.bam
However I am unable to understand why this is not given in the htseq documentation. https://htseq.readthedocs.io/en/release_0.10.0/count.html. Am I missing something? Why is the above additional code required, could someone please explain that?
I have been struggling with this problem and would appreciate any help! Thank you so much.
UPDATE: Two strange things I found:
When I perform the exact same steps as above, but in the Sorting step, I sort by name by adding the "-n" argument, and then subsequently I use this sorted file in the above HTseq command by adding the "-r name" argument, then I get an output which looks like it has worked. Here is how to output looks like
Several lines of
ENSG00000000003 0
ENSG00000000005 0
ENSG00000000419 772
ENSG00000000457 153
ENSG00000000460 178
ENSG00000000938 0
ENSG00000000971 503
ENSG00000001036 697
ENSG00000001084 1364
And in the end of the file this is what I get
__no_feature 4027949
__ambiguous 889839
__too_low_aQual 0
__not_aligned 503244
__alignment_not_unique 5989749
Does this look like the expected output? If yes, then why is it working with name sorted BAM file but not the default sorted BAM file?
I also used the same code using raw files (fastq.gz) that were already given to me after adapter trimming. So for those files, I just had two files for each sample <sample1_read1_trimmed> and <sample1_read2_trimmed>. I had used these two reads for alignment with Hisat2, and then SAM to BAM conversion, then sorting of BAM file by coordinate, then I ran the HTseq command and I got an output which looks like the code worked. Here it is:
Several lines of
ENSG00000000003 0
ENSG00000000005 0
ENSG00000000419 1504
ENSG00000000457 409
ENSG00000000460 224
ENSG00000000938 0
ENSG00000000971 827
And in the end of the file this is what I get
__no_feature 3952696
__ambiguous 2186111
__too_low_aQual 1112784
__not_aligned 235455
__alignment_not_unique 1318607
So now I am confused that how is HTseq working with coordinate-sorted BAM file from these raw files, while it is not working for the other coordinate-sorted BAM files, where I performed my own trimming with Trimmomatic. The trimming was successful and gave no error. The alignment code for these trimmed files was slightly different because I added the -U "unpaired_trimmed" in the alignment command as suggested by Hisat2 documentation for trimmomatic outputs. However, the alignments were all successful with >95% alignment. The resulting SAM file was converted to BAM file and then the BAM file looks fine too. Here is what I get when I run the following command for my unsorted BAM file:
samtools view -H KKUGCTM5_-VE_14-4_aligned.bam | less
@HD VN:1.0 SO:unsorted
@SQ SN:chr1 LN:248956422
@SQ SN:chr10 LN:133797422
@SQ SN:chr11 LN:135086622
@SQ SN:chr12 LN:133275309
@SQ SN:chr13 LN:114364328
@SQ SN:chr14 LN:107043718
@SQ SN:chr15 LN:101991189
@SQ SN:chr16 LN:90338345
@SQ SN:chr17 LN:83257441
@SQ SN:chr18 LN:80373285
@SQ SN:chr19 LN:58617616
@SQ SN:chr2 LN:242193529
@SQ SN:chr20 LN:64444167
@SQ SN:chr21 LN:46709983
@SQ SN:chr22 LN:50818468
@SQ SN:chr3 LN:198295559
@SQ SN:chr4 LN:190214555
@SQ SN:chr5 LN:181538259
@SQ SN:chr6 LN:170805979
@SQ SN:chr7 LN:159345973
@SQ SN:chr8 LN:145138636
@SQ SN:chr9 LN:138394717
@SQ SN:chrX LN:156040895
@SQ SN:chrY LN:57227415
@SQ SN:chrMT LN:16569
@PG ID:hisat2 PN:hisat2 VN:2.1.0 CL:"/extstor/softwares/hisat2-2.1.0/hisat2-align-s --wrapper basic-0 -q -p 8 --min-intronlen 50 --max-intronlen 250000--dta-cufflinks --new-summary --summary-file ./summary_KKUGCTM5_-VE_14-4.txt -x /extstor/sudhirlab/Reety/Bhavana/Indexed_Genomes/GRCh38/Index_Hisat_GRCh38 -S ./KKUGCTM5_-VE_14-4_aligned.sam -1 /tmp/212690.inpipe1 -2 /tmp/212690.inpipe2 -U /tmp/212690.unp"
Is that the output from a single 'run' ? it looks like the upper part is a copy of the lower part
Oh sorry, I had run two samples together and this was what came in the error log file for both. I will delete the lower part so that it reflects what happened with one sample only. Thanks for pointing that out.
I have done the needful. I have also updated the bottom part of my question. If you could help me figure this out, I'd be really grateful. Thanks.