Question

Error upon running HTseq with paired end aligned BAM (coordinate sorted) files

0

Entering edit mode

6.2 years ago

bnayer26 • 0

I am using HTseq to perform counts and I'm encountering the following error, and my output .txt file is empty.

100000 GFF lines processed.
200000 GFF lines processed.
300000 GFF lines processed.
400000 GFF lines processed.
500000 GFF lines processed.
600000 GFF lines processed.
700000 GFF lines processed.
800000 GFF lines processed.
900000 GFF lines processed.
1000000 GFF lines processed.
1100000 GFF lines processed.
1200000 GFF lines processed.
1300000 GFF lines processed.
1400000 GFF lines processed.
1500000 GFF lines processed.
1600000 GFF lines processed.
1700000 GFF lines processed.
1800000 GFF lines processed.
1900000 GFF lines processed.
2000000 GFF lines processed.
2100000 GFF lines processed.
2200000 GFF lines processed.
2300000 GFF lines processed.
2400000 GFF lines processed.
2500000 GFF lines processed.
2600000 GFF lines processed.
2612843 GFF lines processed.
Error occured when processing SAM input (record #129 in file /extstor/sudhirlab/Reety/Bhavana/Set4_lastAgrigenome/Sorted_KKUGCTM5_+VE_12-5_aligned.bam):
  Sequence of paired-end alignments expected, but got single-end alignment.
  [Exception type: ValueError, raised in __init__.py:761]

My HTseq command was:

python2.7 -m HTSeq.scripts.count -f bam -r pos -s no /extstor/sudhirlab/Reety/Bhavana/Set4_lastAgrigenome/Sorted_KKUGCTM5_+VE_12-5_aligned.bam /extstor/sudhirlab/Reety/Bhavana/Indexed_Genomes/GRCh38/ChromosomeNamed_Gtf_GRCh38.gtf > /extstor/sudhirlab/Reety/Bhavana/Set4_lastAgrigenome/Counts_Sorted_KKUGCTM5_+VE_12-5_aligned.txt

The input file in this HTseq command is a coordinate sorted BAM file which I generated using the following code:

/extstor/softwares/samtools-1.9/samtools sort -@ 20 -o /extstor/sudhirlab/Reety/Bhavana/Set4_lastAgrigenome/Sorted_KKUGCTM5_+VE_12-5_aligned.bam /extstor/sudhirlab/Reety/Bhavana/Set4_lastAgrigenome/KKUGCTM5_+VE_12-5_aligned.bam

At the very beginning, I had used Trimmomatic to perform adapter trimming, followed by alignment by Hisat2 and then I used samtools view command to convert SAM to BAM. Then I performed the above mentioned sorting and counting. Neither my trimming, alignment nor SAM to BAM conversion have given any error in the log files. I wonder why HTseq giving this error.

I found these two threads on biostars but still cannot understand HT-seq error and empty result ----here it says something about processing of clean_data

Error at the end of HT-seq Analysis and output.txt with 0 bytes in size ---- here it has been recommended to try the following code:

samtools view -bf 1 foo.bam > foo.paired-end.bam
samtools view -bF 1 foo.bam > foo.single-end.bam

However I am unable to understand why this is not given in the htseq documentation. https://htseq.readthedocs.io/en/release_0.10.0/count.html. Am I missing something? Why is the above additional code required, could someone please explain that?

I have been struggling with this problem and would appreciate any help! Thank you so much.

UPDATE: Two strange things I found:

When I perform the exact same steps as above, but in the Sorting step, I sort by name by adding the "-n" argument, and then subsequently I use this sorted file in the above HTseq command by adding the "-r name" argument, then I get an output which looks like it has worked. Here is how to output looks like

   Several lines of
        ENSG00000000003 0
        ENSG00000000005 0
        ENSG00000000419 772
        ENSG00000000457 153
        ENSG00000000460 178
        ENSG00000000938 0
        ENSG00000000971 503
        ENSG00000001036 697
        ENSG00000001084 1364 
And in the end of the file this is what I get
        __no_feature    4027949
        __ambiguous 889839
        __too_low_aQual 0
        __not_aligned   503244
        __alignment_not_unique  5989749

Does this look like the expected output? If yes, then why is it working with name sorted BAM file but not the default sorted BAM file?

I also used the same code using raw files (fastq.gz) that were already given to me after adapter trimming. So for those files, I just had two files for each sample <sample1_read1_trimmed> and <sample1_read2_trimmed>. I had used these two reads for alignment with Hisat2, and then SAM to BAM conversion, then sorting of BAM file by coordinate, then I ran the HTseq command and I got an output which looks like the code worked. Here it is:

Several lines of 
    ENSG00000000003 0
    ENSG00000000005 0
    ENSG00000000419 1504
    ENSG00000000457 409
    ENSG00000000460 224
    ENSG00000000938 0
    ENSG00000000971 827
 And in the end of the file this is what I get
    __no_feature    3952696
    __ambiguous 2186111
    __too_low_aQual 1112784
    __not_aligned   235455
    __alignment_not_unique  1318607

So now I am confused that how is HTseq working with coordinate-sorted BAM file from these raw files, while it is not working for the other coordinate-sorted BAM files, where I performed my own trimming with Trimmomatic. The trimming was successful and gave no error. The alignment code for these trimmed files was slightly different because I added the -U "unpaired_trimmed" in the alignment command as suggested by Hisat2 documentation for trimmomatic outputs. However, the alignments were all successful with >95% alignment. The resulting SAM file was converted to BAM file and then the BAM file looks fine too. Here is what I get when I run the following command for my unsorted BAM file:

samtools view -H KKUGCTM5_-VE_14-4_aligned.bam | less

@HD     VN:1.0  SO:unsorted
@SQ     SN:chr1 LN:248956422
@SQ     SN:chr10        LN:133797422
@SQ     SN:chr11        LN:135086622
@SQ     SN:chr12        LN:133275309
@SQ     SN:chr13        LN:114364328
@SQ     SN:chr14        LN:107043718
@SQ     SN:chr15        LN:101991189
@SQ     SN:chr16        LN:90338345
@SQ     SN:chr17        LN:83257441
@SQ     SN:chr18        LN:80373285
@SQ     SN:chr19        LN:58617616
@SQ     SN:chr2 LN:242193529
@SQ     SN:chr20        LN:64444167
@SQ     SN:chr21        LN:46709983
@SQ     SN:chr22        LN:50818468
@SQ     SN:chr3 LN:198295559
@SQ     SN:chr4 LN:190214555
@SQ     SN:chr5 LN:181538259
@SQ     SN:chr6 LN:170805979
@SQ     SN:chr7 LN:159345973
@SQ     SN:chr8 LN:145138636
@SQ     SN:chr9 LN:138394717
@SQ     SN:chrX LN:156040895
@SQ     SN:chrY LN:57227415
@SQ     SN:chrMT        LN:16569
@PG     ID:hisat2       PN:hisat2       VN:2.1.0        CL:"/extstor/softwares/hisat2-2.1.0/hisat2-align-s --wrapper basic-0 -q -p 8 --min-intronlen 50 --max-intronlen 250000--dta-cufflinks --new-summary --summary-file ./summary_KKUGCTM5_-VE_14-4.txt -x /extstor/sudhirlab/Reety/Bhavana/Indexed_Genomes/GRCh38/Index_Hisat_GRCh38 -S ./KKUGCTM5_-VE_14-4_aligned.sam -1 /tmp/212690.inpipe1 -2 /tmp/212690.inpipe2 -U /tmp/212690.unp"

HTSeq RNAseq • 3.3k views

ADD COMMENT • link updated 6.2 years ago by h.mon 35k • written 6.2 years ago by bnayer26 • 0

0

Entering edit mode

Is that the output from a single 'run' ? it looks like the upper part is a copy of the lower part

ADD REPLY • link 6.2 years ago by lieven.sterck 15k

0

Entering edit mode

Oh sorry, I had run two samples together and this was what came in the error log file for both. I will delete the lower part so that it reflects what happened with one sample only. Thanks for pointing that out.

ADD REPLY • link 6.2 years ago by bnayer26 • 0

0

Entering edit mode

I have done the needful. I have also updated the bottom part of my question. If you could help me figure this out, I'd be really grateful. Thanks.

ADD REPLY • link 6.2 years ago by bnayer26 • 0

score 1 · Answer 1 · 2019-02-28

1

Entering edit mode

6.2 years ago

h.mon 35k

You are mixing single-end and paired-end reads in the same alignment file, and HTSeq does not handle this. You already asked this, and I pointed this to you - but you deleted that question.

Look at the @PG line from your bam file (I wrapped it to increase readability):

@PG     ID:hisat2       PN:hisat2       VN:2.1.0        
CL:"/extstor/softwares/hisat2-2.1.0/hisat2-align-s --wrapper basic-0 -q -p 8 --min-intronlen 50 
      --max-intronlen 250000--dta-cufflinks --new-summary --summary-file ./summary_KKUGCTM5_-VE_14-4.txt 
      -x /extstor/sudhirlab/Reety/Bhavana/Indexed_Genomes/GRCh38/Index_Hisat_GRCh38 -S ./KKUGCTM5_-VE_14-4_aligned.sam 
      -1 /tmp/212690.inpipe1 -2 /tmp/212690.inpipe2 -U /tmp/212690.unp"

If you want to quantify read counts, do not mix single-end and paired-end in the same file.

Here is my previous comment:

I think HTSeq has no problem with single-end reads per se, the problem is mixing single-end and paired-end in the same file. Even if HTSeq didn't have a problem with this, you shouldn't mix single- and paired-end counts, because you may have a batch effect, due to the different "mappabilities" of each type of read.

ADD COMMENT • link 6.2 years ago by h.mon 35k

0

Entering edit mode

Thanks h.mon, yes I had read your comment on my previous question and I had responded to you as well. Sorry I deleted it eventually because I had mixed two questions in that one, and once I wrote this question down more clearly, the previous one was redundant. I understand when you talk about HTseq not handling single end and paired end reads; thanks for pointing that out...although I was reading the HTseq documentation and couldn't find any explanation for why this is the case. I still have two questions: (1) if HTseq is unable to process a file containing both paired end and single end reads, then how is HTseq able to give me an output when I put a name-sorted BAM file as my input?, (2) My data was originally paired-end only, I didn't do any single-end sequencing. I got a R1 and R2 for each sample, which I input in Trimmomatic and Trimmomatic gave me both paired and unpaired outputs for each read. Then I put all these into the Hisat2 alignment code as was suggested in the Hisat2 documentation. So what part of this process is problematic for Htseq?

ADD REPLY • link 6.2 years ago by bnayer26 • 0

1

Entering edit mode

Usually, the proportion of single-end reads after quality trimming and adapter removal is negligible compared to the remaining paired-end reads - if this is not the case, either you are trimming too aggressively, or your data is problematic. So I just discard the single-end remnants of the trimming process, and quantify only the paired-end reads.

So what part of this process is problematic for Htseq?

This is a question you should ask HTSeq authors. It is not so simple to count reads, there are a couple of issues already reported at the github issue tracker, if you read them, you will get a better understanding of the complexities around read counting:

Minor issue with position sorted bam missing mate

Issues with "Mate records missing" with position-sorted bam files

ADD REPLY • link 6.2 years ago by h.mon 35k

0

Entering edit mode

h.mon, I forgot to reply to your msg above but thanks a lot for sending those links - it really helped. I could not still get the HTseq to work seamlessly but it gives me some solace to know that the developer of the program himself recognises this as a problem. I will try other methods now.

ADD REPLY • link 6.1 years ago by bnayer26 • 0

0

Entering edit mode

Also, what does the @PG line indicate? I do not understand everything in that code per se, so if you please point out what exactly you were referring to, that would help. thank you :)

Also, how do you suggest I proceed with my alignment files in that case? Does any other counting software like featureCounts deal with this type of input better? Or can I add something to make my HTseq code run for the coordinate sorted BAM files?

For now, I am sorting my BAM files by name, then using that for HTSeq input file. Somehow this seems to work.

ADD REPLY • link 6.2 years ago by bnayer26 • 0

1

Entering edit mode

@PG is a header section with information on the program(s) used to create the SAM file. From the SAM format specifications:

@PG Program.

It supports several tags, like ID: (program record identifier), PN: (program name), CL (command line), and others. I used the CL tag to find out how you mapped your reads, and I found out you mapped single-end and paired-end reads simultaneously (-1 /tmp/212690.inpipe1 -2 /tmp/212690.inpipe2 -U /tmp/212690.unp).

ADD REPLY • link 6.2 years ago by h.mon 35k

score 0 · Answer 2 · 2019-02-28

0

Entering edit mode

6.2 years ago

swbarnes2 14k

Did you try running the little samtools lines of code? Did you look up what those lines do? They aren't in the tutorial, because troubleshooting problems isn't something that can really be covered in a tutorial.

ADD COMMENT • link 6.2 years ago by swbarnes2 14k