Hi, I am really new to bioinformatics so please help me figure out this error. I have tried looking at other threads with similar questions but couldn't resolve my problem.
I get the following error when I use HTSeq (htseq-count) for counting:
Error occured when reading beginning of SAM/BAM file.
('SAM line does not contain at least 11 tab-delimited fields.',
'line 1 of file Sorted_KKUGCTM5_-VE_14-4_aligned.bam')
[Exception type: ValueError, raised in _HTSeq.pyx:1276]
The code I am using is as follows:
/HTSeq-0.6.1/scripts/htseq-count --stranded=no Sorted_KKUGCTM5_-VE_14-4_aligned.bam GRCh38/Homo_sapiens.GRCh38.90.gtf
I ran the command above on the BAM file. I also tried converting the BAM file to a SAM file and running the same command on the SAM file, but then I get a different error:
Warning: Malformed SAM line: MRNM != '*' although flag bit &0x0008 set
Warning: Read 700463F:369:CB93CANXX:5:1103:15415:8299 claims to have an aligned mate which could not be found in an adjacent line.
Error occured when processing SAM input (line 57 of file Sorted_KKUGCTM5_-VE_14-4_aligned1.sam):
'pair_alignments' needs a sequence of paired-end alignments
[Exception type: ValueError, raised in __init__.py:603]
Where am I going wrong? Please help.
Also note that the default input format for htseq-count is sam, which is why the software yelled at you for not providing the right format. Rather than converting your data to .sam, you should tell htseq-count to expect .bam format.
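To illustrate why the first error appears: a BAM file is BGZF-compressed binary, so when htseq-count reads it as SAM text it cannot find 11 tab-delimited fields on "line 1". A rough way to tell the two formats apart, assuming only standard shell tools, is to look for the gzip magic bytes 1f 8b at the start of the file. The helper name looks_like_bam is hypothetical, and any gzip file would pass this check too, so treat it as a hint rather than a real format check.

```shell
# Hypothetical helper: succeed if the file starts with the gzip magic bytes
# (1f 8b). BAM is BGZF (gzip-compatible), plain SAM is text.
looks_like_bam() {
  magic=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
  [ "$magic" = "1f8b" ]
}

# Demonstration on two made-up stand-in files (not real alignments).
tmp=$(mktemp)
printf '\037\213' > "$tmp"            # first two bytes of any gzip/BGZF file
looks_like_bam "$tmp" && echo "compressed: tell htseq-count with -f bam"
printf '@HD\tVN:1.0\n' > "$tmp"       # first line of a plain-text SAM header
looks_like_bam "$tmp" || echo "plain text: SAM"
rm -f "$tmp"
```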
What is the parameter I should introduce to specify that it is a bam file in the input? Shall I add -f bam in my HTseq code?
Thanks, I got it working by updating Python, specifying the input format as BAM, and specifying that the file is sorted by name. Thanks to everyone.
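For later readers, the fix described above might look roughly like the following. This is only a sketch: htseq_format is a hypothetical helper (not part of HTSeq) that picks the -f value from the file extension, the file names are the ones from the question, and the command is printed as a dry run rather than executed.

```shell
# Hypothetical helper (not part of HTSeq): choose the value for
# htseq-count's -f/--format option from the alignment file's extension.
htseq_format() {
  case "$1" in
    *.bam) echo bam ;;
    *.sam) echo sam ;;
    *) echo "unrecognised extension: $1" >&2; return 1 ;;
  esac
}

aln="Sorted_KKUGCTM5_-VE_14-4_aligned.bam"   # file name from the question
gtf="GRCh38/Homo_sapiens.GRCh38.90.gtf"      # annotation from the question

# Dry run: print the command instead of running it.
# -f sets the input format; -r states how the input is sorted (name or pos).
echo htseq-count -f "$(htseq_format "$aln")" -r name --stranded=no "$aln" "$gtf"
```

Drop the echo to actually run the count. Use -r name only if the file really is sorted by read name; for a coordinate-sorted file the value would be pos.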
Thanks for pointing out how I should format my questions; I have done that now. Regarding sorting by position or by name: I believed that converting the SAM file to BAM and then sorting it with samtools sorts by name by default, so I assumed my file was sorted by name because I kept the default parameters. The HTSeq website says that htseq-count also expects name order by default, so I did not add that argument to my command. I have now tried adding it, but the same error still comes up :(
No. The default is coordinate sort; see the samtools sort help.
Can you show what
samtools view -H your.bam | head -5
looks like?
This is what it looks like. Is this the expected outcome? Thanks. I have added -n to my command as well. The above output is what I get after running the command with -n.
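When a header is available, the sort order can often be read from the SO field of its @HD line (coordinate, queryname, unsorted or unknown in the SAM specification). The sketch below assumes that field is present and up to date, which is not guaranteed for files written by older tools. header_sort_order is a hypothetical helper name and the two-line header is made up for illustration.

```shell
# Hypothetical helper: read SAM header text on stdin and print the value of
# the SO field on the @HD line, or "unknown" if there is none.
header_sort_order() {
  awk -F'\t' '
    $1 == "@HD" {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^SO:/) { print substr($i, 4); found = 1 }
    }
    END { if (!found) print "unknown" }
  '
}

# Made-up header for illustration. With a real file, pipe in the output of
#   samtools view -H your.bam
printf '@HD\tVN:1.0\tSO:coordinate\n@SQ\tSN:1\tLN:248956422\n' | header_sort_order
```

A result of queryname would correspond to -r name in htseq-count, and coordinate to -r pos.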
It looks like your file is sorted by read name. So you will need to use the
-r name
option with htseq-count, as noted above by @michael.ante.
Hi Genomax, could you kindly take a moment to explain this to me? When I run the command
Then I get the following output:
First, is this the expected output? Second, does that mean that my BAM file only contains reads that mapped to chromosomes 1, 10, 11, 12, 13, 14, 15, 16, 17? I don't understand what this means...
That is what the header of a bam file should look like. Have you looked up what the 'head' command does?
Thank you for asking me that question. Because of my lack of coding knowledge, it did not even occur to me that head is a command. I looked it up and found that head returns only the first 10 lines by default. I also learnt about the tail, less and more commands. Running less and more, I found all the chromosomes in the output. When I run
Then the output looks like this
This looks fine to me now. Thanks again!
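The confusion above can be reproduced without a real BAM file. The sketch below builds a made-up header (fake_header is a hypothetical helper, and LN:1000 is a placeholder length, not a real chromosome size) and compares what head shows with what the header actually contains. The header also lists reference sequences from the genome index, not only the chromosomes that reads mapped to.

```shell
# Build a made-up header: one @HD line plus one @SQ line per sequence name.
# LN:1000 is a placeholder, not a real chromosome length.
fake_header() {
  printf '@HD\tVN:1.0\tSO:queryname\n'
  for chrom in 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 MT X Y; do
    printf '@SQ\tSN:%s\tLN:1000\n' "$chrom"
  done
}

# head with no option stops after 10 lines: @HD plus the first nine @SQ lines.
fake_header | head | grep -c '^@SQ'

# Counting over the whole header shows every reference sequence.
fake_header | grep -c '^@SQ'
```

With a real file, replacing fake_header with samtools view -H your.bam gives the same comparison.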