HTSeq error when reading BAM file
1
2
Entering edit mode
8.2 years ago
Mike ★ 1.9k

Hi,

I am running HTseq for read count with STAR output BAM file, but I got following errors:

STAR --genomeDir /databank/igenomes/Homo_sapiens/STAR --readFilesIn R1.fastq --runThreadN 8 --outFileNamePrefix R1a --outSAMtype BAM Unsorted

htseq-count -m union -i gene_name R1aAligned.out.bam hg19.gtf > output_usingbam.counts

Error occured when reading first line of sam file.
  ('SAM line does not contain at least 11 tab-delimited fields.', 'line 1 of file R1aAligned.out.bam')
  [Exception type: ValueError, raised in _HTSeq.pyx:1276]

But no erorr when I use default STAR output (SAM ) file in htseq-count :

STAR --genomeDir /databank/igenomes/Homo_sapiens/STAR --readFilesIn R1.fastq --runThreadN 8 --outFileNamePrefix R1a

htseq-count -m union -i gene_name R1aAligned.out.sam hg19.gtf > output_usingsam.counts

it run successfully.

And I also got error when I sorted SAM file.

Please help where is problem.

Thanks

rna-seq htseq • 7.2k views
ADD COMMENT
2
Entering edit mode
8.2 years ago
michael.ante ★ 3.9k

Hi Mike,

according to the HTSeq-count manual, you have to set the -f parameter:

 -f <format>, --format=<format>

     Format of the input data. Possible values are sam (for text SAM files) and bam (for binary BAM files). Default is sam.

By doing so, you also need to specify if your data is either position sorted or name sorted:

-r <order>, --order=<order>

 ...  Use this option, with name or pos for <order> to indicate how the input data has been sorted. The default is name.

Cheers,

Michael

ADD COMMENT
0
Entering edit mode

Thanks Michael,

my bam file is not sorted, so I used following command, but still got error:

htseq-count -m union -i gene_name -f bam R1aAligned.out.bam hg19.gtf > output_usingbam.counts

htseq-count: error: no such option: -f

ADD REPLY
3
Entering edit mode

Maybe you should update your htseq program.

You might also try the gene counting from STAR itself (it works with STAR 2.5.2). Setting the parameter --quantMode GeneCounts will produce a file which "counts coincide with those produced by htseq-count with default parameters (STAR Manual 2.5.2a)".

ADD REPLY
0
Entering edit mode

Thanks,

Updated htseq, but.. still error

100000 GFF lines processed.
200000 GFF lines processed.
300000 GFF lines processed.
400000 GFF lines processed.
500000 GFF lines processed.
600000 GFF lines processed.
700000 GFF lines processed.
800000 GFF lines processed.
900000 GFF lines processed.
1000000 GFF lines processed.
1100000 GFF lines processed.
1200000 GFF lines processed.
1300000 GFF lines processed.
1400000 GFF lines processed.
1500000 GFF lines processed.
1600000 GFF lines processed.
1700000 GFF lines processed.
1800000 GFF lines processed.
1900000 GFF lines processed.
2000000 GFF lines processed.
2100000 GFF lines processed.
2200000 GFF lines processed.
2215898 GFF lines processed.
Error occured when reading beginning of SAM/BAM file.
  file header is empty (mode='rb') - is it SAM/BAM format?
  [Exception type: ValueError, raised in calignmentfile.pyx:574]
ADD REPLY
0
Entering edit mode

Has your SAM file the correct header (samtools view -SH R1aAligned.out.sam) and does it match the gtf's chromosome-names? What version of STAR are you using and what is the output of the R1aLog.final.out?

ADD REPLY
0
Entering edit mode

Thanks, now it works,

it was problem in STAR output sam file,

Could you please check, alignment results, Is it OK?

This is R1aLog.final.out file:

                          Started job on |       Sep 22 14:30:25
                     Started mapping on |       Sep 22 14:56:52
                            Finished on |       Sep 22 14:58:27

Mapping speed, Million of reads per hour | 147.32

                  Number of input reads |       3887667
              Average input read length |       72
                            UNIQUE READS:
           Uniquely mapped reads number |       3180832
                Uniquely mapped reads % |       81.82%
                  Average mapped length |       71.33
               Number of splices: Total |       247696
    Number of splices: Annotated (sjdb) |       0
               Number of splices: GT/AG |       242886
               Number of splices: GC/AG |       1339
               Number of splices: AT/AC |       58
       Number of splices: Non-canonical |       3413
              Mismatch rate per base, % |       0.86%
                 Deletion rate per base |       0.01%
                Deletion average length |       1.67
                Insertion rate per base |       0.01%
               Insertion average length |       1.43
                     MULTI-MAPPING READS:
Number of reads mapped to multiple loci |       388865
     % of reads mapped to multiple loci |       10.00%
Number of reads mapped to too many loci |       11128
     % of reads mapped to too many loci |       0.29%
                          UNMAPPED READS:

% of reads unmapped: too many mismatches | 0.00% % of reads unmapped: too short | 7.66% % of reads unmapped: other | 0.23%

ADD REPLY
1
Entering edit mode

It looks OK; you have ~ 92% mapping rate (82 % uniquely mapped reads + 10% multi-mapping reads). With HTSeq-count only the 82% will be processed.

You can check your raw fastq file for adapter contamination (e.g. with FastQC reporter) and maybe trim if found. You can also check your alignment with RSeQC (gene body coverage; strandedness, read distribution over annotation).

ADD REPLY
0
Entering edit mode

Thank you very much... highly appreciated for your nice explaination.

ADD REPLY
0
Entering edit mode

Hi Mike, I am having the same problem. You mentioned it was the problem of STAR output sam file. So how do you solve it? Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 1899 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6