Question

TCGA Analysis - Generating Count files using htseq

0

Entering edit mode

9.4 years ago

dgtaylor13 • 0

Hello Biostars community,

I am new to TCGA studies, and I would like to generate count files from not strand specific, paired end read bam files.

First, I sorted my bam files and converted them to sam files:

samtools sort -n data.bam data_sorted
samtools view -h data_sorted.bam > data_sorted.sam

Then I ran htseq-count:

htseq-count --stranded=no data_sorted.sam annotation.gff3 > data_sorted_counts.txt

When I execute this command, I receive the following output and error:

/u/local/python/2.6/lib64/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_
64.egg/HTSeq/__init__.py:8: RuntimeWarning: numpy.flatiter size changed, may ind
icate binary incompatibility
  from _HTSeq import *
100000 GFF lines processed.
200000 GFF lines processed.
300000 GFF lines processed.
400000 GFF lines processed.
500000 GFF lines processed.
600000 GFF lines processed.
700000 GFF lines processed.
800000 GFF lines processed.
900000 GFF lines processed.
1000000 GFF lines processed.
1100000 GFF lines processed.
1200000 GFF lines processed.
1300000 GFF lines processed.
1400000 GFF lines processed.
1500000 GFF lines processed.
1600000 GFF lines processed.
1700000 GFF lines processed.
1800000 GFF lines processed.
1900000 GFF lines processed.
2000000 GFF lines processed.
2100000 GFF lines processed.
2200000 GFF lines processed.
2300000 GFF lines processed.
2400000 GFF lines processed.
2500000 GFF lines processed.
2600000 GFF lines processed.
2615566 GFF lines processed.
Error occured when reading first line of sam file.
Error: ("Malformed SAM line: MRNM == '*' although flag bit &0x0008 cleared", 'li
ne 28 of file data_sorted.sam')
[Exception type: ValueError, raised in _HTSeq.pyx:1262]

So, I searched the error and found this link on biostars: Error in htseq-count: "Malformed SAM line: MRNM == '*' although flag bit &0x0008 cleared"

I followed this protocol and executed the awk command. During the counting, I now receive warnings saying that each read claims to have a paired end partner that it cannot find. It then asks me if the file is sorted (which it is). This is a sample of the output I receive:

/u/local/python/2.6/lib64/python2.6/site-packages/HTSeq-0.5.3p3-py2.6-linux-x86_
64.egg/HTSeq/__init__.py:8: RuntimeWarning: numpy.flatiter size changed, may ind
icate binary incompatibility
  from _HTSeq import *
100000 GFF lines processed.
200000 GFF lines processed.
300000 GFF lines processed.
400000 GFF lines processed.
500000 GFF lines processed.
600000 GFF lines processed.
700000 GFF lines processed.
800000 GFF lines processed.
900000 GFF lines processed.
1000000 GFF lines processed.
1100000 GFF lines processed.
1200000 GFF lines processed.
1300000 GFF lines processed.
1400000 GFF lines processed.
1500000 GFF lines processed.
1600000 GFF lines processed.
1700000 GFF lines processed.
1800000 GFF lines processed.
1900000 GFF lines processed.
2000000 GFF lines processed.
2100000 GFF lines processed.
2200000 GFF lines processed.
2300000 GFF lines processed.
2400000 GFF lines processed.
2500000 GFF lines processed.
2600000 GFF lines processed.
2615566 GFF lines processed.
Warning: Skipping read 'UNC12-SN629_110:3:1101:3587:177592/1', because chromosome 'chrM_rCRS', to which it has been aligned, did not appear in the GFF file.
Warning: Read UNC12-SN629_110:3:1101:3587:199636/1 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
Warning: Read UNC12-SN629_110:3:1101:3588:15274/1 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
Warning: Skipping read 'UNC12-SN629_110:3:1101:3588:15274/1', because chromosome 'chrM_rCRS', to which it has been aligned, did not appear in the GFF file.
Warning: Read UNC12-SN629_110:3:1101:3588:21785/1 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
Warning: Read UNC12-SN629_110:3:1101:3588:99166/1 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
Warning: Read UNC12-SN629_110:3:1101:3588:138896/1 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)

It is to my understanding that the bam files from the site include unmapped reads. I have also tried removing these with:

samtools view -b -f 0x2 data.bam > data_mappedPairs.bam

and following my previous protocol to no avail.

I have no idea what to try next. Any help is greatly appreciated.

RNA-Seq • 3.2k views

ADD COMMENT • link updated 6.4 years ago by Biostar 20 • written 9.4 years ago by dgtaylor13 • 0

1

Entering edit mode

Unrelated comment: you don't have to convert your bam files to sam to use in htseq-count, you can use -f bam option. Might save you some time/disk space.

ADD REPLY • link 9.4 years ago by WouterDeCoster 48k

0

Entering edit mode

To be honest, if I were you, I would just download the raw sequencing files and reprocess them myself from scratch.

ADD REPLY • link 9.4 years ago by jotan ★ 1.3k

0

Entering edit mode

hi, chrM_rCRS seems to be a non-standard version of mito. seq.. In any case you wouldn't find that in the GTF/ GFF file you are using. One solution is to use samtools view to extract ONLY those reads that have mapped to the standard/ primary chromosomes.

Also, since HTSeq works with name-sorted BAMs, so it would expect BOTH pairs to be present. If you are filtering out unmapped reads, then also ensure that pairs with both mates mapped are kept only. Can be accomplished as -

samtools view -F 12

(using only 4 would just filter unmapped, potentially leaving unpaired mates. Addin 8 to it would ensure that both mates mapped pairs are kept only.)

ADD REPLY • link 9.4 years ago by Amitm ★ 2.3k