Question

Help with htseq -count read counts

0

Entering edit mode

16 months ago

Juan Pablo • 0

Hello

I am doing a transcriptome analysis on Pseudomonas putida and I have been trying to do a read count using Htseq -count. The program always give an error. I have tried different genome references (fna) and annotation files (gtf ang gff) but it does not work.

The mapping works fine it seems but the read count doesnt.

I have run the script:

htseq-count -i gene_id -t CDS test_v2.sam ncbi_dataset_v2/ncbi_dataset/data/GCA_000007565.2/genomic.gtf > test_v2.count

I got the error:

built-in function delete__Pair_long_obj> returned a result with an error set
  [Exception type: SystemError, raised in _HTSeq_internal.py:43]

If I include other options of htseq-count like "-s no" or "-m intersection-nonempty" Is the same error.

If I run only:

htseq-count test_v2.sam ncbi_dataset_v2/ncbi_dataset/data/GCA_000007565.2/genomic.gtf > test_v2.count

The same error.

So it must be the files compatibility? is there something wrong with the Pseudomonas putida gtf file?

This is an example of a couple of lines in the gtf file:

AE015451.2      Genbank gene    147     1019    .       -       .       gene_id "PP_0001"; transcript_id ""; gbkey "Gene"; gene "parB"; gene_biotype "protein_coding"; locus_tag "PP_0001"; 

AE015451.2      Genbank CDS     150     1019    .       -       0       gene_id "PP_0001"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; gene "parB"; locus_tag "PP_0001"; note "Evidence 2a : Function of homologous gene experimentally demonstrated in an other organism; PubMedId : 16306995, 17462018, 17729285; Product type pf : putative factor; Biologicalprocesses : Replicate"; product "probable chromosome-partitioning protein"; protein_id "AAN65635.1"; transl_table "11"; exon_number "1";

This is how the fna file (I used for doing the mapping and creating the sam file) looks like:

>AE015451.2 Pseudomonas putida KT2440 complete genome
AACTGCTCCTCGGAAGTCGACCAACAAGTCAGCTATGACTTGGCATAATTTGTGCCGACAAAATGCGCGCAGAGTATAGG
GGTGGATTAACCCCTATTCAACTCTTCGGTAGTGATTTCCGACTTCACGCTACAACAGGAATTGTTTCAGCGGATGTGAG
CGAGCACGCCTTGCAACTCGTCAAGCGAGTTGTAGCGAATAACCAACTGGCCTTTGCCCTTGTTGCCATGACGGATCTGC
ACGGCCGAGCCCAGGCGCTCTGCGAGCCGCTGTTCAAGGCGTGCGATATCCGGATCAGGTTTGCTCGGTTCGACCGGATC

This is how the sam file looks like:

@HD     VN:1.0  SO:unsorted
@SQ     SN:AE015451.2   LN:6181873
@PG     ID:hisat2       PN:hisat2       VN:2.1.0        CL:"/var/bin/hisat2-2.1.0/hisat2-align-s --wrapper basic-0 -x genome_pseudomonas_putida_v2 -S test_v2.sam --summary-file test_v2.sam.summary -1 ../NG-33893_Exp2_PP_25h_b_lib709417_10278_4_1_val_1.fq -2 ../NG-33893_Exp2_PP_25h_b_lib709417_10278_4_2_val_2.fq"
A00604:565:HFWKVDSX7:4:1101:4689:1000   83      AE015451.2      179677  1       150M1S  =       174221  -5608   AGGAGGTTGGCTTAGAAGCAGCCACCCTTTAAAGAAAGCGTAATAGCTCACTAGTCGAGTCGGCCTGCGCGGAAGATGTAACGGGGCTCAAACCATGCACCGAAGCTACGGGTATCATCTTATGATGATGCGGTAGAGGAGCGTTCTGTAN FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF# AS:i:-1 ZS:i:-1 XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:150        YS:i:-1 YT:Z:CP NH:i:5

What I am missing in my code ? is the id of the genomic feature incorrect?? Are the sam and gtf incompatible?

Thank you very much for your help.

sam gtf RNA-seq htseq • 924 views

ADD COMMENT • link updated 16 months ago by rfran010 ★ 1.3k • written 16 months ago by Juan Pablo • 0

score 1 · Answer 1 · 2023-07-19

1

Entering edit mode

16 months ago

rfran010 ★ 1.3k

Could your sam file need sorting?

-r <order>, --order=<order>
For paired-end data, the alignment have to be sorted either by read name or by alignment position. If your data is not sorted, use the samtools sort function of samtools to sort it. Use this option, with name or pos for <order> to indicate how the input data has been sorted. The default is name.

ADD COMMENT • link 16 months ago by rfran010 ★ 1.3k

0

Entering edit mode

Hi

Thanks for the suggestion.. I tried both kinds of sorting.. by name and by aligment (coordinate) but the same error persist when try to use htseq:

htseq-count -t CDS -i gene_id sorted_testv2.sam ncbi_dataset_v2/ncbi_dataset/data/GCA_000007565.2/genomic.gtf > test_v2.count <built-in function delete__Pair_long_obj> returned a result with an error set [Exception type: SystemError, raised in _HTSeq_internal.py:43]

This is what I did:

samtools sort -o sorted_testv2.sam test_v2.sam

I got:

@HD VN:1.0 SO:coordinate @SQ SN:AE015451.2 LN:6181873 @PG ID:hisat2 PN:hisat2 VN:2.1.0 CL:"/var/bin/hisat2-2.1.0/hisat2-align-s --wrapper basic-0 -x genome_pseudomonas_putida_v2 -S test_v2.sam --summary-file test_v2.sam.summary -1 ../NG-33893_Exp2_PP_25h_b_lib709417_10278_4_1_val_1.fq -2 ../NG-33893_Exp2_PP_25h_b_lib709417_10278_4_2_val_2.fq" A00604:565:HFWKVDSX7:4:1257:6686:25817 99 AE015451.2 1 60 24S126M = 41 213 GTTTGGCGTGCGTATTTCCATCACAACTGCTCCTCGGAAGTCGACCAACAAGTCAGCTATGACTTGGCATAATTTGTGCCGACAAAATGCGCGCAGAGTATAGGGGTGGATTAACCCCTATTCAACTCTTCGGTAGTGATTTCCGACTTC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

samtools sort -n -o sorted2_testv2.sam test_v2.sam

I got:

@HD VN:1.0 SO:queryname @SQ SN:AE015451.2 LN:6181873 @PG ID:hisat2 PN:hisat2 VN:2.1.0 CL:"/var/bin/hisat2-2.1.0/hisat2-align-s --wrapper basic-0 -x genome_pseudomonas_putida_v2 -S test_v2.sam --summary-file test_v2.sam.summary -1 ../NG-33893_Exp2_PP_25h_b_lib709417_10278_4_1_val_1.fq -2 ../NG-33893_Exp2_PP_25h_b_lib709417_10278_4_2_val_2.fq" A00604:565:HFWKVDSX7:4:1101:1009:3270 69 AE015451.2 700109 0 * = 700109 0 TAAGACTCGCTTTCGCTACGCCTCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNANNNAGTAGGCTCCCACTGCTTGTACGCATACGGGTTCAGGTTCTATTTAACTCCCC FFFFFFFF:F:FF,:F:FFFFF::######################################################################F###F::FF:F,F,FFFFF,F:,F,,F::FF:,F,,FFF,FFFFFFFFF:FFFFFFF YT:Z:UP YF:Z:NS A00604:565:HFWKVDSX7:4:1101:1009:3270 137 AE015451.2 700109 1 1S150M = 700109 0 NGGGACCAGCCCTTAAGTTGATTTGAGATTAGTGGAACGCTCTGGAAAGTGCGGCCAAAGTGGGTGATAGCCCCGTACACGAAAATCTCTTGTCAATGAAATCGAGTAGGACGGAGCACGAGAAACTTTGTCTGAATATGGGGGGACCATC #FF:FF:FFFFF:FFFFFFFF,F,FFFFFFFFFFFFFF,FFF:FF::FF,FFFFFFF,FFFFFFFF:FF:FFFFFF,FFFFFFFFFF,FFFF,FFFFFF:FFFFF:FFFFFFF:FFFFFFFFFFFFFFF,FFF,FFFF:,:FFFFF:FFFF AS:i:-4 ZS:i:-4 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:56T93 YT:Z:UP NH:i:5

ADD REPLY • link 16 months ago by Juan Pablo • 0

1

Entering edit mode

I don't see the -r option added to the htseq command?

ADD REPLY • link 16 months ago by rfran010 ★ 1.3k

0

Entering edit mode

Hi!

Thanks again for the answer.

Yes, I missed that part but when I include it it does not work either and produce the same error:

htseq-count -r pos -t CDS -i gene_id sorted_testv2.sam ncbi_dataset_v2/ncbi_dataset/data/GCA_000007565.2/genomic.gtf

built-in function delete__Pair_long_obj> returned a result with an error set [Exception type: SystemError, raised in _HTSeq_internal.py:43]

I was also suggested to use exon instead of CDS and write the code like this:

htseq-count -f sam -i gene_id -r pos -t exon sorted_testv2.sam ncbi_dataset_v2/ncbi_dataset/data/GCA_000007565.2/genomic.gtf > test_v2.count

I got the same error:

built-in function delete__Pair_long_obj> returned a result with an error set

[Exception type: SystemError, raised in _HTSeq_internal.py:43]

Any other idea?

Kind regards

ADD REPLY • link 16 months ago by Juan Pablo • 0

0

Entering edit mode

Does the test work?

ADD REPLY • link 16 months ago by rfran010 ★ 1.3k