Hello
I am doing a transcriptome analysis on Pseudomonas putida and I have been trying to do a read count using Htseq -count. The program always give an error. I have tried different genome references (fna) and annotation files (gtf ang gff) but it does not work.
The mapping works fine it seems but the read count doesnt.
I have run the script:
htseq-count -i gene_id -t CDS test_v2.sam ncbi_dataset_v2/ncbi_dataset/data/GCA_000007565.2/genomic.gtf > test_v2.count
I got the error:
built-in function delete__Pair_long_obj> returned a result with an error set
[Exception type: SystemError, raised in _HTSeq_internal.py:43]
If I include other options of htseq-count like "-s no" or "-m intersection-nonempty" Is the same error.
If I run only:
htseq-count test_v2.sam ncbi_dataset_v2/ncbi_dataset/data/GCA_000007565.2/genomic.gtf > test_v2.count
The same error.
So it must be the files compatibility? is there something wrong with the Pseudomonas putida gtf file?
This is an example of a couple of lines in the gtf file:
AE015451.2 Genbank gene 147 1019 . - . gene_id "PP_0001"; transcript_id ""; gbkey "Gene"; gene "parB"; gene_biotype "protein_coding"; locus_tag "PP_0001";
AE015451.2 Genbank CDS 150 1019 . - 0 gene_id "PP_0001"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; gene "parB"; locus_tag "PP_0001"; note "Evidence 2a : Function of homologous gene experimentally demonstrated in an other organism; PubMedId : 16306995, 17462018, 17729285; Product type pf : putative factor; Biologicalprocesses : Replicate"; product "probable chromosome-partitioning protein"; protein_id "AAN65635.1"; transl_table "11"; exon_number "1";
This is how the fna file (I used for doing the mapping and creating the sam file) looks like:
>AE015451.2 Pseudomonas putida KT2440 complete genome
AACTGCTCCTCGGAAGTCGACCAACAAGTCAGCTATGACTTGGCATAATTTGTGCCGACAAAATGCGCGCAGAGTATAGG
GGTGGATTAACCCCTATTCAACTCTTCGGTAGTGATTTCCGACTTCACGCTACAACAGGAATTGTTTCAGCGGATGTGAG
CGAGCACGCCTTGCAACTCGTCAAGCGAGTTGTAGCGAATAACCAACTGGCCTTTGCCCTTGTTGCCATGACGGATCTGC
ACGGCCGAGCCCAGGCGCTCTGCGAGCCGCTGTTCAAGGCGTGCGATATCCGGATCAGGTTTGCTCGGTTCGACCGGATC
This is how the sam file looks like:
@HD VN:1.0 SO:unsorted
@SQ SN:AE015451.2 LN:6181873
@PG ID:hisat2 PN:hisat2 VN:2.1.0 CL:"/var/bin/hisat2-2.1.0/hisat2-align-s --wrapper basic-0 -x genome_pseudomonas_putida_v2 -S test_v2.sam --summary-file test_v2.sam.summary -1 ../NG-33893_Exp2_PP_25h_b_lib709417_10278_4_1_val_1.fq -2 ../NG-33893_Exp2_PP_25h_b_lib709417_10278_4_2_val_2.fq"
A00604:565:HFWKVDSX7:4:1101:4689:1000 83 AE015451.2 179677 1 150M1S = 174221 -5608 AGGAGGTTGGCTTAGAAGCAGCCACCCTTTAAAGAAAGCGTAATAGCTCACTAGTCGAGTCGGCCTGCGCGGAAGATGTAACGGGGCTCAAACCATGCACCGAAGCTACGGGTATCATCTTATGATGATGCGGTAGAGGAGCGTTCTGTAN FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF# AS:i:-1 ZS:i:-1 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:150 YS:i:-1 YT:Z:CP NH:i:5
What I am missing in my code ? is the id of the genomic feature incorrect?? Are the sam and gtf incompatible?
Thank you very much for your help.
Hi
Thanks for the suggestion.. I tried both kinds of sorting.. by name and by aligment (coordinate) but the same error persist when try to use htseq:
htseq-count -t CDS -i gene_id sorted_testv2.sam ncbi_dataset_v2/ncbi_dataset/data/GCA_000007565.2/genomic.gtf > test_v2.count <built-in function delete__Pair_long_obj> returned a result with an error set [Exception type: SystemError, raised in _HTSeq_internal.py:43]
This is what I did:
samtools sort -o sorted_testv2.sam test_v2.sam
I got:
@HD VN:1.0 SO:coordinate @SQ SN:AE015451.2 LN:6181873 @PG ID:hisat2 PN:hisat2 VN:2.1.0 CL:"/var/bin/hisat2-2.1.0/hisat2-align-s --wrapper basic-0 -x genome_pseudomonas_putida_v2 -S test_v2.sam --summary-file test_v2.sam.summary -1 ../NG-33893_Exp2_PP_25h_b_lib709417_10278_4_1_val_1.fq -2 ../NG-33893_Exp2_PP_25h_b_lib709417_10278_4_2_val_2.fq" A00604:565:HFWKVDSX7:4:1257:6686:25817 99 AE015451.2 1 60 24S126M = 41 213 GTTTGGCGTGCGTATTTCCATCACAACTGCTCCTCGGAAGTCGACCAACAAGTCAGCTATGACTTGGCATAATTTGTGCCGACAAAATGCGCGCAGAGTATAGGGGTGGATTAACCCCTATTCAACTCTTCGGTAGTGATTTCCGACTTC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
samtools sort -n -o sorted2_testv2.sam test_v2.sam
I got:
@HD VN:1.0 SO:queryname @SQ SN:AE015451.2 LN:6181873 @PG ID:hisat2 PN:hisat2 VN:2.1.0 CL:"/var/bin/hisat2-2.1.0/hisat2-align-s --wrapper basic-0 -x genome_pseudomonas_putida_v2 -S test_v2.sam --summary-file test_v2.sam.summary -1 ../NG-33893_Exp2_PP_25h_b_lib709417_10278_4_1_val_1.fq -2 ../NG-33893_Exp2_PP_25h_b_lib709417_10278_4_2_val_2.fq" A00604:565:HFWKVDSX7:4:1101:1009:3270 69 AE015451.2 700109 0 * = 700109 0 TAAGACTCGCTTTCGCTACGCCTCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNANNNAGTAGGCTCCCACTGCTTGTACGCATACGGGTTCAGGTTCTATTTAACTCCCC FFFFFFFF:F:FF,:F:FFFFF::######################################################################F###F::FF:F,F,FFFFF,F:,F,,F::FF:,F,,FFF,FFFFFFFFF:FFFFFFF YT:Z:UP YF:Z:NS A00604:565:HFWKVDSX7:4:1101:1009:3270 137 AE015451.2 700109 1 1S150M = 700109 0 NGGGACCAGCCCTTAAGTTGATTTGAGATTAGTGGAACGCTCTGGAAAGTGCGGCCAAAGTGGGTGATAGCCCCGTACACGAAAATCTCTTGTCAATGAAATCGAGTAGGACGGAGCACGAGAAACTTTGTCTGAATATGGGGGGACCATC #FF:FF:FFFFF:FFFFFFFF,F,FFFFFFFFFFFFFF,FFF:FF::FF,FFFFFFF,FFFFFFFF:FF:FFFFFF,FFFFFFFFFF,FFFF,FFFFFF:FFFFF:FFFFFFF:FFFFFFFFFFFFFFF,FFF,FFFF:,:FFFFF:FFFF AS:i:-4 ZS:i:-4 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:56T93 YT:Z:UP NH:i:5
I don't see the -r option added to the htseq command?
Hi!
Thanks again for the answer.
Yes, I missed that part but when I include it it does not work either and produce the same error:
htseq-count -r pos -t CDS -i gene_id sorted_testv2.sam ncbi_dataset_v2/ncbi_dataset/data/GCA_000007565.2/genomic.gtf
built-in function delete__Pair_long_obj> returned a result with an error set [Exception type: SystemError, raised in _HTSeq_internal.py:43]
I was also suggested to use exon instead of CDS and write the code like this:
htseq-count -f sam -i gene_id -r pos -t exon sorted_testv2.sam ncbi_dataset_v2/ncbi_dataset/data/GCA_000007565.2/genomic.gtf > test_v2.count
I got the same error:
built-in function delete__Pair_long_obj> returned a result with an error set
[Exception type: SystemError, raised in _HTSeq_internal.py:43]
Any other idea?
Kind regards
Does the test work?