Nothing will align with HTSeq?
0
0
Entering edit mode
9.2 years ago

Hello,

I'm trying to extract raw count data from a .bam file (output by Tophat2) for use in DESeq2. I'm trying to use HTSeq. Although I can get HTSeq to run, nothing aligns uniquely. The reads themselves are 100 bp long, so I'm surprised that's the case.

I input something like:

htseq-count -f bam -r pos /path/on/cluster/mySampleBam/accepted_hits.bam /path/on/cluster/homoSapGRCh38.gtf

And the output file looks like:

...
ENSG00000282813    0
ENSG00000282814    0
ENSG00000282815    0
ENSG00000282816    0
ENSG00000282817    0
ENSG00000282818    0
ENSG00000282819    0
ENSG00000282820    0
ENSG00000282821    0
ENSG00000282822    0
__no_feature    16996078
__ambiguous    0
__too_low_aQual    0
__not_aligned    0
__alignment_not_unique    2579821

Has anyone had this issue? What might be going on?

I previously sorted the .bam file I'm trying to generate counts for with the command:

samtools sort accepted_hits.bam accepted_hits_sorted
RNA-Seq htseq • 2.6k views
ADD COMMENT
1
Entering edit mode

Just use the -o option and have a look at some of the reads that should be counted. I hope you looked at these in IGV or similar already.

ADD REPLY
0
Entering edit mode

I haven't - what's IGV? I'm new to this, so any suggestions on how to sanity check as I build my pipeline would be appreciated.

ADD REPLY
1
Entering edit mode

This is IGV, it's one of a number of viewers for BAM (and other) files. It's always good to have a brief visual look at things when you get weird results. You can then find some reads that should be counted (but obviously aren't), use the -o option and then grep for the read names and see why they weren't counted.

ADD REPLY
0
Entering edit mode

What's the output of:

samtools view accepted_hits.bam | cut -f5 | uniq -c 
ADD REPLY
0
Entering edit mode

A very long list of numbers that looks like:

10 0
2 1
1 3
2 0
3 3
1 1
1 3
2 1
1 3
1 0
1 1
1 50
1 1
3 0
1 50
3 3
...

EDIT: It has this many lines: 1566241

ADD REPLY
0
Entering edit mode

sorry. This command:

samtools view accepted_hits.bam | cut -f5 | sort -n | uniq -c
ADD REPLY
0
Entering edit mode

Thanks! Output is 4 lines:

1451670 0
1639036 1
2057140 3
33922308 50
ADD REPLY
0
Entering edit mode
head -5 homoSapGRCh38.gtf
samtools view accepted_hits.bam | head -5

Output?

ADD REPLY
0
Entering edit mode

For the .gtf file:

#!genome-build GRCh38.p3
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.18
#!genebuild-last-updated 2015-06

For the .bam file:

MG00HS20:702:HYKM2ADXX:1:2209:18725:32851    419    chr1    12042    0    101M    =    12100    159    CAGGGTGCAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCA    $$$&(&((*(***+)++*)+))$(%))+)(*)))))))++)))(*++)'*&*%'+++++&))**+"%$%%%%&&&''&''&&'&&&&%%%%&&&&&&&%&#    AS:i:0    XN:i:0    XM:i:0    XO:i:0    XG:i:0    NM:i:0    MD:Z:101    YT:Z:UU    XS:A:+    NH:i:5    CC:Z:chr15    CP:i:102519028    HI:i:0
MG00HS20:702:HYKM2ADXX:2:2108:1447:45504    353    chr1    12058    0    101M    =    12253    296    CACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCAC    &&&(('(&&&(&()+++++++*)++++++)++++++++)*+++&&*+*+++++++++++++++++++*****(((&'''&'&&&&&&&''&&'&&''&&&&    AS:i:0    XN:i:0    XM:i:0    XO:i:0    XG:i:0    NM:i:0    MD:Z:101    YT:Z:UU    XS:A:+    NH:i:5    CC:Z:chr15    CP:i:102519012    HI:i:0
MG00HS20:702:HYKM2ADXX:1:1102:20777:96654    99    chr1    12066    0    101M    =    12100    135    TGGAGTTTCCCTGTGGAGACGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAGGC    $$$&&%&(***)*))))'+))*()*+)*+++))++&)%'*&*+(&*+++((*)+&)&&)*++++'($%$%%&'&'''&'!#$&$&&&&&&&&&&&%&&&%&    AS:i:-4    XN:i:0    XM:i:2    XO:i:0    XG:i:0    NM:i:2    MD:Z:8T10G81    YT:Z:UU    XS:A:+    NH:i:5    CC:Z:chr15    CP:i:102519004    HI:i:0
MG00HS20:702:HYKM2ADXX:2:1107:5154:17536    355    chr1    12066    0    101M    =    12100    135    TGGAGTTTCCCTGTGGAGACGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAGGC    $""&&"%&*(****)))*(*+%*)+)+)&(&*')*"%"&&*)'*&&"()+))+***))**+++&***'((('''&'&&'#%&&&'&$&&&&&&%%%$%%%%    AS:i:-4    XN:i:0    XM:i:2    XO:i:0    XG:i:0    NM:i:2    MD:Z:8T10G81    YT:Z:UU    XS:A:+    NH:i:5    CC:Z:chr15    CP:i:102519004    HI:i:0
MG00HS20:702:HYKM2ADXX:1:1102:20777:96654    147    chr1    12100    0    101M    =    12066    -135    GTGGGATGGGCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAGGCATAGGGGAAAGATTGGAGGAAAGATGAGTGAGAG    $&&&&&&%&$"$$%&&&&&&&'''('&'%)&'+++++++)++++++*(('++*(&*&%*(&)&)*))'+++++++&++*+)++++)*'&*(**&((((&%&    AS:i:0    XN:i:0    XM:i:0    XO:i:0    XG:i:0    NM:i:0    MD:Z:101    YT:Z:UU    XS:A:+    NH:i:5    CC:Z:chr15    CP:i:102518970    HI:i:0
ADD REPLY
0
Entering edit mode

I think Goutham Atla wanted to know if the chromosome-naming in your bam file and the GTF are the same: In your sam-file, you have Chromosome One written as chr1. Now you have to look if it's matching with the gtf:

grep chr1 homoSapGRCh38.gtf | head
ADD REPLY
0
Entering edit mode

and what is the output of:

samtools flagstat accepted_hits_sorted.bam
ADD REPLY
0
Entering edit mode
39070154 + 0 in total (QC-passed reads + QC-failed reads)
3333382 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
39070154 + 0 mapped (100.00%:-nan%)
35736772 + 0 paired in sequencing
17868386 + 0 read1
17868386 + 0 read2
30998738 + 0 properly paired (86.74%:-nan%)
35736772 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:-nan%)
43100 + 0 with mate mapped to a different chr
19858 + 0 with mate mapped to a different chr (mapQ>=5)
ADD REPLY

Login before adding your answer.

Traffic: 1769 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6