Question

Most reads counted by HTSeq are "unaligned"

0

Entering edit mode

8.3 years ago

dazhudou1122 ▴ 140

Hey guys,

I am new to RNA-Seq. I am trying to map some human tissue (contains bacteria) reads to a bacterial genome. I have used bowtie2 to filter all the remaining human reads, and mapped to a reference genome using:

bowtie2 -x /Users/winterlab/Desktop/Sequencing/RNA_seq_pipeline/bowtie2-2.2.9/Indexes/Fusobacterium_ATCC25586 -1 /Users/winterlab/Desktop/Sequencing/Fuso_RNA-seq_raw-data/Filtered_reads/SRR3170861.fastq -2 /Users/winterlab/Desktop/Sequencing/Fuso_RNA-seq_raw-data/Filtered_reads/SRR3170862.fastq -S /Users/winterlab/Desktop/Sequencing/Fuso_RNA-seq_raw-data/Reads_mapped_to_Fuso/FusoSRR317086.sam -p 3

And if I do have some reads mapped (using samtools flagstat) :

3615588 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
32849 + 0 mapped (0.91% : N/A)
3615588 + 0 paired in sequencing
1807794 + 0 read1
1807794 + 0 read2
654 + 0 properly paired (0.02% : N/A)
15826 + 0 with itself and mate mapped
17023 + 0 singletons (0.47% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

But when I used HTSeq to count reads, mots of the reads are un-aligned:

(python -m HTSeq.scripts.count -s no -i ID /Users/winterlab/Desktop/Sequencing/Fuso_RNA-seq_raw-data/Reads_mapped_to_Fuso/sorted.sam /Users/winterlab/Desktop/Sequencing/RNA_seq_pipeline/Reference_genome_and_gff/ATCC_25586.gff)

4174 GFF lines processed.
Warning: Malformed SAM line: RNAME != '*' although flag bit &0x0004 set
Warning: Malformed SAM line: MRNM == '=' although read is not aligned.
Warning: Malformed SAM line: MRNM != '*' although flag bit &0x0008 set
100000 SAM alignment record pairs processed.
200000 SAM alignment record pairs processed.
300000 SAM alignment record pairs processed.
400000 SAM alignment record pairs processed.
...
...
id10    0
id11    0
id12    0
id13    0
id14    0
id15    0
id16    0
id17    0
id18    0
id19    0
...
...
__no_feature    13
__ambiguous 0
__too_low_aQual 24922
__not_aligned   1782858
__alignment_not_unique  0

The first few lines of the gff file: ##gff-version 3

 #!gff-spec-version 1.21
    #!processor NCBI annotwriter
    #!genome-build ASM732v1
    #!genome-build-accession NCBI_Assembly:GCF_000007325.1
    ##sequence-region NC_003454.1 1 2174500
    ##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=190304
    NC_003454.1 RefSeq  region  1   2174500 .   +   .   ID=id0;Dbxref=ATCC:25586,taxon:190304;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=ATCC 25586;sub-species=nucleatum
    NC_003454.1 RefSeq  gene    77  709 .   -   .   ID=gene0;Dbxref=GeneID:992632;Name=FN1496;gbkey=Gene;gene_biotype=protein_coding;locus_tag=FN1496
    NC_003454.1 RefSeq  CDS 77  709 .   -   0   ID=cds0;Parent=gene0;Dbxref=Genbank:NP_602323.1,GeneID:992632;Name=NP_602323.1;gbkey=CDS;product=spore coat protein regulator protein YlbO;protein_id=NP_602323.1;transl_table=11
    NC_003454.1 RefSeq  gene    1104    2228    .   -   .   ID=gene1;Dbxref=GeneID:992723;Name=FN1497;gbkey=Gene;gene_biotype=protein_coding;locus_tag=FN1497
    NC_003454.1 RefSeq  CDS 1104    2228    .   -   0   ID=cds1;Parent=gene1;Dbxref=Genbank:NP_602324.1,GeneID:992723;Name=NP_602324.1;gbkey=CDS;product=multidrug resistance protein 2;protein_id=NP_602324.1;transl_table=11
    NC_003454.1 RefSeq  gene    2703    3602    .   +   .

First few lines of sam file :

@HD VN:1.0  SO:queryname
@SQ SN:NC_003454.1  LN:2174500
@PG ID:bowtie2  PN:bowtie2  VN:2.2.9    CL:"/Users/winterlab/Desktop/Sequencing/RNA_seq_pipeline/bowtie2-2.2.9/bowtie2-align-s --wrapper basic-0 -x /Users/winterlab/Desktop/Sequencing/RNA_seq_pipeline/bowtie2-2.2.9/Indexes/Fusobacterium_ATCC25586 -S /Users/winterlab/Desktop/Sequencing/Fuso_RNA-seq_raw-data/Reads_mapped_to_Fuso/FusoSRR317089.sam -p 3 -1 /Users/winterlab/Desktop/Sequencing/Fuso_RNA-seq_raw-data/Filtered_reads/SRR3170891.fastq -2 /Users/winterlab/Desktop/Sequencing/Fuso_RNA-seq_raw-data/Filtered_reads/SRR3170892.fastq"
SRR317089.45    77  *   0   0   *   *   0   0   CTTGGCTGTGGTTTCGCTGG    BCC?B@CC=@C@CCC@@@@@    YT:Z:UP
SRR317089.45    141 *   0   0   *   *   0   0   AGGGGAATCCGACTGTTTAATTAAAACA    @@@@@=?@0@@@@@@@@@=@@@@@@@@@    YT:Z:UP
SRR317089.48    77  *   0   0   *   *   0   0   AACGATCAGAGTAGTGGTATTTCACCGGCGGCCCGCAGGGCCG CCCAACA>ACCCBCCC@8@CB?<>BBB?@BBB@BBABBB<-?@ YT:Z:UP
SRR317089.48    141 *   0   0   *   *   0   0   CCCTACCTACTATCCAGCGAAACCACAGCCAAGGNAACGGGATTGGCGGAATCAGCGGG BBBB>B*BB><@B>@BBBBBBBB@BB=@BB0/00!68688B+B7<@@:@77;@3B6@>@ YT:Z:UP
SRR317089.49    77  *   0   0   *   *   0   0   AAGAAACTAACCAGGATTCCCTCAGTAACGGCGAGTGCACAGGGAAGAGCCCCGCGCC  @BB:BBB?*AAA?AA?:AAAAA?AA>B:BBAA?5A,<+;7-772+95499?;88?9??  YT:Z:UP
SRR317089.49    141 *   0   0   *   *   0   0   CGTGCCGGCATTTAGCCTTAGATGGAGTTTACCACCCGCTTTGGGCTGCATTCCCAAGCAACCC    ??B@?AAA3ABBBA<BB@BAA2A<:BBBBB8???;B=BB<*8B87A=?A68/?7?-/.><7?<?    YT:Z:UP
SRR317089.56    77  *   0   0   *   *   0   0   CATCGTTTATGGTCGGAACTACGACGGTATCTGATCGTCTTCGAACCTC   @@CC@=CC?C7789<::::2>>9+9B?1??BAA9=AA5A<2?B?5AA3A   YT:Z:UP
SRR317089.56    141 *   0   0   *   *   0   0   CGCAGCTAGGAATAATGGAATAGGACCGCGGTTCTATTTTGTTGGTTTTCGGAACTGAGGCCATGATGAAGAG   BAAB<A??<B??9<B6699>BA6@BA<A6A?=BA<<?B85??2988<8(9>>89;40??19???86:+3/?3?   YT:Z:UP
SRR317089.58    77  *   0   0   *   *   0   0   GAACAAGATGCAGAACCATGGCTATGAGAACCCCACCTACAAATACCTGGAGCAG AA>A?8>>.>47>09A8B?;8**<<9+.95:84?86(379;/:A?6?3?3'>7=@ YT:Z:UP

So I made sure the chromosome names are matching and everything so I cant figure out why.

Any input is highly appreciated:)

RNA-Seq alignment • 3.7k views

ADD COMMENT • link updated 8.3 years ago by GenoMax 147k • written 8.3 years ago by dazhudou1122 ▴ 140

0

Entering edit mode

< 1% reads mapped should be the first cause for concern.

Your filter step appears to have either failed or removed the bacterial reads as well. Look into a read-binning strategy using BBsplit and the original data files.

ADD REPLY • link 8.3 years ago by GenoMax 147k

0

Entering edit mode

Thanks for all the inputs guys! I appreciate it! I did what genomax2 and WouterDeCoster suggested, and skipped the "filtering human reads" step but it seems that the result is still pretty much the same:

id10    0
id11    0
id12    0
id13    0
id14    0
id15    0
id16    0
id17    0
id18    0
id19    0
id2 0
id20    0
id21    0
id22    0
id23    0
id24    0
id25    0
id26    0
id27    0
id28    0
id29    0
id3 0
id30    0
id31    0
id32    0
id33    0
id34    0
id35    0
id36    0
id37    0
id38    0
id39    0
id4 0
id40    0
id41    0
id42    0
id43    0
id44    0
id45    39
id46    0
id47    0
id48    0
id49    0
id5 0
id50    0
id51    0
id52    0
id53    0
id56    0
id57    0
id58    0
id59    0
id6 0
id60    0
id61    0
id62    0
id64    0
id65    0
id66    0
id7 0
id8 1
id9 0
__no_feature    45
__ambiguous 0
__too_low_aQual 30850
__not_aligned   2057335
__alignment_not_unique  0

As for the low mapped reads, it is kind of expected, because I am mapping human tissue reads to one single bacterial genome (there are a lot other bacteria present there). With that being said, I should have 2088270*0.96% overall alignment rate=20047 reads. That is the case when I searched for "NC_003454.1" in the sam file. I am so confused...

ADD REPLY • link updated 8.3 years ago by GenoMax 147k • written 8.3 years ago by dazhudou1122 ▴ 140

0

Entering edit mode

Is the chromosome ID (should be only one or two if there is a plasmid) matching in your sam file and the annotation file you are using?

ADD REPLY • link 8.3 years ago by GenoMax 147k

0

Entering edit mode

Thanks for the comments, genomax2! Can you specify what is the chromosome ID here? I am not sure that I understand correctly. Here are the first few lines of the gff file:

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM732v1
#!genome-build-accession NCBI_Assembly:GCF_000007325.1
##sequence-region NC_003454.1 1 2174500
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=190304
NC_003454.1 RefSeq  region  1   2174500 .   +   .   ID=id0;Dbxref=ATCC:25586,taxon:190304;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=ATCC 25586;sub-species=nucleatum
NC_003454.1 RefSeq  gene    77  709 .   -   .   ID=gene0;Dbxref=GeneID:992632;Name=FN1496;gbkey=Gene;gene_biotype=protein_coding;locus_tag=FN1496
NC_003454.1 RefSeq  CDS 77  709 .   -   0   ID=cds0;Parent=gene0;Dbxref=Genbank:NP_602323.1,GeneID:992632;Name=NP_602323.1;gbkey=CDS;product=spore coat protein regulator protein YlbO;protein_id=NP_602323.1;transl_table=11
NC_003454.1 RefSeq  gene    1104    2228    .   -   .   ID=gene1;Dbxref=GeneID:992723;Name=FN1497;gbkey=Gene;gene_biotype=protein_coding;locus_tag=FN1497
NC_003454.1 RefSeq  CDS 1104    2228    .   -   0   ID=cds1;Parent=gene1;Dbxref=Genbank:NP_602324.1,GeneID:992723;Name=NP_602324.1;gbkey=CDS;product=multidrug resistance protein 2;protein_id=NP_602324.1;transl_table=11
NC_003454.1 RefSeq  gene    2703    3602    .   +   .   ID=gene2;Dbxref=GeneID:992745;Name=FN1498;gbkey=Gene;gene_biotype=protein_coding;locus_tag=FN1498
NC_003454.1 RefSeq  CDS 2703    3602    .   +   0   ID=cds2;Parent=gene2;Dbxref=Genbank:NP_602325.1,GeneID:992745;Name=NP_602325.1;gbkey=CDS;product=hypothetical protein;protein_id=NP_602325.1;transl_table=11
NC_003454.1 RefSeq  gene    3787    5226    .   +   .   ID=gene3;Dbxref=GeneID:991278;Name=FN1499;gbkey=Gene;gene_biotype=protein_coding;locus_tag=FN1499
NC_003454.1 RefSeq  CDS 3787    5226    .   +   0   ID=cds3;Parent=gene3;Dbxref=Genbank:NP_602326.1,GeneID:991278;Name=NP_602326.1;gbkey=CDS;product=cell surface protein;protein_id=NP_602326.1;transl_table=11
NC_003454.1 RefSeq  gene    5283    6017    .   -   .   ID=gene4;Dbxref=GeneID:992647;Name=FN1500;gbkey=Gene;gene_biotype=protein_coding;locus_tag=FN1500
NC_003454.1 RefSeq  CDS 5283    6017    .   -   0   ID=cds4;Parent=gene4;Dbxref=Genbank:NP_602327.1,GeneID:992647;Name=NP_602327.1;gbkey=CDS;product=nickel transport ATP-binding protein NikE;protein_id=NP_602327.1;transl_table=11
NC_003454.1 RefSeq  gene    6018    6800    .   -   .

ADD REPLY • link updated 8.3 years ago by GenoMax 147k • written 8.3 years ago by dazhudou1122 ▴ 140

1

Entering edit mode

Could you show us the chromosomes/contigs/... as those are present in the sam file, using samtools idxstats?

ADD REPLY • link 8.3 years ago by WouterDeCoster 47k

0

Entering edit mode

It seems that samtools idxstats only works on bam files. Because I set the output of the bowtie2 alignment as sam file, I did not convert them to bam. Do you think that could be a problem? Also, because I can open sam file in a text editor, I post the first few lines of the sam file in here:

@HD VN:1.0  SO:unsorted
@SQ SN:NC_003454.1  LN:2174500
@PG ID:bowtie2  PN:bowtie2  VN:2.2.9    CL:"/Users/winterlab/Desktop/Sequencing/RNA_seq_pipeline/bowtie2-2.2.9/bowtie2-align-s --wrapper basic-0 -x /Users/winterlab/Desktop/Sequencing/RNA_seq_pipeline/bowtie2-2.2.9/Indexes/Fusobacterium_ATCC25586 -S /Users/winterlab/Desktop/Sequencing/Fuso_RNA-seq_raw-data/Unfiltered_reads_mapped_to_Fuso/FusoCleanSRR317086.sam -p 3 -1 /Users/winterlab/Desktop/Sequencing/Fuso_RNA-seq_raw-data/paired_end_reads/CleanSRR317086_1.fastq -2 /Users/winterlab/Desktop/Sequencing/Fuso_RNA-seq_raw-data/paired_end_reads/CleanSRR317086_2.fastq"
SRR317086.602   77  *   0   0   *   *   0   0   GTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGTACGCACGGCCGGTACAGTGAAACTGCGAATGG    CABCCCCCCBCACCCBBBB>BBBBBBBBBBBBBBBBBBB@BBBBBBBBBBBBBBBBABBBBBB>B4BB>BBBAB@B    YT:Z:UP
SRR317086.602   141 *   0   0   *   *   0   0   GGCCGGAGAGGGGCTGACCGGGTTGGTTTTGATCTGATAAATGCACGCATCCCCCC    BBBBBB?B@BBBBBBBA>BBBB9AAAAAABBA<BBBABAABBB@B@@BBBBBBBBA    YT:Z:UP
SRR317086.603   77  *   0   0   *   *   0   0   GGAGACACCGCCGCAGTTGCCGGTACATCGGGGATT    <<>>:;;><>:<;<+?????AABA>877<<798(3?    YT:Z:UP
SRR317086.603   141 *   0   0   *   *   0   0   GGTTTCAGTTAAGCCTTGAATAATACAGTCTTGAAACTGAGTAGGGTCAAACCTCTCTTTTTCATCTCTTTTTCTA    ?B>B?B?B>?>>???BBB:AB?>B9>??>?BBBB@<BBB9BBBBAB0>9>>>>>>B<AABA==AB<<BBB948186    YT:Z:UP
SRR317086.601   77  *   0   0   *   *   0   0   CCAGAGTCTCGTTCGTTATCGGAATTAACCAGACAAATCGCTCCACCAACTAAGAACGGCCATGCACCACCACCCA    CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCACCCCCCCC    YT:Z:UP
SRR317086.601   141 *   0   0   *   *   0   0   GGGGAGTATGGTTGCAAAGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAA    CCCCDCBCCCCCCCCCCCCCCCDDCCCCC@CCCBACCCBCCDB??BBBBBDDDB>CCACBDDADAABD?AB?A>@<    YT:Z:UP
SRR317086.604   77  *   0   0   *   *   0   0

ADD REPLY • link updated 8.3 years ago by GenoMax 147k • written 8.3 years ago by dazhudou1122 ▴ 140

0

Entering edit mode

I blasted GTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGTACGCACGGCCGGTACAGTGAAACTGCGAATGG, turns out to be ribosomal RNA. How was the library prep performed? Ribo depletion or poly A selection?

ADD REPLY • link 8.3 years ago by WouterDeCoster 47k

0

Entering edit mode

The library was prep using Ribo depletion.

ADD REPLY • link 8.3 years ago by dazhudou1122 ▴ 140

1

Entering edit mode

Chromosome id is NC_003454.1. You can either use @Wouter's command or show us the output of samtools view -H your_file.sam.

ADD REPLY • link 8.3 years ago by GenoMax 147k

0

Entering edit mode

I did what you suggested, and here is the out put:

@HD VN:1.0  SO:unsorted
@SQ SN:NC_003454.1  LN:2174500
@PG ID:bowtie2  PN:bowtie2  VN:2.2.9    CL:"/Users/winterlab/Desktop/Sequencing/RNA_seq_pipeline/bowtie2-2.2.9/bowtie2-align-s --wrapper basic-0 -x /Users/winterlab/Desktop/Sequencing/RNA_seq_pipeline/bowtie2-2.2.9/Indexes/Fusobacterium_ATCC25586 -S /Users/winterlab/Desktop/Sequencing/Fuso_RNA-seq_raw-data/Unfiltered_reads_mapped_to_Fuso/FusoCleanSRR317086.sam -p 3 -1 /Users/winterlab/Desktop/Sequencing/Fuso_RNA-seq_raw-data/paired_end_reads/CleanSRR317086_1.fastq -2 /Users/winterlab/Desktop/Sequencing/Fuso_RNA-seq_raw-data/paired_end_reads/CleanSRR317086_2.fastq"

ADD REPLY • link updated 8.3 years ago by GenoMax 147k • written 8.3 years ago by dazhudou1122 ▴ 140

0

Entering edit mode

At least the chromosome name is matching in your sam and annotation file. I think counting may not be working because your sam file is not sorted (look at the header). I suggest that you sort your file using samtools sort (convert it to BAM) and then count.

Alternatively you could use featureCounts to do the counting. It can sort the files as needed.

ADD REPLY • link 8.3 years ago by GenoMax 147k

0

Entering edit mode

Thank you so much, genomax2! You are life saver! featureCounts worked beautifully!

ADD REPLY • link 8.3 years ago by dazhudou1122 ▴ 140

score 0 · Answer 1 · 2016-08-29

0

Entering edit mode

8.3 years ago

WouterDeCoster 47k

The problem is not in the HTSeq-count step, the reads really are mostly unmapped. Something went wrong in the mapping... Why do you use bowtie and not tophat/hisat/star? Bowtie isn't aware of splice sites which will likely be your problem.

ADD COMMENT • link 8.3 years ago by WouterDeCoster 47k

0

Entering edit mode

OP is using bowtie2 (with a bacterial reference). It appears that the "filtering human reads out", referred to by OP, must not have worked right.

ADD REPLY • link 8.3 years ago by GenoMax 147k

0

Entering edit mode

You're right, should have read it more focussed :(

ADD REPLY • link 8.3 years ago by WouterDeCoster 47k

0

Entering edit mode

Thanks for all the inputs guys! I appreciate it! I did what genomax2 and WouterDeCoster suggested, and skipped the "filtering human reads" step but it seems that the result is still pretty much the same: id10 0 id11 0 id12 0 id13 0 id14 0 id15 0 id16 0 id17 0 id18 0 id19 0 id2 0 id20 0 id21 0 id22 0 id23 0 id24 0 id25 0 id26 0 id27 0 id28 0 id29 0 id3 0 id30 0 id31 0 id32 0 id33 0 id34 0 id35 0 id36 0 id37 0 id38 0 id39 0 id4 0 id40 0 id41 0 id42 0 id43 0 id44 0 id45 39 id46 0 id47 0 id48 0 id49 0 id5 0 id50 0 id51 0 id52 0 id53 0 id56 0 id57 0 id58 0 id59 0 id6 0 id60 0 id61 0 id62 0 id64 0 id65 0 id66 0 id7 0 id8 1 id9 0 __no_feature 45 __ambiguous 0 __too_low_aQual 30850 __not_aligned 2057335 __alignment_not_unique 0

As for the low mapped reads, it is kind of expected, because I am mapping human tissue reads to one single bacterial genome (there are a lot other bacteria present there). With that being said, I should have 2088270*0.96% overall alignment rate=20047 reads. That is the case when I searched for "NC_003454.1" in the sam file. I am so confused...

ADD REPLY • link 8.3 years ago by dazhudou1122 ▴ 140