Question

Bacterial Genome-Transcriptome mapping anomaly

0

Entering edit mode

9.1 years ago

SJ Basu ▴ 60

Hello,

I am doing snp and DGE analysis of WTA data from two samples from E.Coli K12 strain. When I mapped the reads on to the reference genome, I got >97% mapping for both samples...but when I mapped those same reads on to the reference transcriptome, I got <25% of reads mapped for both samples...

To check if I downloaded the right transcriptome, I mapped the transcriptome sequences on to the genome and it showed 99% mapped and Bowtie2 was used to perform the mappings.

Unlike eukaryotic genome, they don't even have splice junction, then what can be the case...Can somebody please explain this anomaly ???!!!!

RNA-Seq genome transcriptome SNP prokaryotic • 2.8k views

ADD COMMENT • link 9.1 years ago by SJ Basu ▴ 60

0

Entering edit mode

You should post all programs you used for this analysis (I see you listed Bowtie2) and their version (if relevant), the reference genome identifier, and the reference transcriptome identifier as minimal information. Avoid acronyms where possible and replace them with the full length terminology. Also include the site from which you obtained the reference files.

ADD REPLY • link 9.1 years ago by Steven Lakin ★ 1.8k

score 4 · Accepted Answer · 2016-07-13

4

Entering edit mode

9.1 years ago

Asaf 10k

Maybe it's the effect of operons or UTRs. Have a look at the transcriptome and see if it's really transcriptome or just coding sequences, in bacteria it's mostly the second option. If a read spans two genes in an operon and your reads are paired-end you won't get a lot of alignment. The same thing happens with UTRs, if the read falls at the end of the transcript (which happens quite a lot) you won't see it. Another option is that the transcriptome file you have doesn't contain ncRNAs (tRNAs, rRNAs, sRNAs etc.) they take up a lot of the reads.

ADD COMMENT • link 9.1 years ago by Asaf 10k

0

Entering edit mode

Hello Asaf,

I seriously think that can be the reason because I downloaded cDNA from this link ftp://ftp.ensemblgenomes.org/pub/bacteria/release-31/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/cdna/ ....

1) my read length was 2X150, can it cover more than a gene in an operon ???!!!!...

2) the transcriptome file, to which I mapped the reads, showed considerable mapping (out of ~4400 "gene", ~4300 "gene" showed reads mapping to them) ....can I go ahead to differential gene expression with these because I can't find any other source of transcriptome (at least a reliable source) for my strain ??? (I've used samtools to create read count file and DESeq to perform dge)

ADD REPLY • link 9.1 years ago by SJ Basu ▴ 60

1

Entering edit mode

I would recommend you to: 1. Map to the UCSC genome browser genome and upload your bam to the browser, you'll be able to see where your missing reads map to. 2. Map to the genome and count the number of reads mapping to each gene. Beware though that reads mapping to more than one gene will be discarded if you use HTseq-count, I wrote a little script that in this case count the read for each gene, it makes the DESeq2 results a bit of a lie but it works. Alternatively, count the reads mapping to operons, you can get these from e.g.: http://csbl.bmb.uga.edu/DOOR/displayNCoperon.php?id=1944&page=1&nc=NC_000913#tabs-1 To answer your point 1: yes, it doesn't matter how much you read, the fragment size matters and if it's large (hundreds of bases) it might fall on two genes. Reading 150x2 for DE is a huge overhead 50x1 would do.

ADD REPLY • link 9.1 years ago by Asaf 10k