Question

Mapping Illumina reads against one single gene

2

Entering edit mode

10.6 years ago

User000 ▴ 750

Hello,

I have approx. 250,000,000 Illumina paired-end reads of 100bp length from one organism. I have a gene of 5000bp length from another organism. I wanted to map reads to my sequence and create a consensus sequence. My question is: Is it OK mapping illumina DNA-seq or RNA-seq reads against one single gene? Cause in literature everything is mapped against the whole genome, which in my case is limited due to computer space. Any suggestions are appreciated.

next-gen • 12k views

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by User000 ▴ 750

0

Entering edit mode

Just a thought: Presumably you don't really need to align *all* the 250M reads, unless you need a coverage of millions x (!!). So maybe better to get a sample of few hundred thousand, good quality reads and align them with a sensitive aligner like blast. I mean, NGS aligners are usually a trade off between speed, efficiency and sensitivity/specificity. In your case efficiency is not an issue as the reference is small and you might get away with few reads.

ADD REPLY • link 10.6 years ago by dariober 15k

0

Entering edit mode

But that only works, if the complete genome is small. A few 100k of 250M is not enough, if we are talking about a big genome.

ADD REPLY • link 10.6 years ago by David Langenberger 11k

1

Entering edit mode

Ahhh... Sure I misunderstood the question, I thought the 250M reads came from that gene alone as if they sequenced just that amplicon... Apologies about the confusion.

ADD REPLY • link 10.6 years ago by dariober 15k

Ram · Answer 1 · 2015-01-21

1

Entering edit mode

10.6 years ago

GouthamAtla 12k

There is nothing wrong in mapping the reads to a single gene. But try to use bwa-mem instead of bowtie2 as it is different organism. You can also try to do a local blast (convert fastq to fasta) against the gene of interest. But you should be really careful with reads that are generated partially from gene of interest.

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by GouthamAtla 12k

0

Entering edit mode

Thank you for the answer. Indeed, I used BWA. So in order to save time and space If I know the coordinates, I can just take my short sequence and map against it.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by User000 ▴ 750

0

Entering edit mode

you can use bedtools getfasta to get fasta sequence from regions of interest (in a bed format).

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by GouthamAtla 12k

0

Entering edit mode

Thanks, what do you mean by 'reads that are generated partially from gene of interest'?

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by User000 ▴ 750

1

Entering edit mode

if its a whole genome sequence data, the entire read might not have generated only from genic region. So first or last part of read is from gene and rest is from non-genic region. When you map this read only to gene, it will map partially, and hence you should look at CIGAR string and keep in mind about soft/hard clipping.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by GouthamAtla 12k

Ram · Answer 2 · 2015-01-21

1

Entering edit mode

10.6 years ago

David Langenberger 11k

I don't know the question you want to answer, but perhaps it would make sense to add this one gene to the whole genome sequence of the species with which you did the experiment (assuming you have it). Just add it to the whole genome fasta as a new fasta entry in the end and create the index over the new fasta file. This way, you would assure that reads that have a better hit in the whole genome will not be mis-places (with more errors) to the one gene. But if you do that, you should allow multiple mapping loci, or turn of multiple mapping completely (depending on the tool you use).

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by David Langenberger 11k

0

Entering edit mode

but if you add it to whole genome, definitely reads will map to two regions, one to the genome (region where actual gene is present) and to the extra added gene. SO I think we should definitely enable to report multiple mapping locations.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by GouthamAtla 12k

0

Entering edit mode

I also think that you should enable the multiple mappings.

But as I said: It highly depends on the question you want to answer.

Lets say User000 wants to find only reads that map uniquely to this one gene and nowhere else. Question would be: Do I have that gene present in my organism. Or: Was that gene introduced to my genome, or not. Then it would be good to turn of the multiple mapping option, since most tools need much more time with this parameter turned on.

But I have no clue, what the question might be! ;)

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by David Langenberger 11k

0

Entering edit mode

I want to see if that particular gene is present in sequenced organism, basically, this is why I am interested in doing mapping of reads against only one gene.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by User000 ▴ 750

0

Entering edit mode

and if I do not have the whole genome sequence..? Basically, the question is, does it make sense to map illumina seq reads using as a reference genome just one single gene in fasta format..

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by User000 ▴ 750

1

Entering edit mode

It makes sense, but don't trust mappings with too many errors, since they have a high chance to be wrong mapping loci.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by David Langenberger 11k

0

Entering edit mode

Yes. you can.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by GouthamAtla 12k