Best Existing Tool For Computing The Overlaps Between Each Read In A Set Of Short Reads?
12.4 years ago
Tianyang Li ▴ 500

Hi,

What is the best tool for computing the overlaps between each read in a set of short reads?

For example, I have a set of reads from an RNA-Seq project, and for a specific read from that set of reads, I'd like to find all the reads that overlap significantly with the chosen read.

It should also be noted that I'm doing this de novo, so no reference is available.
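
To make "overlap" concrete in this reference-free setting, here is a minimal Python sketch of what is being asked for: given one chosen read, report every other read whose prefix matches its suffix, or vice versa. It only handles exact matches and does a brute-force scan, so it illustrates the problem rather than offering a tool for real data. The function names, the `min_len` threshold, and the toy reads are all made up for the example.

```python
def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    # Try the longest candidate first so the first match is the best one.
    for length in range(max_len, max(min_len, 1) - 1, -1):
        if a[len(a) - length:] == b[:length]:
            return length
    return 0


def find_overlaps(query_name, reads, min_len=3):
    """List (read_name, orientation, overlap_length) for reads overlapping the query.

    `reads` maps read names to sequences. Orientation "query->read" means the
    read extends the query to the right; "read->query" means it extends it to
    the left. Reverse complements and mismatches are ignored in this sketch.
    """
    query = reads[query_name]
    hits = []
    for name, seq in reads.items():
        if name == query_name:
            continue
        right = suffix_prefix_overlap(query, seq, min_len)
        left = suffix_prefix_overlap(seq, query, min_len)
        if right:
            hits.append((name, "query->read", right))
        if left:
            hits.append((name, "read->query", left))
    return hits


# Hypothetical toy reads, not real data.
reads = {
    "r1": "ACGTACGGA",
    "r2": "CGGATTTC",
    "r3": "TTGACGTA",
    "r4": "GGGGGGGG",
}
print(find_overlaps("r1", reads))
```

Each query costs one pass over all reads here, which is why a purpose-built index (as in Bowtie or BWA) or an assembler's overlap stage matters at RNA-Seq scale.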

I don't know if BLAST's fast enough for this task, because in my past experience with BLAST it seems rather slow.

I've seen this paper, but I'd like something along the lines of Bowtie or BWA, and I haven't found such a tool.

Thanks!

overlap read alignment short • 4.2k views

I think you mean assembling reads into contigs and getting scaffolds? Isn't this what a de novo assembler like Trinity does? Otherwise, what do you mean by overlap of reads?


I'm actually using assembly results from Trinity, but the contigs are a bit short and Trinity doesn't do scaffolding.

12.4 years ago

Is there any reason why you can't use bwa? Treat your read as the "reference genome" and map the other reads to it? Zam


One reason is that from BWA's documentation (although I haven't tested it yet), it only accepts FASTQ as query sequences, and it's designed for mapping short reads to long references.


Oh. Well Stampy will definitely do it, and allows you to map FASTA.

12.4 years ago

If you have reads and coordinates for them, why not use intersectBed from bedtools? BWA and Bowtie are more for mapping reads to a reference genome, so I am not getting the point. Try mapping your RNA-Seq reads using TopHat, which uses Bowtie, and then convert accepted_hits.bam (the reads mapping to your organism's genome) to a BED file (e.g. using bamToBed). Then make a subset of reads from fileA and use it to intersect with fileB. Check out the manual; it's easy to understand.

Cheers


I forgot to mention that I'm doing this de novo, so no reference genome is available.

12.4 years ago

Usually the word overlap refers to overlapping intervals. Each read is mapped to genomic coordinates and then one looks at overlapping intervals of these coordinates. Some of the answers assume that this is what you meant.
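
As a small illustration of this interval sense of "overlap", here is a sketch in Python. It assumes BED-style coordinates (0-based start, exclusive end) on the same chromosome; the function name and the coordinates are hypothetical.

```python
def interval_overlap(a, b):
    """Number of bases shared by two half-open intervals given as (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


# Hypothetical mapped reads on one chromosome.
read_a = (100, 150)
read_b = (140, 190)
read_c = (200, 250)

print(interval_overlap(read_a, read_b))
print(interval_overlap(read_a, read_c))
```

This only works once every read has coordinates, which is exactly what is missing without a reference.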

I think what you seem to want to do is align one read against all other reads with BLAST or some other aligner. That does not really sound like a good idea at all; it won't work the way you expect it to.

What you probably want is to do de novo transcriptome assembly. This will create contigs from the overlapping reads, and while the information on the overlaps is not usually something that people need or want, depending on the tool you should be able to extract it from the intermediate data.


(+1) The requirement is a de novo assembler.


I'm actually trying to make the contigs from the de novo assembler better. Assemblers such as Trinity produce contigs that are usually a bit shorter because of errors in the reads.

12.4 years ago
Rm 8.3k
1. Use FASTA/Q Collapser, CD-HIT / uclust, or both (more info).

2. For reads without overlap, run megablast: it's much faster than regular blast.

Convert FASTQ to FASTA --> format as a database --> use the FASTA file to search the formatted DB.

To optimize speed, play with options like:

-p Identity percentage cut-off

-a Number of processors to use [Integer]

-e Expectation value [Real]
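
The first part of that pipeline (collapse duplicates, convert FASTQ to FASTA) can be sketched in plain Python as below. This is only an approximation of what a collapser does: it merges exact duplicates, whereas CD-HIT and uclust also cluster near-identical sequences. It assumes strict four-line FASTQ records, and the function names and the `>rank-count` header convention are choices made for this example.

```python
import io
from collections import Counter


def fastq_sequences(handle):
    """Yield the sequence line of each record in a strict 4-line FASTQ stream."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # line 2 of every 4-line record holds the bases
            yield line.strip()


def collapse_to_fasta(fastq_handle, fasta_handle):
    """Write one FASTA record per distinct read, most abundant first.

    Headers have the form >rank-count. Returns the number of distinct reads.
    """
    counts = Counter(fastq_sequences(fastq_handle))
    for rank, (seq, n) in enumerate(counts.most_common(), start=1):
        fasta_handle.write(f">{rank}-{n}\n{seq}\n")
    return len(counts)


# Hypothetical three-read FASTQ held in memory instead of a file.
fastq = io.StringIO(
    "@a\nACGT\n+\nIIII\n"
    "@b\nACGT\n+\nIIII\n"
    "@c\nTTTT\n+\nIIII\n"
)
fasta = io.StringIO()
print(collapse_to_fasta(fastq, fasta))
print(fasta.getvalue(), end="")
```

The resulting FASTA file is what would then be formatted as a database and searched with megablast, so fewer distinct sequences means less search time.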


Do you think clustering the reads that weren't assembled into contigs because of errors during de novo assembly is a good approach? I think the fact that those reads could be aligned to the contigs but not assembled into the contigs is a sign that using a k-mer approach might fail.


Some reads, even though they overlap, might be skipped by assemblers (depending on the thresholds used) because they don't have enough overlapping reads (coverage). BTW, the clustering step I suggested above is to reduce the computational time for the megablast runs.
