Extending ends of sequences with the help of reads?
3
3
Entering edit mode
9.4 years ago
manekineko ▴ 150

Hi,

I have some hundreds of sequences and library of reads, How is possible to extend these sequences with the reads from the both end of the sequences? But I do not want the software to merge the sequences as they are separate things, just to extend them with the help of the reads as much as possible?

Assembly • 6.7k views
ADD COMMENT
0
Entering edit mode

What sort of format would you like the output to be? I assume you're using SAM or BAM as input.

ADD REPLY
0
Entering edit mode

I have the input sequences want to extend FASTA, reads in FASTA.

(I also have the sequences GFF and the reads mapped in BAM if needed).

I would like to have the extended sequences fasta and/or GFF.

ADD REPLY
0
Entering edit mode

Ah, now I understand what you're trying to do. Stated somewhat more clearly, you have incomplete contigs (presumably from another attempt at assembly) and want to extend them given an alignment to them. Perhaps you also want to do some scaffolding but that's unclear. If that's the case, you might mention that and one of the more assembly-experienced folks here can chime in.

ADD REPLY
0
Entering edit mode

A bit more different - trying to assemble an mRNA transcripts as I have dome part of each one (somewhere in the middle of it) and have a lots of reads

ADD REPLY
10
Entering edit mode
9.4 years ago

I recently developed an assembler, Tadpole, for this problem. It's part of the BBMap package. Usage:

tadpole.sh in=sequences.fa extra=reads.fq out=extended.fa extendleft=200 extendright=200 ibb=f mode=extend

It will extend the sequences using kmer counts. You can set the length to extend as long as you want, but that's just an upper limit; it will stop earlier if it hits a branch.

(edit: modified command line to include mode=extend which I had forgotten)

(edit2: modified for updated syntax in v37.33+)

ADD COMMENT
0
Entering edit mode

Hi, seems what I need, is it gona working if my reads are not fastq but fasta? And if each sequence can be extended differently depending on reads, how to figure out the extendleft and extendright values?

ADD REPLY
1
Entering edit mode

It accepts fasta and fastq. And you can just set the extendleft and extendright numbers to something high like 2000 if you want. They will only be extended until a branch is encountered in the DeBruijn graph, which depends on the organism. So, if you set them to 10, almost all the sequences will be extended by 10bp. If you set it to 1 million, none of them will be extended to one million - they'll all stop somewhere before that, since the extension will only continue as long as there is a single unambiguous best path (according to the thresholds you set). Therefore, just set them to a number X such that you don't want anything to extend more than X.

ADD REPLY
0
Entering edit mode

Is this possible to use long-reads in tadpole?

ADD REPLY
0
Entering edit mode

Hi all, I need to clarify this. The syntax has changed a little for extending existing contigs using new reads... I'll update it later today.

ADD REPLY
0
Entering edit mode

Hi! I have two sets of reads: mit_1.fastq and mit_2.fastq and I want to use them to extend my contigs.

I want to use tadpole but the parameter extra=reads.fq confuses me. How should I put my two sets of reads? (mit_1.fastq and mit_2.fastq)

Perhaps: extra=mit_1.fastq,mit_2.fastq?

Thank you in advance for you help :)

ADD REPLY
0
Entering edit mode

I am using this tool this week. May I ask you if there is a way to make it dump the unused reads in the mode=extend? I saw that there is a outd option for "Write discarded reads, if using junk-removal flags", but I'm not sure that's what I am actually looking forward. I'd like to retrieve a FASTA file with reads that have not been used to extend the template.

ADD REPLY
0
Entering edit mode

Hi, how long is your program supposed to take for giving results? I started it 2 weeks ago, and it is still running and the computer is frozen… I don't know if I should let it run or if I should press the reset button…

ADD REPLY
3
Entering edit mode
9.4 years ago

Interesting challenge, I just wrote a tool (https://github.com/lindenb/jvarkit/wiki/ExtendReferenceWithReads) that takes as input

  • an indexed fasta REFERENCE sequence
  • some indexed BAM

the (clipped) overlapping reads are used to extend the REF sequence in 5', 3' and in the contigs containing 'N'.

$  java   -jar dist/extendrefwithreads.jar \
     -R human_g1k_v37.fasta -f 0.3 \
     f1.bam f2.bam f3.bam 2> /dev/null |\
  cat -n | grep -E '(>|[atgc])' 

     1  >1
   167  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNncgattaccctaacgctcac
   168  cctaaccctcnccctntnccnncnncccnncttcttccgaTAACCCTAACCCTAACCCTA
  3791  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNatt
  3792  tatgcNctttntgctgtGATTCATGGCTGAAATCGTGTTTGACCAGCTATGTGTGTCTCT
  8691  NNNNNNNNNNNNNNNNNNNNNNNNctagGATCCTTGAAGCGCCCCCAAGGGCATCTTCTC
 64089  TGGTGAGGGAAATTAGAACCACGACAATTTGGGAACTTAGCTTCTGCCctgctccNNNNN
 66589  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNgagtAGCTGAGACTAC
ADD COMMENT
1
Entering edit mode
9.4 years ago
h.mon 35k

You may also try Mapsembler2.

ADD COMMENT

Login before adding your answer.

Traffic: 1779 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6