Hi,
So, I managed to extend my elements by a long manual method (Blast and assembly), I do not have to rush anymore, but I still would like to ba able to get an assembly on a sequence from reads. So, I give the results I obtain:
Initial sequence:
>2-1-u Leymus arenarius predicted callose synthase
AGGAGCTGTACATCGAGAAGCAAAGGTTAAGACACCTCCGGCAGTGGGTCTCATACAGGGGACAGACGCTATCCCGCACGGTGAGGGGGATGATGTATTACTATGAAGCTCTCAAGATGCTGACTTTTCTTGAATCCGCATCTGAACGTCGATTGATTACTGGATCGAGGAAGATCTCTTCAAGACCGGATGGGGGTGGGGCATCTTCTTCACGTGCACCACCACTGACACTGAGCAGAGCAAGTAGTGGGGTTGGTGTGAGCTCCCTGTTTAAAGGTAGCGAGTATCGGACTGCCCTCATGAAATACACTTATGTGGTTGCATGCCAACTTTATGGTCAGCAGAAAGCTAGAAATGACCCCCATGCTGTTGAGATATTGGAGCTAATGAAGAAGTATGAGGCACTGCGCGTCGCTTATGTTGATGAAAAGCACTCCAGCGGTGGTGAAACAGAGTACTTCTCTGTCCTTGTGAAATACGACCAGCTGTTTCAGCGGGAGGTCGAGATCTACCGGGTTAAGCTGCCTGGACCGTTGAAGCTTGGTGAAGGCAAGCCAGAGAACCAGAACCATGCACTCATCTTCACGAGGGGTGATGCCGTTCAGACCATCGACATGAACCAAGACAACTACTTTGAAGAAGCTCTTAAGATGAGAAACCTGCTTGAAGAGTTCAATCGCTATAGCAAACCAAATATCCCTGGAGTTCGGGAGCATGTGTTCACCGGTTCTGTGTCTTCCCTAGCTTGGTTTATGTCTGCCCAAGAAACCTGTTTCGTCACTCTGAGCCAGCGCGTTCTTGCTAATCTACTCAAGGTCAGAATGCATTATGGCCACCCGGATGTGTTTGACCGCCTTTGGTTCTTGGCCGAGGT
result of: run_mapsembler2_pipeline.sh -s input.fa -r ~/Desktop/output.fasta -t 2
>2-1-u
AGGAGCTGTACATCGAGAAGCAAAGGTTAAGACACCTCCGGCAGTGGGTCTCATACAGGGGACAGACGCTATCCCGCACGGTGAGGGGGATGATGTATTACTATGAAGCTCTCAAGATGCTGACTTTTCTTGAATCCGCATCTGAACGTCGATTGATTACTGGATCGAGGAAGATCTCTTCAAGACCGGATGGGGGTGGGGCATCTTCTTCACGTGCACCACCACTGACACTGAGCAGAGCAAGTAGTGGGGTTGGTGTGAGCTCCCTGTTTAAAGGTAGCGAGTATCGGACTGCCCTCATGAAATACACTTATGTGGTTGCATGCCAACTTTATGGTCAGCAGAAAGCTAGAAATGACCCCCATGCTGTTGAGATATTGGAGCTAATGAAGAAGTATGAGGCACTGCGCGTCGCTTATGTTGATGAAAAGCACTCCAGCGGTGGTGAAACAGAGTACTTCTCTGTCCTTGTGAAATACGACCAGCTGTTTCAGCGGGAGGTCGAGATCTACCGGGTTAAGCTGCCTGGACCGTTGAAGCTTGGTGAAGGCAAGCCAGAGAACCAGAACCATGCACTCATCTTCACGAGGGGTGATGCCGTTCAGACCATCGACATGAACCAAGACAACTACTTTGAAGAAGCTCTTAAGATGAGAAACCTGCTTGAAGAGTTCAATCGCTATAGCAAACCAAATATCCCTGGAGTTCGGGAGCATGTGTTCACCGGTTCTGTGTCTTCCCTAGCTTGGTTTATGTCTGCCCAAGAAACCTGTTTCGTCACTCTGAGCCAGCGCGTTCTTGCTAATCTACTCAAGGTCAGAATGCATTATGGCCACCCGGATGTGTTTGACCGCCTTTGGTTCTTGGCCGAGGT
and results of tadpole.sh in=input.fra extra=~/Desktop/part-of-sequences/Leymus-all.fasta out=extended-seq.fa extendleft=500 extendright=500 ibb=f mode=extend
>2-1-u Leymus arenarius predicted callose synthase
AGGAGCTGTACATCGAGAAGCAAAGGTTAAGACACCTCCGGCAGTGGGTCTCATACAGGGGACAGACGCT
ATCCCGCACGGTGAGGGGGATGATGTATTACTATGAAGCTCTCAAGATGCTGACTTTTCTTGAATCCGCA
TCTGAACGTCGATTGATTACTGGATCGAGGAAGATCTCTTCAAGACCGGATGGGGGTGGGGCATCTTCTT
CACGTGCACCACCACTGACACTGAGCAGAGCAAGTAGTGGGGTTGGTGTGAGCTCCCTGTTTAAAGGTAG
CGAGTATCGGACTGCCCTCATGAAATACACTTATGTGGTTGCATGCCAACTTTATGGTCAGCAGAAAGCT
AGAAATGACCCCCATGCTGTTGAGATATTGGAGCTAATGAAGAAGTATGAGGCACTGCGCGTCGCTTATG
TTGATGAAAAGCACTCCAGCGGTGGTGAAACAGAGTACTTCTCTGTCCTTGTGAAATACGACCAGCTGTT
TCAGCGGGAGGTCGAGATCTACCGGGTTAAGCTGCCTGGACCGTTGAAGCTTGGTGAAGGCAAGCCAGAG
AACCAGAACCATGCACTCATCTTCACGAGGGGTGATGCCGTTCAGACCATCGACATGAACCAAGACAACT
ACTTTGAAGAAGCTCTTAAGATGAGAAACCTGCTTGAAGAGTTCAATCGCTATAGCAAACCAAATATCCC
TGGAGTTCGGGAGCATGTGTTCACCGGTTCTGTGTCTTCCCTAGCTTGGTTTATGTCTGCCCAAGAAACC
TGTTTCGTCACTCTGAGCCAGCGCGTTCTTGCTAATCTACTCAAGGTCAGAATGCATTATGGCCACCCGG
ATGTGTTTGACCGCCTTTGGTTCTTGGCCGAGGT
I would like to not that mapsembler, instead of proposing additional ends, proposed alternative ends for the provided sequences..
So the easiest:
Usage: extendrefwithreads [options] Files
So, after instalation, I ran "extendrefwithreads my file" "commande not found", then a "locate extendrefwithreads" gives nothing back…
Then… Mapsembler2, can't tell from home what was the command I put in, for I was university…
I started the command at 10 in the morning… it finished at 16… and just added in the best case 10 pb, with a lot of alternative ends…
Sorry if I'm on the nerves… I came to this forum reading a lot from it after many search that ended here, with contents that was related to my work… but none actually answered! The bioinformaticist of the team I am currently in was supposed to give me primers from the data, but he said "I'won't search on a local scale for I am planing to make a whole assembly of the genome" … he was never fine with an assembly… but did not wanted to "waste time" in just looking for what I needed to… id est only 1000 FBP around a dozen of sequences! He left for holidays, I just have 6 weeks left… and I'm wasting my time in order to obtain results that he could have computed in one night over! Sorry if I may look angry now a day, for I'm a biologist asked to make programming a computer guy hasn't succeed in… Sorry for being sensitive when I ask to someone of my faculty that is paid for computer issues I get as an answer "I don't know, but you should ask to the guy in the 3rd floor that is used to help other PhD student with bioinformatic issues, he may find a solution" … when the 3rd floor guy is actually myself, unpaid PhD student, suffering from a "STD" (Syndrome of Thesis Decay). Sorry… I don't even ask for a GUI… I just ask for a "MagicCommand -input MyData -data FuckingHugeDataBase" that answers me "SuitableResults.fasta"
When I had class of computer science in bachelor degree, I was told "If you feel like doing something pointless, repetitive and boring, someone had already program something to do it automatically before" If you keep doing it, you suffer from the "Woody woodpicker syndrom"! I look for something actually basic… why is it hidden? Why is a function that anyone may need at some point, about impossible to obtain? I know NCBI blast since my first year master degree in 2005, I was advised to use the NCBI workbench in this forum… Why this FS does not have a huge "Blast" bouton? Why when I google Gblast and blast, I just get the description, tellin' me the Gblast do do blast, and nothing about how to get it? Why when I asked on researchgate "what software I could use in order to color part of a 3D structure of a protein", I was told any software, with a list of program I tried in vain, that, indeed, had the function… really hidden somewhere, accecible with a shortcut you had to know first?
When I, biologist, make a OS code, I put a clear description, proof here: https://github.com/giby/Linarius … I expect you, bioinformaticist, to do the same and provide clear instructions, and minimal working exemple on your work… else, no one will ever use your work, but, that's not of my buisiness!
I understand you are stressed, and I sympathize with you for your problems, so I will start with a complete walk-through with tadpole, from downloading to running it:
However, I have to state a couple of points:
1) you are assuming the local bioinformatician is not helping out of spite or laziness, but he may be over-worked, or with some late work his boss is demanding results, and he is stressed like you. You don’t know how many people reach out to him with all kinds of new problems, everybody expecting him to solve quickly because “it is easy” for him.
2) you still didn’t tell exactly the commands you tried.
3) from your other thread ( How to start with blast and a local database ), it seems you have limited computing resources. How much memory, CPUs and disk you have available?
4) also from the other thread, I am not even certain you have an assembled genome, instead I think you have raw sequencing reads. Could you clarify?
Hi, thanks for your answer, on this wonderful sunny saturday, from my lab, here is the command:
1) I don't say he is lazy, I know he is hard working, and competant. I just say he, like many informaticists, want to do things in the most elegant way, that is time and resource costly, and sometime error proof, whereas just making things the crapy way was actually way enough.
295 ./run_mapsembler2_pipeline.sh -s ~/Desktop/Leymus-genome-test/Leymus-seq-clean.fasta -r ~/Desktop/part-of-sequences/Leymus-all.fasta -t 1
3) memory: 15 GiB, 4 processors 3,20GHz 75 GiB availlable space
4) I never said I had an " assembled genome" I have a "complete genome with frangments 300-500 bp" then, it is indeed a lot of reads, enough to cover everything at least twice and that my bioinformatician tried, in vain to assemble, whereas... I only needed a simple thing...
This is a terminology misunderstanding then, I don't know what you mean by "complete genome with frangments 300-500 bp". What you mean is you have sequencing reads? Which technology, seems like Illumina? Strictly speaking, there are no "complete genomes", either when we refer to assembled genomes (always incomplete, even when of very high quality), or sequencing reads.
I woyld try to assemble with MEGAHIT, create a blast db with the "final.contigs.fa" file, and then blast the sequences of interest against this db. MEGAHIT is fast and may be able to assemble a draft genome of sufficient quality in one or two days - but 75Gib of disk space may be too few.
yes, there is 16 GO of reads, that covers normally the whole genome, I think it is illumina, but not sure...
Assembly is an issue in itself, I tried with MECAT, and it said for such an assembly, it would need one TB of RAM to proceed... basically, I want:
let's say my sequences I want to extend are: MFS.fa and the 16GO of reads FHDB.fa
0) It takes the bases of a sequence of MFS.fa =seqE=seqC
1) secE between 10-110 position, blast it in FHDB.fa
if no answer then return secE and secC && STOP
if single answer then Merge answer with secE and secC then GO TO 3
If number of answer >1 then GO TO 2
2) compare answer 1 and 2
If match merge answer1 with secE and answer2 with secC
if mismatch then return secE and secC && STOP
3) compare secE and secC
if mismatch then return secE and secC && STOP
If match GOTO1
I'm not a good coder, I don't know how to program it, and of course it should do it in both direction