hello humans,
I'm trying to reconstruct a sequence based on a PoolSeq file of a population (in fasta format) and a conserved area. I want to search for matches with this sequence and then build up the neighboring area starting from that conserved sequence.
So I need a command to search for a match and print whatever comes next to it in that read. I need to print all reads, containing that sequence to BLAST them afterwards and find out if its on the gene of my interest. Bit by bit I want to reconstruct that gene (because it apparently differs from the NCBI version).
Can I do this with a Bash command and search a fasta file for a sequence segment and to print the neighboring region of the match in every read?
Input:
20~ bp Sequence
Output:
all reads containing that sequence segment with the neighboring region in that read
Dear Human,
reviewing your post history, you still seem to revolve around the same problem for a few weeks. I am sorry, we couldn't help you solve your problem despite the responses you have received so far. Your problem seems to be reconstructing a sex-specific gene from a pool of sequences (PoolSeq of unknown depth and technology specs, assuming the individuals are barcoded and their sex is known) from the genome or transcriptome of an unspecified species and sample. Your hypothesis is that the sex-specific gene or isoform in that unspecified species is different from the reference sequence and annotation and that the sex difference is detectable from your data. You are further assuming (based on speculation that would require validation first) that a 20bp conserved nucleotide sequence is sufficient and specific genome-wide for detecting the gene in question. Now you are looking for a method to support that hypothesis (even though it has limited if any support so far, at least none was presented).
Please allow me to be frank here: the reason why you haven't succeeded so far is in part because you keep asking the question in the wrong way. First, you have never disclosed all the details in a single post, and secondly and very importantly, you are limiting yourself and us by already proposing and trying to restrict tools and resources. So, try to forget for a moment about BLAST and a simple bash or perl script, because possibly the best solution is nothing like that. Once you are ready to accept that, we can work on the optimal solution to your problem which is possibly very different.
Best
Another human
This is probably not really a task for
bash
. It is theoretically possible, however it will be very clumsy code, even if you're looking for exact substring matches which it sounds like you aren't.I would suggest that this is probably better achieved in
biopython
.I will totally try that! What code approch would you suggest to fullfill this task. If you had to do it, how would you?
I'm not sure I yet fully understand what you're trying to achieve so don't think I can offer an approach, but it sounds like you need to do something like:
I am not so sure, I think this whole alignment thought is an XY problem. I think op should rather look into standard pipelines for assembly and variant calling first before trying to build something non-standard.
After assembly, contigs can be mapped to the reference and reads to contigs to check for eventual differences.
After all, what was proposed by op sounds like a a simple seed based assembly approach, but there is not enough evidence that the 20bp sequence is sufficient nor specific enough to assemble the suspected region of variation. On the other hand the 20bp sequence seems to be important because op keeps bringing it up, so maybe there is prior evidence, e.g. from published variants, but then this needs to be explained.
I think the crux is to try to stick to established methods and protocols first, in particular if you are a beginner.