I've to align my reads against a very little sequence (~20kb) . And I wonder how the aligner manage the extremity of this sequence ? Are the reads aligned with the same accuracy at the extremity of the sequence as in the middle ?
As a suggestion, why don't you assemble all your reads (maybe as ESTs if this is the case), and then BLAST the contigs against your ~20kb sequence? I'm not sure if this makes sense for your biological question, but it makes much more sense from the algorithmical one.
Unless you specifically know of a reason to worry about those first 100 bases or so, dealing with it is generally not worth your time. If you decide to do remapping of sub-reads for any reason, be aware of the possibility that your 25-bp fragments may erroneously map elsewhere in the genome. (The smaller the sequences, the more locations it will map.)
You can get the ID of unmapped reads from your sam/bam file and grep them from your main fastq (with -A3 grep option) or fasta file (with -A1) or just take the sequence along with read ID's of unmapped reads from sam/bam file and make a new fasta file (may be using awk with sequence storing in next line). For fasta file I used -
perl -nwle '
$i=0;
for my $add (<>=~/.{25}/g) {
printf "%s%s\n%s\n", $_, $i++ ? "#$i":"", $add;
}' inputfile.fasta
This will cut your reads in fasta file and have "#" with every part -eg -
I'm pretty sure there is no tool to do this, but it should be fairly straight-forward to implement if you know a bit of scripting. As a first step you could align your reads with e.g. BWA. Then you extract the reads that didn't map based on the corresponding samtools flag. Those you then chop up and realign. That's where the trouble starts though: Do you only want the first 25bp or all until read-length? Do you want overlapping reads? And most importantly, why do you want to do that? Are you planning to get the most out of your reads? If so, incrementally shorten reads and trying to realign might be a better idea, but it still sounds like overkill.
What's your reason for doing this? I can think of many cases where this is going to be a bad idea, and produce alignments that are just wrong.
I've to align my reads against a very little sequence (~20kb) . And I wonder how the aligner manage the extremity of this sequence ? Are the reads aligned with the same accuracy at the extremity of the sequence as in the middle ?
As a suggestion, why don't you assemble all your reads (maybe as ESTs if this is the case), and then BLAST the contigs against your ~20kb sequence? I'm not sure if this makes sense for your biological question, but it makes much more sense from the algorithmical one.
I work on viral genomes which are much smaller (12kb) and have no trouble whatsoever with aligning short reads against it.
[Deleted comment.]
Unless you specifically know of a reason to worry about those first 100 bases or so, dealing with it is generally not worth your time. If you decide to do remapping of sub-reads for any reason, be aware of the possibility that your 25-bp fragments may erroneously map elsewhere in the genome. (The smaller the sequences, the more locations it will map.)