I loop on the sequences and save their kmers in a list and if any of the reverse-complement of kmers exist in any of the sequences then that sequence should be rejected. With that being said- there is a very low chance for the bottom sequences to get past the filter since the number of kmers increases exponentially. With reversing the order I want to apply the filter on the bottom sequences first.
Another way this can be done is to read the FASTA file from its end to its start.
Reading the file backwards avoids having to use hash tables or sorting, which can be expensive on very large inputs. This approach will be fast and use very little memory:
#!/usr/bin/env python
import sys
import os
ifn = sys.argv[1] if len(sys.argv) == 2 else None
def reversed_lines(file):
"Generate the lines of file in reverse order."
part = ''
for block in reversed_blocks(file):
for c in reversed(block):
if c == '\n' and part:
yield part[::-1]
part = ''
part += c
if part: yield part[::-1]
def reversed_blocks(file, blocksize=4096):
"Generate blocks of file's contents in reverse order."
file.seek(0, os.SEEK_END)
here = file.tell()
while 0 < here:
delta = min(blocksize, here)
here -= delta
file.seek(here, os.SEEK_SET)
yield file.read(delta)
def main():
ifh = sys.stdin if not ifn else open(ifn, "r")
lines = reversed_lines(ifh)
sequence = ""
header = ""
for line in lines:
if line.startswith('>'):
header = line.rstrip()
sys.stdout.write('{}\n{}\n'.format(header, sequence))
sequence = ""
else:
sequence = line.rstrip() + sequence
ifh.close()
if __name__ == "__main__":
main()
Usage:
$ ./reverse_fasta_records.py < in.fa > out.fa
Or:
$ ./reverse_fasta_records.py in.fa > out.fa
Edit: Modified this script to support the input file either coming from standard input or a regular file.
Those examples show uses of > within the header line, but record separators appear to either use foo.startswith('>') or the regex ^> to look for the delimiter at the start of a line, no?
This is not
reverse the order of lines in a fasta file
but ratherreverse the order of **records** in a fasta file
.Can you tell us why you want to do this?
I loop on the sequences and save their kmers in a list and if any of the reverse-complement of kmers exist in any of the sequences then that sequence should be rejected. With that being said- there is a very low chance for the bottom sequences to get past the filter since the number of kmers increases exponentially. With reversing the order I want to apply the filter on the bottom sequences first.