Hi everyone
Is there a fast way to do this filter?
I have a huge FASTA file (the sequences are short reads from an Illumina instrument). I also have a list of nucleotide sequences (not FASTA, just the sequences), and I want to remove from the big FASTA file all entries identical to those in the list.
My idea was simply to go through the FASTA file and, for every read, check it against all the sequences in the list. If the read matches one of the sequences then do nothing, otherwise print the read to a new file. I wrote this in Perl, but it takes ages!
The list is made up of nucleotide sequences, not IDs. It's something like this:
AACGACTACTTATCGATC
TCGGCGATATACGTAC
CCAGTTTCGGGGCTAT ....
Thanks!
Can you give a better example? You have a file with a lot of sequences, and a list of sequence IDs to remove from it. Is that correct?
I'm fairly sure this has been asked/answered before - check the "Related" box or search the archives.
Yes, except that the list is not made up of IDs. They are nucleotide sequences.
Do I understand correctly that you have a list of nucleotide sequences, and you want to go through a FASTA file and remove all entries that contain an exact, full-length hit to one of the nucleotide sequences in your list? Are the nucleotide sequences in your list by any chance all of the same length?
Have a look at these two questions:
Also, search the archives.
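If the matches really are exact and full-length, the slow part is most likely comparing every read against every list entry. A common alternative is to load the list into a hash-based set once and do a single lookup per read. Here is a minimal Python sketch of that idea, not a tested pipeline. The function names are made up for illustration, and it assumes the list has one sequence per line, that the list fits in memory, and that matching should ignore case (drop the `.upper()` calls if it should not).

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs; sequences split over lines are joined."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta(fasta_handle, list_handle, out_handle):
    """Write records whose sequence is NOT in the list; return (kept, removed)."""
    # Assumption: one sequence per line in the list, compared case-insensitively.
    unwanted = {line.strip().upper() for line in list_handle if line.strip()}
    kept = removed = 0
    for header, seq in read_fasta(fasta_handle):
        if seq.upper() in unwanted:  # one set lookup instead of a scan of the list
            removed += 1
        else:
            out_handle.write(f"{header}\n{seq}\n")
            kept += 1
    return kept, removed
```

You would call `filter_fasta` with three open file handles (the big FASTA file, the list, and an output file opened for writing). The FASTA file is streamed record by record, so only the list has to be held in memory. The same approach should carry over to Perl by using the list sequences as hash keys and testing each read with `exists`.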