Hi,
I'm really new to bioinformatics and I have a large file with sequences. I only need specific sequences for my analysis. How can I filter them by accession number?
thanks!
If each sequence sits on a single line, you can use this command:
while IFS= read -r line ; do grep -A 1 -- "${line}" inputfile.fasta >> outputfile.fasta ; done < IDs.txt
This command only works when no accession number is a substring of another (for example, AB123 would also match AB1234). Also make sure IDs.txt has no empty lines, because an empty pattern matches every line. Each line in IDs.txt should hold exactly one accession number, with no extra spaces.
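If you are not sure IDs.txt is clean, something like the following could tidy it first. This is only a sketch: clean_ids is a made-up helper name, and it assumes the only problems are Windows line endings, stray whitespace, blank lines and duplicate IDs.

```shell
# Hypothetical helper: normalise an ID list so each line holds exactly one accession.
clean_ids() {
    # $1 = raw ID list; the cleaned list is written to stdout
    tr -d '\r' < "$1" |                                       # strip Windows carriage returns
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' |   # trim leading/trailing whitespace
        awk 'NF && !seen[$0]++'                               # drop empty lines and duplicates
}
```

For example, clean_ids IDs.txt > IDs.clean.txt, and then read IDs.clean.txt in the loop instead of IDs.txt.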
If your sequences span multiple lines, first convert the file to a FASTA file with one line per sequence and then use the above command:
awk '/^>/ {printf("%s%s\n", (NR>1 ? "\n" : ""), $0); next} {printf("%s", $0)} END {printf("\n")}' file.fasta > inputfile.fasta
When IDs overlap, you can anchor the pattern: add --perl-regex (short form -P) and end the pattern with $, \t, or whichever delimiter follows the accession in your headers:
grep -A 1 --perl-regex "${line}$"
grep -A 1 prints each line that matches the pattern, plus the one line after it.
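As an alternative to the loop, a single awk pass can match IDs exactly and cope with multi-line sequences without converting the file first. This is a rough sketch, not a tested pipeline: filter_fasta is a made-up name, and it assumes the accession is the first whitespace-delimited token after ">" in each header and that the ID list is not empty.

```shell
# Hypothetical helper: print the FASTA records whose accession appears in an ID list.
# Assumption: the accession is the first whitespace-delimited token after ">".
filter_fasta() {
    # $1 = ID list (one accession per line), $2 = FASTA file
    awk '
        # First file: remember every non-empty ID.
        NR == FNR { if (NF) wanted[$1] = 1; next }
        # Header line: strip ">" and check for an exact match.
        /^>/ { id = substr($1, 2); keep = (id in wanted) }
        # Print the header and sequence lines of records that matched.
        keep
    ' "$1" "$2"
}
```

Usage would look like filter_fasta IDs.txt file.fasta > outputfile.fasta. Because the whole ID is compared, AB123 does not pick up AB1234. If your headers look different (for example pipe-separated fields), the substr($1, 2) part needs adjusting.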
Use this solution: C: How do I extract Fasta Sequences based on a list of IDs?
In this line, which file is meant to be "file.fasta"? The output file?
file.fasta is your input file. Yes, I called the output file inputfile.fasta, because it will be used as input in the other command.