Hi everyone!
I would like to know how I can extract the 1000 last nucleotides from sequence in a FASTA file and their headers using linux command line.
There are about 2000 sequence in this FASTA file and need to extract the last 1000 nucleotides from each sequence keeping their original headers.
I am a beginner at bioinformatics and I am still learning how to manipulate FASTA files, and I'm stuck with this problem and I can not solve it.
Thanks!
seqkit subseq supports this, if you want a fast solution.
If you want to learn programming, write some Python scripts using Biopython.
See this post, which uses
awk
to do some similar: https://stackoverflow.com/questions/30264809/trim-first-n-bases-in-multi-fasta-file-with-awk-and-print-with-max-width-formatThe command in the last code block should be somewhat easy to modify. Instead of using the left_trim argument to trim the first 1000 bases as in the example, you could modify the script to keep the last 1000 bases (something like replacing
start=left_trim+1;
forstrat=length(rec) - left_trim + 1
). Try it out and test on some examples.I could do it! I'm very thankful for your help.
Could you please paste here the explanation that you did about sed syntax?
I removed it because it didn't handle sequences that spanned multiple lines, which is common for FASTA files.
sed
works line by line, so awk was more appropriate in that use case. But the command wascat file.txt | sed 's|^[^>].*\(.{1000}\)|\1|g
, if you're intrested. Briefly,sed 's|PATTERN|REPLACEMENT|g'
will match lines that havePATTERN
and substitute it withREPLACEMENT
. The pattern is a regular expression^[^>].*\(.{1000}\)
, which means from the a line that begins (^
) with anything except a>
(^[^>]
) followed by anything of unlimited length.*
until you can't match anything but 1000 characters of anything (.{1000}
, and see greedy matching which applies here.). The last part is wrapped in parenthesis (\(.{1000}\)
, or(GROUP)
) whereGROUP
is a part of the regex you want to keep for later. TheREPLACEMENT
part just says "replace with group 1 " (\1
). The backslashes before{ } ( )
are there to say you don't want to match those characters, instead they are part of the regular expression syntax. I'd highly recommend learning more aboutsed
and regexes for bioinformatics, they're really useful and widely used.