How to extract the last 1000 nt from a group of sequences in a FASTA file?
3
0
Entering edit mode
5.8 years ago
Andrei ▴ 10

Hi everyone!

I would like to know how I can extract the 1000 last nucleotides from sequence in a FASTA file and their headers using linux command line.

There are about 2000 sequence in this FASTA file and need to extract the last 1000 nucleotides from each sequence keeping their original headers.

I am a beginner at bioinformatics and I am still learning how to manipulate FASTA files, and I'm stuck with this problem and I can not solve it.

Thanks!

sequence • 5.0k views
ADD COMMENT
2
Entering edit mode

seqkit subseq supports this, if you want a fast solution.

seqkit subseq -r -1000:-1 seqs.fa > result.fa

If you want to learn programming, write some Python scripts using Biopython.

ADD REPLY
1
Entering edit mode

See this post, which uses awk to do some similar: https://stackoverflow.com/questions/30264809/trim-first-n-bases-in-multi-fasta-file-with-awk-and-print-with-max-width-format

The command in the last code block should be somewhat easy to modify. Instead of using the left_trim argument to trim the first 1000 bases as in the example, you could modify the script to keep the last 1000 bases (something like replacing start=left_trim+1; for strat=length(rec) - left_trim + 1). Try it out and test on some examples.

ADD REPLY
1
Entering edit mode

I could do it! I'm very thankful for your help.

ADD REPLY
0
Entering edit mode

Could you please paste here the explanation that you did about sed syntax?

ADD REPLY
1
Entering edit mode

I removed it because it didn't handle sequences that spanned multiple lines, which is common for FASTA files. sed works line by line, so awk was more appropriate in that use case. But the command was cat file.txt | sed 's|^[^>].*\(.{1000}\)|\1|g, if you're intrested. Briefly, sed 's|PATTERN|REPLACEMENT|g' will match lines that have PATTERN and substitute it with REPLACEMENT. The pattern is a regular expression ^[^>].*\(.{1000}\), which means from the a line that begins (^) with anything except a > (^[^>]) followed by anything of unlimited length .* until you can't match anything but 1000 characters of anything (.{1000}, and see greedy matching which applies here.). The last part is wrapped in parenthesis (\(.{1000}\), or (GROUP)) where GROUP is a part of the regex you want to keep for later. The REPLACEMENT part just says "replace with group 1 " (\1). The backslashes before { } ( ) are there to say you don't want to match those characters, instead they are part of the regular expression syntax. I'd highly recommend learning more about sed and regexes for bioinformatics, they're really useful and widely used.

ADD REPLY
1
Entering edit mode
5.8 years ago
5heikki 11k

This requires Biopython. Save it as extractEnds.py and chmod +x extractEnds.py to make it executable and then just run it

#!/usr/bin/python

import sys
from sys import argv
from Bio import SeqIO

if (len(sys.argv) < 2):
    print("Usage: extractEnds.py file.fna")
    sys.exit()
else:
    FNAME = argv[1]
    IFILE = open(FNAME, "rU")

for record in SeqIO.parse(IFILE, "fasta"):
    print(">%s\n%s" % (record.description, record.seq[-1000:]))

IFILE.close()

I'm quite a Python noob so if there's a way to make this better or there are mistakes, please let me know..

ADD COMMENT
2
Entering edit mode

No need for open(FNAME, "rU") or IFILE.close(). SeqIO handles all opening and closing itself.

You can modify this simply to:

for record in SeqIO.parse(sys.argv[2], "fasta"):
ADD REPLY
1
Entering edit mode

Thanks. What's the difference between (argv[1], ..) and (sys.argv[1], ..)?

edit. nevermind, found it, it just depends on what gets imported in the start..

edit2. For years I sticked with epic GNU coreutil oneliners, but I'm starting to see the advantages of biopython. Especially if you write scripts that someone else inherits some day.. they will be much happier if you used biopython

ADD REPLY
3
Entering edit mode

If you want it to be robust, or you want the code to be used by anyone but yourself (especially if they're a novice), a dedicated parser like BioPython which covers lots and lots of edge cases is worth its weight in gold.

You could write that code in a reasonably concise one-liner too (this one also has the benefit of taking a second CLI arg for any length of sequence):

python3 -c "import sys; from Bio import SeqIO; [print(f'>{r.id}\n{r.seq[-int(sys.argv[2]):]}') for r in SeqIO.parse(sys.argv[1],'fasta')] ;" seqs.fa 1000

(Python 3 only because of a print-inside-list-comprehension and use of fstrings.
ADD REPLY
1
Entering edit mode
5.8 years ago

linearize and use a second awk to extract the last 100 characters.

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'  input.fasta |\
awk -F '\t' '{x=100;L=length($2);printf("%s\n%s\n",$1,(L<=x?$2:substr($2,1+L-x,x)));}'
ADD COMMENT

Login before adding your answer.

Traffic: 2168 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6