Remove short sequences with length < 50% from a multi-fasta file
4
0
Entering edit mode
2.8 years ago
b.shrestha • 0

Hi All,

I need to remove sequences that are shorter than 50% of the entire alignment in a multi-fasta file. For example, I want to remove the "Seq3" sequence in the example below, as its length is less than half the length of other sequences. I have many multi-fasta files and I don't want to check them all manually. Any suggestions?

example_fasta

Thank you!

Alignment Fasta Sequence • 3.0k views
ADD COMMENT
3
Entering edit mode
2.8 years ago
Mensur Dlakic ★ 28k

You don't seem to have an alignment, but rather a collection of fasta sequences. If I am interpreting that correctly, seqtk will do the trick:

seqtk seq -L XX file.fas > trimmed.fas

XX in the command above is the cutoff length.

If this is actually a poorly formatted fasta alignment, this script should do the job:

import sys
from Bio import SeqIO

FastaFile = open(sys.argv[1], 'r')
FastaDroppedFile = open(sys.argv[2], 'w')
drop_cutoff = float(sys.argv[3])

if (drop_cutoff > 1) or (drop_cutoff < 0):
    print('\n Sequence drop cutoff must be in 0-1 range !\n')
    sys.exit(1)

for seqs in SeqIO.parse(FastaFile, 'fasta'):
    name = seqs.id
    seq = seqs.seq
    seqLen = len(seqs)
    gap_count = 0
    for z in range(seqLen):
        if seq[z]=='-':
            gap_count += 1
    if (gap_count/float(seqLen)) >= drop_cutoff:
        print(' %s was removed.' % name)
    else:
        SeqIO.write(seqs, FastaDroppedFile, 'fasta')

FastaFile.close()
FastaDroppedFile.close()

Save it as fasta_drop.py and then try:

python fasta_drop.py file.fas trimmed.fas 0.5

This will drop all the sequences that have gaps in >=50% of alignment positions.

ADD COMMENT
0
Entering edit mode

Thanks a lot for the suggestions!

Yes, it's not an alignment rather a collection of fasta sequences that I want to align but only after removing shorter sequences. Sorry about the confusion.

I am familiar with seqtk but for that I need to state cutoff length, which varies for different genes. Proportion instead of fixed cutoff length would be better but couldn't locate proportion in the seqtk manual.

I was able to resolve the issue with running your script. The multi-fasta needs to be aligned before running the script to drop the shorter sequences. Thank you for sharing the script. It was very helpful!

ADD REPLY
2
Entering edit mode
17 months ago
DareDevil ★ 4.3k

You can also try:

from Bio import SeqIO

def filter_sequences(input_file, output_file):
    # Open the input file in read mode and output file in write mode
    with open(input_file, "r") as handle, open(output_file, "w") as output_handle:
        records = list(SeqIO.parse(handle, "fasta"))

        # Calculate the minimum sequence length required
        min_length = 0.5 * max(len(record.seq) for record in records)

        # Filter and write sequences longer than the minimum length to the output file
        for record in records:
            if len(record.seq) >= min_length:
                SeqIO.write(record, output_handle, "fasta")

# Provide the path to your input and output files
input_file = "input.fasta"
output_file = "output.fasta"

# Call the function to filter sequences
filter_sequences(input_file, output_file)
ADD COMMENT
0
Entering edit mode
17 months ago
Hugo ▴ 380

SEDA (https://www.sing-group.org/seda/) includes a "Filtering" operation (https://www.sing-group.org/seda/manual/operations.html#id4) that has a "Remove by size difference" option that allows specifying the desired percentage and the reference secuence to be used in the comparison.

ADD COMMENT
0
Entering edit mode
17 months ago

R version:

library(Biostrings) 
## read your fasta
fa <- readDNAStringSet('your.fasta')
## set removal to 50% of the full alignment length in this case ill use 50
removal.cutoff<- 50 
## remove sequences below 50 nt in length
fa <- fa[width(fa) >= removal.cutoff]
## write a new file with your subet
writeXStringSet(fa,'your_new.fasta',format='fasta')
ADD COMMENT
0
Entering edit mode

Hi,

I am trying to build a maximum likelihood phylogenetic tree in R. I have never done this before and need a step-by-step guide on how to do this in R.

I have my files in fasta format in a folder but don't know how to move forward.

Any suggestions will be appreciated especially the packages that I need to install and how to read in multiple fasta files into R to build my tree to obtain a newick output file.

Thanks

ADD REPLY
0
Entering edit mode

Try FastTree using this Docker image: https://hub.docker.com/r/pegi3s/fasttree/

ADD REPLY

Login before adding your answer.

Traffic: 2077 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6