Filtering a fasta file based on blast results file
3
0
Entering edit mode
2.2 years ago
Bruna • 0

hi! I'm fairly new at all this and have been trying to filter a fasta file based on blast results file and I'm having a lot of trouble doing so, as I don't really understand how to code this. i have file 1 in this format:

>SampleName11.16891;size=8308
GATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGC
GCCCCTTGGTATTCCGGGGGGCATGCCTGTTCGAGCGTCATTACAACCCTCAAGCTCTGCTTGGAATTGGGCACCGTCCT
CACTGCGGACGCGCCTCAAAGACCTCGGCGGTGGCTGTTCAGCCCTCAAGCGTAGTAGAATACACCTCGCTTTGGAGCGG
TTGGCGTCGCCCGCCGGACGAACCTTCTGAACTTTTCTCAAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTA
AGCATATCAATAAGCGGAGGAAAAGAAACCAACAGGGATTGCCTTAGTAACGGCGAGTGAAGCGGCAACAG

>SampleName10.739;size=7844
GATGAAGAACGTAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGC
GCCCCTTGGTATTCCGGGGGGCATGCCTGTTCGAGCGTCATTACAACCCTCAAGCTCTGCTTGGAATTGGGCACCGTCCT
CACTGCGGACGCGCCTCAAAGACCTCGGCGGTGGCTGTTCAGCCCTCAAGCGTAGTAGAATACACCTCGCTTTGGAGCGG
TTGGCGTCGCCCGCCGGACGAACCTTCTGAACTTTTCTCAAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTA
AGCATATCAATAAGCGGAGGAAAAGAAACCAACAGGGATTGCCTTAGTAACGGCGAGTGAAGCGGCAACAG
...

And then file 2 in this:

SampleName11.4501;size=7446 Fungi_sp|MT237019|SH3568601.08FU|reps_singleton|k__Fungi;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__Fungi_sp    80.7
SampleName11.1591;size=7622 Lasiodiplodia_citricola|GU945354|SH2131013.08FU|refs|k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Botryosphaeriales;f__Botryosphaeriaceae;g__Lasiodiplodia;s__Lasiodiplodia_citricola   100.0
SampleName11.1591;size=7622 Fungi_sp|MT237019|SH3568601.08FU|reps_singleton|k__Fungi;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__Fungi_sp    81.0
SampleName10.611;size=6979  Lasiodiplodia_citricola|GU945354|SH2131013.08FU|refs|

And I would like to filter file 1 to only have the samples (sequence and name) that appear on file 2, do any of you could give me some lights on how to do that? Thank you

fasta biopython • 1.4k views
ADD COMMENT
4
Entering edit mode
2.2 years ago

You can first open file 2 and read sequence ids into a set. Then, open file 1 (with FASTA sequences) and loop over the records to check if the sequence id is present in the set. If it is present, then write the record to the output file.

from Bio import SeqIO

IN_FILE1 = 'file1.fasta'
IN_FILE2 = 'file2.txt'
OUT_FILE = 'out.fasta'

seq_ids = set()
with open(IN_FILE2) as fh:
    for line in fh:
        seq_id = line.split()[0]
        seq_ids.add(seq_id)

with open(OUT_FILE, 'w') as oh:
    for seq_record in SeqIO.parse(IN_FILE1, 'fasta')
        if seq_record.id in seq_ids:
            oh.write(seq_record.format('fasta'))
ADD COMMENT
0
Entering edit mode

thank you very much, I'm gonna try that!

ADD REPLY
0
Entering edit mode

Update: That worked very well. Thank you so much :)

ADD REPLY
1
Entering edit mode
2.2 years ago
iraun 6.2k

Hi! filterbyname.sh script from BBMap toolkit could do the trick.

filterbyname.sh -Xmx10g in=input.fa out=output.fa names=ids.txt include=t substring=t

Before running the command you should probably clean a bit your file 2, and generate a file (in the previous command called ids.txt) containing just the sequence ids:

cut -d';' -f1 file2.txt > ids.txt
ADD COMMENT
0
Entering edit mode

thank you! I'll start by cleaning file two so I only have the ids

ADD REPLY
0
Entering edit mode
2.2 years ago
acvill ▴ 350

Here's a solution using awk and grep:

while read hit; do 
  awk 'BEGIN {RS=">";FS="\n";OFS=""} NR>1 {print ">"$1; $1=""; print}' test.fasta | \
  grep -A1 ">${hit}$"
done < <(awk -F'\t' '{print $1}' blast.txt | sort | uniq)

Assuming blast.txt is tab-delimited, this first extracts the unique sequence names from the first field of the blast results. Then, the hits are processed in a while read loop, extracting deflines and sequences from the unwrapped fasta file test.fasta. This will be inefficient for larger fasta files, since the fasta file is unwrapped again with every iteration of the loop, though I like this solution because you don't have to preprocess the inputs.

ADD COMMENT

Login before adding your answer.

Traffic: 1877 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6