How to run different fasta files in a same process with its output respective
0
0
Entering edit mode
3.8 years ago
diego1530 ▴ 80

Dear Biostars,

I have this script based on bash to count the ambiguous nucleotides

input="arabidopsis.fasta"

output="ara.fasta"

while IFS= read line
do
echo $line | grep -v '>' | grep -o "[WSKMYRVHDBN]" | sort | uniq -c
echo $line | grep '>'
done < $input1 >> $output

However, as you can see, my script only read one input and in the same folder I have two more multifasta files. My desire is that all three files read at the same time, run the same process and as a result are three different outputs that correspond to each input. I would greatly appreciate any help in modifying my script to achieve my goal

bash sequence linux • 941 views
ADD COMMENT
0
Entering edit mode

This is because you have supplied only one input file in your code. Try Something like this in the directory where fasta files are located:

find . -type f -name "*.fa" | while read line; do  grep \> $line > ${line%.fa}.counts.txt ; grep -v \> $line | grep -o "[WSKMYRVHDBN]" | sort | uniq -c>>${line%.fa}.counts.txt; done

This would create new file for each fasta file and new extension ".counts.txt". (for eg. test.fa would generate test.counts.txt". Please format the output.

Suggestions for this loop:

  1. catch the input files in array, then loop over the array or list the files and loop over the list
  2. Do not hard code input or output unless necessary.

There is another way with seqkit, but it needs a little home work:

$ seqkit fx2tab -H -nl -B W -B S -B K -B M -B Y -B R -B V -B H -B D -B B -B N *.fa 

#name   length  W   S   K   M   Y   R   V   H   D   B   N
seq10   4   0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
seq20   4   25.00   0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
seq1    4   0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
seq2    8   12.50   12.50   0.00    0.00    12.50   12.50   0.00    0.00    0.00    0.00    0.00

All the values are expressed as % against total length of the sequence. You need to convert each value into numbers from percentage and length of the sequence, for each base

ADD REPLY
0
Entering edit mode

Hi, I really appreciate your help and I inform you that your script works very well. Although the results of the ambiguous nucleotides in each file are shown at the bottom without mentioning which sequence had that nucleotide. Is it possible to show me the ambiguous nucleotides in each sequence? I used the first method. Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 1030 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6