Hi everyone!
I have a list of orthologous groups that are single-copy genes among different species, e.g:
OG1.5_8000: sp1|Gene.10006::gi sp2|Gene.21810::GDRL01072110.1::g.21810::m.21810 sp3|Gene.49740::GECV01105569.1::g.49740::m.49740 sp4|XP_018429109.1 sp5|Gene.23019::GEGJ sp6|TRINITY_DN146402_c0_g3::TRINITY_DN146402_c0_g3_i2::g.23453::m.23453
I have a FASTA file (database) that contains protein sequences for each of those genes, e.g;
>sp1|Gene.10006::gi
HIEEEERNMAPERILITGCNRGIGYELVRQLIERRAPPNQIFATCRDPSSPQNKKLIDLA
ARYPKVVVIKLETTDANSIKEAVKVVEKHLNGAGLNLLINNAGIMPDSNLESADSADFLN
VYNTNVVGPFLVAKELLPFLKKAAAENPRKPLSCSKSAIINISTLLGSIEKTPETFSAYP
I'm using the following bash script to extract the sequences for all orthologous groups:
#!/bin/bash
# This is a bash script that extracts the sequences for all orthologous groups (OG).
# It takes the a OG ids list as input and saves all sequences belonging to that group
# from all organism in a file named with OG group in fasta format.
# Note that after the script is executed, there will be 'n' number of files (where
# n=total number of OG's in the input list
# Arun Seetharam <aseetharam@purdue.edu>
scriptName="${0##*/}"
outdir=$(pwd)
function printUsage() {
cat <<EOF
Synopsis
$scriptName [-h | --help] [-o dir_name] input_ids_list database
Description
Extracts sequences for all ortholog groups supplied as list. For each ID in the list
a file containing FASTA sequences will be generated, whcih belong to that OG.
Note: this script requires standalone blast+ software.
input_ids_list
Input list should contain orthologous group IDs one per line
These IDs should be generated by "orthomclMclToGroups" command
database
Absoulute path for the database should be specified. The database is
generally named as 'goodProteins.fasta'.
-o directory_name
directory name to save the output files. By default all files will be saved in
the current directory.
-h, --help
Brings up this help page
Author
Arun Seetharam, Bioinformatics Core, Purdue University.
aseetharam@purdue.edu
EOF
}
if [ $# -lt 1 ] ; then
printUsage
exit 1
fi
while getopts ':o:' option; do
case "$option" in
o) outdir=$OPTARG
shift
;;
h) printUsage
exit
;;
help) printUsage
exit
;;
esac
done
#module use /apps/group/bioinformatics/modules # uncomment these 2 lines if running this on clusters
#module load blast
mkdir -p $outdir
shift $(( $# - 2 ))
file=${1}
pathdbname=${2}
sed -i 's/://g' ${file}
while IFS=$' ' read -r -a myArray
do
for i in "${myArray[@]:0}";
do
blastdbcmd -entry "$i" -db ${pathdbname} >> ${outdir}/${myArray[0]}.fa;
done
done <${file}
However when I run it, somehow, it changes the format of the initial list of orthologous groups to:
OG1.5_8000 sp1|Gene.10006gi sp2|Gene.21810GDRL01072110.1g.21810m.21810 sp3|Gene.49740GECV01105569.1g.49740m.49740 sp4|XP_018429109.1 sp5|Gene.23019GEGJ sp6|TRINITY_DN146402_c0_g3TRINITY_DN146402_c0_g3_i2g.23453m.23453
(colons ":" are ignored).
This causes that BLAST is not able to find the entry in BLAST database and so the script fails. However, I haven't wrote this piece of code and I have no idea on how to modify it to make it work. Could anyone please help me?
Thank you very much in advance!
Hey Santiago, I was looking online for a a script to use it on a similar task you mentioned in this thread. I am trying using your script but it does not work for me. Actually the task I have is I have a list of OGs in a matrix. For example: [ OG1|Family1|D17P2#134peg1993 SCC393#128peg1816 Y4#137peg1426|chars 1-216 ] [ OG2|Family1|S23A#141peg1162 SCC4092#140peg898|chars 217-460 ] I have like 3000 OG in the matrix. I will extract them to be in a file and was thinking to use your script to go through the different OG and get me the CDS of these accession from a database folder having all taxa CDS (either in one file or separate files). The accession number are exactly the same. For example: