I want to make a reference table so that I can annotate antibiotic resistance gene hits with the antibiotic resistance gene category name
7.9 years ago

Hi

I am new to this, so apologies in advance for a trivial question. I want to make a reference table so that I can annotate antibiotic resistance gene hits with the antibiotic resistance gene category name, using the following commands:

1) Make a reference database for annotating the ARO numbers with the antibiotic resistance groups

cat ./aro.obo | tr "\n" "@" | sed 's/@@/\n/g' | grep -v format-version | grep -v Typedef | sed 's/\[Term\]@id\:\s//g' | sed 's/@.*@is_a/\tis_a/' | grep is_a | sed 's/@relationship.*//' | sed 's/is_a.*\!\s//' | sed 's/ /_/g' > ./ARO_numbers_and_AR_groups.tsv

2) Get a list of ARO numbers with their corresponding gene ID numbers and taxonomic associations from the FASTA file

3) The FASTA is annotated as a hierarchy, so all ARO numbers should be taken

grep '>' AR-polypeptides.fa | sed 's/>//' | sed 's/ARO:1000001//g' |sed 's/\s.*ARO/\tARO/' | sed 's/\ .*\[/\t[/' | sed 's/ /_/g' > ./gene_IDs_and_ARO_numbers_and_AR_groups.tsv

4) Next, merge the files (using awk) into a single reference database

awk 'FNR==NR { a[$1]=$2; next } $2 in a { print a[$2]"\t"$1"\t"$2"\t"$3 }' ./ARO_numbers_and_AR_groups.tsv ./gene_IDs_and_ARO_numbers_and_AR_groups.tsv > ./CARD_annotation_reference.tsv

While the first and second commands produce their output files, the awk step does not give any output. Here are some lines from the first and the second output files.

./ARO_numbers_and_AR_groups.tsv
ARO:0000000 antibiotic_molecule
ARO:0000001 antibiotic_molecule@synonym:_"quinolone"_EXACT_[]
ARO:0000002 tetracycline_resistance_gene
ARO:0000003 aminoglycoside@synonym:_"Astromicina"_EXACT_[]@synonym:_"Astromicine

./gene_IDs_and_ARO_numbers_and_AR_groups.tsv
gi|AAA76822.1|ARO:3002654|APH(3')-VIIa  [Campylobacter_jejuni]
gi|ABC26006.1|ARO:3001624|OXA-84    [Acinetobacter_baumannii]
gi|AAF86691.1|ARO:3001816|ACC-2 [Hafnia_alvei]
gi|AFU35065.1|ARO:3003206|lsaE  [Staphylococcus_aureus]
gi|AFM38048.1|ARO:3003206|lsaE  [Staphylococcus_aureus]

Could it be due to the different structure of the files, i.e. one is tab-separated while the other has the ARO number inside a |-separated field?
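To illustrate the suspicion: if the gene file really looks like the sample lines above, then `$2` in that file is the bracketed taxon rather than the ARO number, so `$2 in a` would never match. Below is a minimal sketch of a join that instead pulls the ARO number out of the |-separated first column. It assumes both files are tab-separated and that the ARO number is always the third |-separated piece of the header; the temporary file names and the `placeholder_group_*` labels are made up for illustration and are not real CARD data.

```shell
#!/bin/sh
# Sketch only: file names and group labels below are placeholders, not CARD data.
workdir=$(mktemp -d)

# Stand-in for ARO_numbers_and_AR_groups.tsv: ARO number <TAB> group label.
printf 'ARO:3002654\tplaceholder_group_A\nARO:3001624\tplaceholder_group_B\n' > "$workdir/groups.tsv"

# Stand-in for gene_IDs_and_ARO_numbers_and_AR_groups.tsv: header <TAB> [taxon].
printf "gi|AAA76822.1|ARO:3002654|APH(3')-VIIa\t[Campylobacter_jejuni]\n" > "$workdir/genes.tsv"
printf "gi|ABC26006.1|ARO:3001624|OXA-84\t[Acinetobacter_baumannii]\n" >> "$workdir/genes.tsv"
printf "gi|AFU35065.1|ARO:3003206|lsaE\t[Staphylococcus_aureus]\n" >> "$workdir/genes.tsv"

merged=$(awk -F'\t' '
    # First file: remember the group label for each ARO number.
    FNR == NR { group[$1] = $2; next }
    # Second file: the ARO number is the third |-separated piece of column 1.
    {
        split($1, id, "|")
        aro = id[3]
        if (aro in group)
            print group[aro] "\t" $1 "\t" aro "\t" $2
    }
' "$workdir/groups.tsv" "$workdir/genes.tsv")

printf '%s\n' "$merged"
rm -rf "$workdir"
```

In this sketch, rows whose ARO number has no entry in the group table (the lsaE row here) are silently dropped, which may or may not be the desired behaviour for the real reference table.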

I would appreciate it if someone could help me get this working.

Regards Mahdi

next-gen sequencing • 1.8k views