Question

How To Trim An Arb Reference Database To A Specific 16S Region

1


11.6 years ago

samlambrechts299 ▴ 170

Hi everybody,

I was wondering whether it is possible to trim a 16S reference database in arb format, for example the SILVA SSU Ref database?

I know how to trim reference databases in fasta format (Taxman etc.), but I can't figure out how to convert the result to arb format, or how to do the trimming directly on the arb files (.arb extension).

Any help or hints are greatly appreciated!

Kind regards,

a desperate master student

database trimming phylogenetics metagenomics • 6.4k views

ADD COMMENT • link updated 3.8 years ago by 1731147b • 0 • written 11.6 years ago by samlambrechts299 ▴ 170

0


Additional information:

My goal is downstream phylogenetic analysis of pyrosequencing amplicons. Phylogenetic comparison of the short pyrosequencing reads with the nearly full-length sequences in existing arb-format reference databases (for example the SILVA SSU Ref database) results in completely wrong phylogenetic placement, so I want to try the analysis with reference sequences truncated to the length of my reads. I know how to trim the reference sequences, but the first problem is converting a fasta file (with the truncated reference sequences) to a file with the .arb extension. Since we are talking about tens of thousands of sequences, manually changing each line in the text file is not an option. A second problem is that the arb program needs a tree file in arb format alongside the bare sequences in arb format...

If what I am trying to do is completely impossible, please tell me that as well.

Thank you for any help or hints you are able to provide.

ADD REPLY • link 11.6 years ago by samlambrechts299 ▴ 170

0

How can I trim reference databases in fasta format? I need the V3 and V4 regions of the SILVA 138 database. Best regards, a desperate PhD student

ADD REPLY • link 3.8 years ago by 1731147b • 0

score 2 · Answer 1 · 2013-04-17

I assume you are doing this for amplicon (metagenomics) or downstream phylogenetic analysis? You can use the ARB platform to export the data to FASTA format for your database or matrix construction. The .arb format is binary, so you would otherwise have to convert it or work out how to parse it if you need the data in that format. I think the easiest route is to download the data from the SILVA FTP site in FASTA format: they provide both their own format (ARB) and FASTA, and the data are the same.

UPDATE (based on additional comment):

RE: completely wrong phylogenetic placement

I'm unclear why you are having problems placing your sequences onto the reference database using phylogenetic methods. Trimming may help "refine" your phylogenetic placement, but I would first look at your alignment and make sure you are comparing homologous regions. For example, if you are looking at the 16S rRNA (SSU) sequence, are you certain you are using the correct region? If you cannot infer homology in your data matrix, then no amount of trimming or editing is going to help you. Once you have an aligned data matrix and you can see that your sequences are homologous, then it may help to trim the data matrix.
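One quick way to check that reads and references cover the same part of the alignment is to look at where the non-gap characters of each aligned sequence start and stop. The following is only a minimal sketch, assuming the sequences are already aligned to the same coordinate system and that gaps are written as `-` or `.`; the function names `covered_span` and `overlap` are made up for illustration and are not part of ARB or SILVA.

```python
from typing import Optional, Tuple

Span = Tuple[int, int]  # half-open interval of alignment columns: [start, end)


def covered_span(aligned_seq: str, gap_chars: str = "-.") -> Optional[Span]:
    """Return the column range spanned by the non-gap characters.

    Returns None when the sequence consists only of gap characters.
    """
    positions = [i for i, c in enumerate(aligned_seq) if c not in gap_chars]
    if not positions:
        return None
    return positions[0], positions[-1] + 1


def overlap(span_a: Optional[Span], span_b: Optional[Span]) -> int:
    """Number of alignment columns shared by two spans (0 if they are disjoint)."""
    if span_a is None or span_b is None:
        return 0
    return max(0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))


if __name__ == "__main__":
    # Toy aligned sequences (assumed data, not real 16S).
    reference = "ACGUACGUACGU"
    read = "----ACGU----"
    print(covered_span(read))
    print(overlap(covered_span(reference), covered_span(read)))
```

If the overlap between a read and the references is zero or only a handful of columns, the read does not cover the same region and trimming will not rescue the placement.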

RE: first problem is the conversion of a fasta file (with the truncated reference sequences) to a file with arb extension. A second problem is that the arb program needs a tree file in arb format next to the bare sequences in arb format...

I'm a little confused about why you feel you have to use the ARB platform. There are very well-developed methods for working with sequence files in text format. As I mentioned previously, I think it would be in your best interest to use the FASTA files from the SILVA database instead of the ARB platform and file format. Yes, ARB is set up to use the binary database files in its own platform, but using the FASTA files, aligning with a commonly used program (I typically use MUSCLE), and then using phylogenetic methods or a phylogenetically based amplicon sequencing method (I like TopiaryExplorer) will get you a lot farther than using a self-contained system such as ARB. See the post "Bacterial Phylogeny", which briefly describes my typical phylogenetic workflow.

RE: Since we are talking about tens of thousands of sequences, manually changing each line in the text file is not an option.

There are easier ways of doing things than "manually changing each line". That is why we are here: to help you learn how to trim thousands of FASTA sequences in seconds instead of spending months trimming them by hand and possibly making errors along the way (because you'll get sick and tired of editing all those sequences and lose concentration). See also the post "Trim The Fasta Title".
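To give an idea of how little code this takes, here is a minimal sketch in plain Python that cuts every sequence in an aligned FASTA file down to a window of alignment columns. It assumes the input is an aligned export (every sequence on the same column coordinates, gaps written as `-` or `.`); the `start` and `end` values are placeholders you would have to determine yourself, for example from where your primers or reads fall in the alignment, and the function names are my own, not from any existing tool.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple

GAP_CHARS = "-."  # assumed gap symbols in the aligned FASTA

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(handle: TextIO) -> Iterator[Record]:
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def trim_alignment(records: Iterable[Record], start: int, end: int,
                   min_bases: int = 1) -> Iterator[Record]:
    """Keep alignment columns [start, end) of every sequence.

    Sequences with fewer than min_bases non-gap characters left in the
    window are dropped, since they carry no information for that region.
    """
    for header, seq in records:
        window = seq[start:end]
        n_bases = sum(1 for c in window if c not in GAP_CHARS)
        if n_bases >= min_bases:
            yield header, window


def write_fasta(records: Iterable[Record], handle: TextIO) -> None:
    """Write records as unwrapped FASTA."""
    for header, seq in records:
        handle.write(f">{header}\n{seq}\n")


if __name__ == "__main__":
    # Toy aligned input (assumed data); real use would open files instead.
    demo = io.StringIO(">seqA\nAC-GU\nACGU\n>seqB\n....-\n--ACG\n")
    out = io.StringIO()
    write_fasta(trim_alignment(read_fasta(demo), start=2, end=6), out)
    print(out.getvalue(), end="")
```

For real data you would replace the `io.StringIO` objects with `open("input.fasta")` and `open("trimmed.fasta", "w")`. Because the records are streamed one at a time, this works on tens of thousands of sequences without loading the whole file into memory.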