Hello guys,
I'm new to this community, so please bear with me if I make mistakes about how this forum works, and excuse my English (it's not my first language). I have some output files from PGAP (pan-genomes analysis pipeline), a tool that lets me identify the core genome, accessory genes and species-specific genes. I'm currently working with bacteria (3 species from the genus Proteus) and I need to get all the sequences from the core genome and put them into a single multiFASTA file (to use as input to Vaxign). The problem is that PGAP doesn't generate FASTA files; instead, the output files are in EMBL, .nuc, .pep and other formats. Can I convert these EMBL files into a multiFASTA file? I tried using seqretsplit to split the EMBL file into multiple EMBL files with one sequence each (to convert them to FASTA and merge them right after), but even after reading the manual I couldn't figure out how to do it properly (it only produced a file with a single sequence). Is there any other way to create a multiFASTA file? Can I use seqretsplit to convert the EMBL file to a multiFASTA file? If so, how do I do that? I tried using online converters, but I guess the files are too big. I apologize for my lack of computational skills; I'm new to the bioinformatics field. Thanks in advance!
Are you sure? Was that already in multi-fasta format or did it have just one sequence as you wrote?
The EMBL file had several CDSs, but when I converted it to multiFASTA and checked the result, it was actually a FASTA file with a single sequence. The command I used was:
Maybe I'm doing this wrong?
I am sure you must have checked the command syntax. File format conversions are always tricky. Is it possible to post a small example of the file you are dealing with?
Here are the first lines of the EMBL file:
You mention you also have .nuc and .pep files. Are those not the multi-FASTA files you're looking for? I'm not familiar with the output of PGAP, but it could be that the EMBL file is a single entry (e.g. one genome or chromosome) with multiple CDSs on it. In that case you can't split the EMBL file, since it is already a single entry. (You could check whether the PGAP EMBL output file has multiple lines that contain only ' // ', the record separator.)
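The record-separator check suggested above can be sketched in a few lines of Python. This is only a minimal sketch: the function name and the toy lines are made up for illustration, and it assumes the file follows the usual EMBL convention of ending each record with a line that holds nothing but `//`.

```python
def count_embl_records(lines):
    """Count lines that consist only of '//' (the EMBL record terminator)."""
    return sum(1 for line in lines if line.strip() == "//")


# Toy input (not real PGAP output): two records, each closed by '//'.
toy_lines = [
    "ID   TOY1; SV 1; linear; genomic DNA; STD; PRO; 8 BP.",
    "//",
    "ID   TOY2; SV 1; linear; genomic DNA; STD; PRO; 8 BP.",
    "//",
]
print(count_embl_records(toy_lines))
```

For a real file, pass an open file handle instead of the toy list. A count of 1 means the file is a single entry, so there is nothing for seqretsplit to split.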
The .pep files have amino acid sequences, but I need a FASTA file with nucleic acid sequences. Also, both .nuc and .pep contain the complete set of genes of each genome I submitted to PGAP, so they do not represent the core genome. I took the easy way out and asked a friend for help; he wrote a script to extract and merge some PGAP output files:
1.Orthologs_Cluster.txt (which has the information about the core genome and also duplications);
.nuc files (which have the sequences I need for making a multiFASTA file).
It worked great for extracting the specific information and merging it into a single file, but the resulting file lost its FASTA formatting. I guess I can fix the line width and edit the headers; I'm trying to figure that out. This is one way I've found to make a multiFASTA file, but it still takes a lot of time. Unfortunately the output files from PGAP aren't really good for visualization.
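The line-width and header clean-up mentioned above could look roughly like the sketch below. It is not the friend's script: the function names are hypothetical, and it assumes the set of core-genome gene IDs has already been collected from the cluster table and that each ID is the first word of a header in the .nuc file.

```python
import io


def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_multifasta(records, wanted_ids, out, width=60):
    """Write records whose ID is in wanted_ids, wrapping sequences at `width`.

    The ID is taken to be the first word of the header (an assumption).
    Returns the number of records written.
    """
    written = 0
    for header, seq in records:
        fields = header.split()
        seq_id = fields[0] if fields else ""
        if seq_id not in wanted_ids:
            continue
        out.write(f">{seq_id}\n")
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + "\n")
        written += 1
    return written


# Toy data: keep only the hypothetical gene 'g1' and wrap at 4 columns.
toy = [">g1 some description", "ACGT", "AC", ">g2", "TTTT"]
buf = io.StringIO()
write_multifasta(read_fasta(toy), {"g1"}, buf, width=4)
print(buf.getvalue(), end="")
```

With real data, `out` would be a file opened for writing and `read_fasta` would be fed the open .nuc file of each genome in turn.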
To get the nucleotide sequences, convert the CDS coordinates to GTF, then extract those coordinates from the reference with one of the many available tools, such as bedtools getfasta or samtools faidx.
Does your EMBL file contain the genomic sequence as well, or do you have that in a separate file?
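To make the coordinate-based extraction concrete, here is a minimal pure-Python sketch of what tools like bedtools getfasta do for a single region. It is an illustration, not a replacement for those tools: the function names are made up, the genome is assumed to be already loaded as one string, and coordinates are assumed to be 1-based and inclusive, as in GTF.

```python
# Translation table for complementing DNA; characters outside ACGT
# (for example N) are left unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_region(genome, start, end, strand="+"):
    """Extract genome[start..end] using 1-based inclusive coordinates.

    For strand '-', the reverse complement of the region is returned.
    """
    if not (1 <= start <= end <= len(genome)):
        raise ValueError(f"coordinates {start}..{end} are outside the sequence")
    region = genome[start - 1:end]
    return reverse_complement(region) if strand == "-" else region


# Toy sequence, not real data.
toy_genome = "AACCGGTT"
print(extract_region(toy_genome, 3, 6))
print(extract_region(toy_genome, 1, 3, "-"))
```

Note that BED files, unlike GTF, use 0-based half-open coordinates, so an off-by-one shift is needed if the coordinates come from BED.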
seqretsplit will not help you; it is meant for splitting a multi-record file into single-record files. What you want to do is extract the CDS sequences described in the EMBL file.
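As a rough illustration of reading the CDS features out of an EMBL file, the sketch below pulls the location of each CDS from the feature table (the `FT` lines). It is deliberately limited: the function name is hypothetical, only simple `start..end` and `complement(start..end)` locations are handled, and anything else (such as `join(...)`) is reported back rather than parsed. A library such as Biopython handles the full location syntax.

```python
import re

# A feature key starts right after 'FT' plus three spaces; qualifier
# lines are indented further and therefore do not match.
_CDS_LINE = re.compile(r"^FT {3}CDS\s+(\S+)")
_FORWARD = re.compile(r"^<?(\d+)\.\.>?(\d+)$")
_REVERSE = re.compile(r"^complement\(<?(\d+)\.\.>?(\d+)\)$")


def parse_cds_locations(lines):
    """Return (locations, skipped).

    locations: list of (start, end, strand) with 1-based inclusive coordinates.
    skipped:   location strings this simple parser does not understand.
    """
    locations, skipped = [], []
    for line in lines:
        match = _CDS_LINE.match(line)
        if not match:
            continue
        loc = match.group(1)
        fwd = _FORWARD.match(loc)
        rev = _REVERSE.match(loc)
        if fwd:
            locations.append((int(fwd.group(1)), int(fwd.group(2)), "+"))
        elif rev:
            locations.append((int(rev.group(1)), int(rev.group(2)), "-"))
        else:
            skipped.append(loc)
    return locations, skipped


# Toy feature table, not real PGAP output.
toy_lines = [
    "FT   CDS             3..8",
    'FT                   /locus_tag="toy_1"',
    "FT   CDS             complement(10..21)",
    "FT   CDS             join(30..40,50..60)",
]
print(parse_cds_locations(toy_lines))
```

Each (start, end, strand) tuple can then be used to cut the CDS out of the genomic sequence, whether that sequence sits in the same EMBL file or in a separate one.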
Ah, and depending on what you want to visualise, EMBL is a fine format (you can, e.g., easily load it into a genome browser to visualise the gene annotation) ;-) That is, by the way, one way to convert it to FASTA format (slow and cumbersome, but workable if you only need to process a few files): load the EMBL file and the sequence into a genome browser (e.g. IGV, GenomeView, ...) and then export the CDSs in FASTA format.
Looks like these are reference genomes from the Human Microbiome Project? Why not get the FASTA-format sequences from the NCBI entries, e.g. ABVP00000000.1 Proteus penneri ATCC 35198?