Yet another option: Pass the assemblies into Cortex, and dump unitigs (=supernodes in Cortex jargon), and then pass them back into Cortex, using Cortex's pan_genome_matrix option - it will give you a big matrix showing you which unitigs are in which samples. Then you can make your own choices about what percentage of samples a contig needs to be in, to be considered "core". 90%? 95%? 100%
etc
Roughly speaking, the command lines are
run_calls.pl --fastaq_index INDEX_SPECIFYING_SAMPLE_ID_AND_ASSEMBLY --kmer_size 21 --mem_height 21 --mem_width 100 --do_union no --auto_clean no --outdir DIR
This will make Cortex graph files of all the assemblies
Then, dump unitigs
ls DIR/binaries/unclean/31/*.ctx > list_of_binaries
ls list_of_binaries > pool
cortex_var_31_c1 --kmer_size 21 --mem_height 21 --mem_width 100 --colour_list pool --output_supernodes unitigs.txt
This dumps unitigs as fasta file called unitigs.txt
And finally, dump the matrix which has first column =contig-id, second column = % of 21mers in contig in sample1, next column= % of 21mers in contig in sample 2.. etc,
and rows are contigs.
cd DIR/binaries/unclean/31
for f in ls *.ctx
; do echo pwd
/$f > $f.filelist; done;
cd ../../../..
ls DIR/binaries/unclean/31/*filelist > colourlist_of_samples
cortex_var_31_c100 --kmer_size 31 --mem_height 21 --mem_width 100 --colour_list colourist_of_samples --pan_genome_matrix unitigs.txt --max_read_len <max contig="" length="" in="" unitigs.txt="">
is the "core genome" term means common genome? I would annotate genomes using the reference one, and then make a table of genes and their presence in each of 50 and their similarity to the reference - this table will help to extract any subset configured by minimal support and score thresholds
Yes core genome means common region among all species. I am thinking like this way after assembly i can map all the contig with contiguator to a reference genome and i can extract map contig and unmap contig. Next step if i run a Cd hit with all those contig i can get the common contig among all 50 genome. Whats your view regarding my opinion ??
sounds good to me. what can possibly happen - due to the contig level of granularity - is that you will end up with no room to change parameters except using the CD-HIT similarity threshold.
Yes i appreciate your opinion. If i go after annotation if will also be fine but it will take little bit more time doing in RAST or GenDB. I will try both way parallel. For Cd hit what will be ideal cut off could you please give me any idea??
I don't know, this is the part of the problem I was mentioning, while running cd-hit you may face the the situation when some contigs are not OK to be clustered together and some you'd rather keep together, but cd-hit will set them apart, because of lengths etc. but you may not face that - it depends on the assemblies, also it is my opinion, I might be wrong.