Question

How To Extract Core Genome From Whole Genome Sequence Data?

0

10.9 years ago

HG ★ 1.2k

Hi everyone, I have whole-genome sequence data for 50 E. coli isolates. I did de novo assembly with SPAdes, so I now have 50 contig files and 50 scaffold files. I want to extract the core genome and the accessory genome separately from all 50 genomes. Can anyone suggest how I should proceed to the next step?

• 6.2k views

ADD COMMENT • link updated 8.3 years ago by Biostar 20 • written 10.9 years ago by HG ★ 1.2k

1

Does the term "core genome" mean the common genome? I would annotate the genomes against the reference one, then build a table of genes with their presence in each of the 50 genomes and their similarity to the reference. This table will help you extract any subset defined by minimum support and score thresholds.
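
A minimal sketch of how such a table could be queried, assuming it has already been loaded as a nested dictionary of percent identities (gene, then genome). The function name `core_genes`, the thresholds, and the toy data are hypothetical and only illustrate the "minimum support and score" idea:

```python
from typing import Dict, List

def core_genes(table: Dict[str, Dict[str, float]],
               genomes: List[str],
               min_support: float = 1.0,
               min_identity: float = 90.0) -> List[str]:
    """Return genes present (identity >= min_identity) in at least
    min_support (a fraction between 0 and 1) of the genomes."""
    core = []
    for gene, hits in table.items():
        # A genome missing from `hits` counts as "gene absent".
        present = sum(1 for g in genomes if hits.get(g, 0.0) >= min_identity)
        if present / len(genomes) >= min_support:
            core.append(gene)
    return sorted(core)

# Toy data (hypothetical): percent identity of each gene to the reference copy.
genomes = ["g1", "g2", "g3", "g4"]
table = {
    "geneA": {"g1": 99.0, "g2": 98.5, "g3": 97.0, "g4": 99.5},
    "geneB": {"g1": 99.0, "g2": 85.0, "g3": 96.0},
    "geneC": {"g1": 95.0, "g2": 93.0, "g3": 92.0},
}

strict_core = core_genes(table, genomes)                    # present in all genomes
relaxed_core = core_genes(table, genomes, min_support=0.75) # present in >= 75%
```

Genes that fail the support threshold would form the accessory set; lowering `min_support` or `min_identity` moves genes from accessory to core.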

ADD REPLY • link 10.9 years ago by Pavel Senin ★ 1.9k

1

Yes, core genome means the region common to all the genomes. My plan is this: after assembly, I can map all the contigs to a reference genome with CONTIGuator and extract the mapped and unmapped contigs. Next, if I run CD-HIT on all those contigs, I can get the contigs common to all 50 genomes. What is your view on this approach?

ADD REPLY • link 10.9 years ago by HG ★ 1.2k

0

Sounds good to me. What may happen, because you are working at contig-level granularity, is that you end up with no parameter left to tune except the CD-HIT similarity threshold.
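
As an illustration of the CD-HIT route, here is a rough sketch that reads a CD-HIT `.clstr` file and reports which clusters contain contigs from every genome. It assumes (this is not stated in the thread) that each contig name was prefixed with its genome ID and a `|` before clustering, e.g. `g1|c1`; the function names are hypothetical:

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# Member lines in a .clstr file look like: "0\t1500nt, >g1|c1... *"
MEMBER = re.compile(r">(.+?)\.\.\.")

def clusters_to_genomes(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each cluster name to the set of genome IDs found in it."""
    clusters: Dict[str, Set[str]] = defaultdict(set)
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = line[1:]  # e.g. "Cluster 0"
        elif current is not None:
            match = MEMBER.search(line)
            if match:
                # Assumption: the genome ID is the part before the first "|".
                clusters[current].add(match.group(1).split("|")[0])
    return dict(clusters)

def core_clusters(clusters: Dict[str, Set[str]], n_genomes: int,
                  min_fraction: float = 1.0) -> List[str]:
    """Clusters whose members cover at least min_fraction of the genomes."""
    return sorted(name for name, ids in clusters.items()
                  if len(ids) / n_genomes >= min_fraction)

# Toy .clstr content with three genomes (hypothetical).
example = """>Cluster 0
0\t1500nt, >g1|c1... *
1\t1480nt, >g2|c5... at +/98.20%
2\t1490nt, >g3|c2... at -/97.90%
>Cluster 1
0\t900nt, >g1|c7... *
1\t880nt, >g3|c9... at +/99.00%
""".splitlines()

clusters = clusters_to_genomes(example)
core = core_clusters(clusters, n_genomes=3)
```

With real data you would pass an open file handle instead of `example`. Note that the only knobs here are the CD-HIT identity threshold used upstream and `min_fraction`, which is the limitation described above.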

ADD REPLY • link 10.9 years ago by Pavel Senin ★ 1.9k

0

Yes, I appreciate your opinion. Going via annotation would also be fine, but it will take a little more time in RAST or GenDB. I will try both approaches in parallel. What would be an ideal cutoff for CD-HIT? Could you please give me any idea?

ADD REPLY • link 10.9 years ago by HG ★ 1.2k

0

I don't know; this is the part of the problem I was mentioning. When running CD-HIT you may find that some contigs get clustered together when they should not be, while others that you would rather keep together are set apart by CD-HIT because of their lengths, etc. You may not face that at all; it depends on the assemblies. Also, this is only my opinion, and I might be wrong.

ADD REPLY • link 10.8 years ago by Pavel Senin ★ 1.9k

score 0 · Answer 1 · 2014-01-16

0

10.9 years ago

5heikki 11k

Another option would be an all-vs-all BLAST, some filtering, and then mclblastline.
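
One way the "some filtering" step might look, as a sketch only. It assumes the BLAST results are in the standard 12-column tabular format (`-outfmt 6`) and keeps non-self hits above hypothetical identity and e-value cutoffs, returning query, subject and bit score so they can be written out as a three-column edge list for clustering:

```python
from typing import Iterable, List, Tuple

def filter_blast(lines: Iterable[str],
                 min_identity: float = 90.0,
                 max_evalue: float = 1e-10) -> List[Tuple[str, str, float]]:
    """Keep non-self hits passing the identity and e-value cutoffs.

    Columns (outfmt 6): qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    edges = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        pident, evalue, bitscore = float(fields[2]), float(fields[10]), float(fields[11])
        if query == subject:
            continue  # drop self hits
        if pident >= min_identity and evalue <= max_evalue:
            edges.append((query, subject, bitscore))
    return edges

# Toy hits (hypothetical sequence names and values).
hits = [
    "g1|c1\tg1|c1\t100.0\t1500\t0\t0\t1\t1500\t1\t1500\t0.0\t2771",
    "g1|c1\tg2|c5\t98.2\t1480\t26\t1\t1\t1480\t3\t1482\t0.0\t2580",
    "g1|c1\tg3|c9\t72.5\t300\t80\t3\t10\t309\t40\t339\t1e-05\t95.1",
]
edges = filter_blast(hits)
```

The cutoffs are placeholders; suitable values depend on how divergent the genomes are.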

ADD COMMENT • link 10.9 years ago by 5heikki 11k

score 0 · Answer 2 · 2014-01-16

Yet another option: pass the assemblies into Cortex and dump unitigs (= supernodes in Cortex jargon), then pass them back into Cortex using its pan_genome_matrix option. It will give you a big matrix showing which unitigs are in which samples. Then you can make your own choice about what percentage of samples a contig needs to be in to be considered "core": 90%? 95%? 100%?

Roughly speaking, the command lines are:

run_calls.pl --fastaq_index INDEX_SPECIFYING_SAMPLE_ID_AND_ASSEMBLY --kmer_size 21 --mem_height 21 --mem_width 100 --do_union no --auto_clean no --outdir DIR

This will make Cortex graph files of all the assemblies.

Then, dump unitigs:

ls DIR/binaries/unclean/31/*.ctx > list_of_binaries
ls list_of_binaries > pool

cortex_var_31_c1 --kmer_size 21 --mem_height 21 --mem_width 100 --colour_list pool --output_supernodes unitigs.txt

This dumps the unitigs as a FASTA file called unitigs.txt.

And finally, dump the matrix. Rows are contigs: the first column is the contig ID, the second column is the % of 21-mers in the contig that are in sample 1, the next column is the % of 21-mers in the contig that are in sample 2, and so on.

cd DIR/binaries/unclean/31
for f in *.ctx; do echo "$(pwd)/$f" > "$f.filelist"; done
cd ../../../..
ls DIR/binaries/unclean/31/*filelist > colourlist_of_samples

cortex_var_31_c100 --kmer_size 21 --mem_height 21 --mem_width 100 --colour_list colourlist_of_samples --pan_genome_matrix unitigs.txt --max_read_len <max contig length in unitigs.txt>
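
Once the matrix exists, choosing a "core" threshold could be sketched as below. This assumes the matrix is whitespace-separated with no header line, that the first column is the contig ID, and that the remaining columns are percentages on a 0 to 100 scale; check these against the actual Cortex output. The function name and both cutoffs are hypothetical:

```python
from typing import Iterable, List, Tuple

def classify_unitigs(lines: Iterable[str],
                     min_kmer_pct: float = 90.0,
                     core_fraction: float = 0.95) -> Tuple[List[str], List[str]]:
    """Split unitigs into (core, accessory).

    A unitig counts as present in a sample if at least min_kmer_pct percent
    of its k-mers are found there; it is "core" if it is present in at least
    core_fraction of the samples.
    """
    core, accessory = [], []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        contig_id = fields[0]
        percentages = [float(x) for x in fields[1:]]
        present = sum(1 for p in percentages if p >= min_kmer_pct)
        if present / len(percentages) >= core_fraction:
            core.append(contig_id)
        else:
            accessory.append(contig_id)
    return core, accessory

# Toy matrix with four samples (hypothetical values).
rows = [
    "u1 100 100 98 100",
    "u2 100 0 0 95",
    "u3 100 100 100 40",
]
core, accessory = classify_unitigs(rows)
```

Rerunning with a different `core_fraction` (for example 0.90 or 1.0) shows how sensitive the core set is to the cutoff.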