How To Extract Core Genome From Whole Genome Sequence Data?
2
0
Entering edit mode
11.0 years ago
HG ★ 1.2k

Hi everyone, I have 50 E.coli whole genome sequence data. I did denovo assembly with spades. As a result i have 50 contig file and 50 scaffold file. I want to extract core genome and accessory genome in separately from all 50 genome. Can any one suggest me how i can do proceed to next step?

• 6.2k views
ADD COMMENT
1
Entering edit mode

is the "core genome" term means common genome? I would annotate genomes using the reference one, and then make a table of genes and their presence in each of 50 and their similarity to the reference - this table will help to extract any subset configured by minimal support and score thresholds

ADD REPLY
1
Entering edit mode

Yes core genome means common region among all species. I am thinking like this way after assembly i can map all the contig with contiguator to a reference genome and i can extract map contig and unmap contig. Next step if i run a Cd hit with all those contig i can get the common contig among all 50 genome. Whats your view regarding my opinion ??

ADD REPLY
0
Entering edit mode

sounds good to me. what can possibly happen - due to the contig level of granularity - is that you will end up with no room to change parameters except using the CD-HIT similarity threshold.

ADD REPLY
0
Entering edit mode

Yes i appreciate your opinion. If i go after annotation if will also be fine but it will take little bit more time doing in RAST or GenDB. I will try both way parallel. For Cd hit what will be ideal cut off could you please give me any idea??

ADD REPLY
0
Entering edit mode

I don't know, this is the part of the problem I was mentioning, while running cd-hit you may face the the situation when some contigs are not OK to be clustered together and some you'd rather keep together, but cd-hit will set them apart, because of lengths etc. but you may not face that - it depends on the assemblies, also it is my opinion, I might be wrong.

ADD REPLY
0
Entering edit mode
11.0 years ago
5heikki 11k

Another option would be all vs all blast, some filtering, and then mclblastline..

ADD COMMENT
0
Entering edit mode
11.0 years ago

Yet another option: Pass the assemblies into Cortex, and dump unitigs (=supernodes in Cortex jargon), and then pass them back into Cortex, using Cortex's pan_genome_matrix option - it will give you a big matrix showing you which unitigs are in which samples. Then you can make your own choices about what percentage of samples a contig needs to be in, to be considered "core". 90%? 95%? 100% etc

Roughly speaking, the command lines are

run_calls.pl --fastaq_index INDEX_SPECIFYING_SAMPLE_ID_AND_ASSEMBLY --kmer_size 21 --mem_height 21 --mem_width 100 --do_union no --auto_clean no --outdir DIR

This will make Cortex graph files of all the assemblies

Then, dump unitigs

ls DIR/binaries/unclean/31/*.ctx > list_of_binaries ls list_of_binaries > pool

cortex_var_31_c1 --kmer_size 21 --mem_height 21 --mem_width 100 --colour_list pool --output_supernodes unitigs.txt

This dumps unitigs as fasta file called unitigs.txt

And finally, dump the matrix which has first column =contig-id, second column = % of 21mers in contig in sample1, next column= % of 21mers in contig in sample 2.. etc, and rows are contigs.

cd DIR/binaries/unclean/31 for f in ls *.ctx; do echo pwd/$f > $f.filelist; done; cd ../../../.. ls DIR/binaries/unclean/31/*filelist > colourlist_of_samples

cortex_var_31_c100 --kmer_size 31 --mem_height 21 --mem_width 100 --colour_list colourist_of_samples --pan_genome_matrix unitigs.txt --max_read_len <max contig="" length="" in="" unitigs.txt="">

ADD COMMENT
0
Entering edit mode

There's a lot more detail on what all those options mean in the Cortex manual: http://cortexassembler.sourceforge.net/cortex_var_user_manual.pdf

ADD REPLY
0
Entering edit mode

Thank you for your suggestion. I will have a look and let you know update.

ADD REPLY

Login before adding your answer.

Traffic: 2063 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6