I have FASTQ and FASTA files for a Klebsiella pneumoniae genome and would like to separate the plasmid contigs from the WGS FASTA file, which contains 221 contigs.
Having the plasmid contigs on their own would help me characterize the plasmids and the resistance genes they carry using a circular representation, and would let me carry out specialized downstream analyses.
Can someone suggest a tool or pipeline for this?
There might be some plasmid-specific binning program. Even metagenome binning programs, e.g. MaxBin, may be able to achieve what you want. However, since complete reference genomes exist for Klebsiella pneumoniae, perhaps the easiest way is to download one, e.g. this one. It contains 7 sequences: one chromosome and six plasmids.
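To see which sequences a downloaded reference contains, you can list its FASTA headers. This is only a sketch: `list_headers` and `reference.fasta` are made-up names, so substitute your own.

```shell
# Hypothetical helper: print every header line of a FASTA file.
list_headers() {
    # $1 = FASTA file; header lines are the ones starting with ">"
    grep '^>' "$1"
}

# Example call (reference.fasta is a placeholder file name):
# list_headers reference.fasta
```

The part of each header up to the first space is the ID that BLAST reports as the subject.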
Extract the chromosome sequence from the FASTA file into another FASTA file, then BLAST your contigs against the new file. Contigs that produce long alignments to the chromosome most likely represent non-plasmid DNA.
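A rough sketch of the extraction step, assuming the chromosome ID in your reference file is `NC_016845.1` (check the headers, this is my assumption). `extract_record`, `reference.fasta` and `chromosome.fasta` are placeholder names.

```shell
# Hypothetical helper: print the single FASTA record whose ID matches $1.
extract_record() {
    # $1 = sequence ID (header text up to the first space, without ">")
    # $2 = FASTA file
    awk -v id="$1" '
        /^>/ { keep = (substr($1, 2) == id) }  # on each header, decide whether to keep the record
        keep                                   # print header and sequence lines of the kept record
    ' "$2"
}

# Example call (ID and file names are assumptions):
# extract_record NC_016845.1 reference.fasta > chromosome.fasta
```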
Thank you for your answer.
I need to separate the chromosomal contigs from the plasmid contigs in a FASTA file, exactly as in the example reference file HS11286 you showed.
Could you please elaborate on the solution you gave?
I have downloaded the Klebsiella pneumoniae reference genome file from NCBI.
Thanks & regards
Well, you don't really even need to extract any sequences from the file I linked. Just blast your contigs against it:
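Something along these lines should work, assuming BLAST+ is installed. The names `blast_contigs`, `contigs.fasta`, `reference.fasta` and `blast_results.tsv` are placeholders, and the e-value cut-off is an arbitrary choice of mine.

```shell
# Hypothetical wrapper around blastn; nothing runs until the function is called.
blast_contigs() {
    # $1 = assembly contigs (query), $2 = reference genome (subject), $3 = output table
    # -outfmt 6 writes tab-separated columns:
    # qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    blastn -query "$1" -subject "$2" -outfmt 6 -evalue 1e-10 > "$3"
}

# Example call (placeholder file names):
# blast_contigs contigs.fasta reference.fasta blast_results.tsv
```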
As the output file will not be that big, you can even open it in Excel (the fields are tab-separated). Studying just the first four columns (query, subject, percent identity and alignment length) ought to take you far. However, if you want to be efficient and have a bash shell at hand, this outputs only the best hit for each of your contigs:
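One way to do it is sketched below, assuming the table is in BLAST's tabular format (`-outfmt 6`), where column 12 is the bit score. "Best" here means highest bit score, which is my choice of criterion; `best_hits` and the file names are placeholders.

```shell
# Hypothetical helper: keep one row per query, the one with the highest bit score.
best_hits() {
    # $1 = BLAST table in -outfmt 6 (column 1 = query, column 12 = bit score)
    awk -F'\t' '
        {
            score = $12 + 0                       # force numeric comparison
            if (!($1 in best)) {                  # first row seen for this query
                order[++n] = $1
                best[$1] = score
                row[$1] = $0
            } else if (score > best[$1]) {        # a better-scoring row for this query
                best[$1] = score
                row[$1] = $0
            }
        }
        END { for (i = 1; i <= n; i++) print row[order[i]] }  # keep original query order
    ' "$1"
}

# Example call (placeholder file names):
# best_hits blast_results.tsv > best_hits.tsv
```

Contigs with no hit at all do not appear in the BLAST table, so they will be missing from this output as well.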
Then to show only the contigs where the best hit was against the reference genome chromosome:
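A sketch of that filter, assuming a best-hit table in BLAST tabular format and that the chromosome appears in column 2 as `NC_016845.1` (an assumption, check your reference headers). `chromosome_contigs` and the file names are placeholders.

```shell
# Hypothetical helper: list contigs whose best hit is the chromosome.
chromosome_contigs() {
    # $1 = best-hit table (column 1 = contig, column 2 = subject ID)
    # $2 = chromosome ID exactly as it appears in column 2
    awk -F'\t' -v chr="$2" '$2 == chr { print $1 }' "$1"
}

# Example call (ID and file names are assumptions):
# chromosome_contigs best_hits.tsv NC_016845.1 > chromosomal_ids.txt
```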
And contigs where the best hit was against something other than the reference genome chromosome:
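The inverse filter can be sketched the same way, again assuming a best-hit table in BLAST tabular format and `NC_016845.1` as the chromosome ID (an assumption). `non_chromosome_contigs` and the file names are placeholders.

```shell
# Hypothetical helper: list contigs whose best hit is NOT the chromosome,
# together with the sequence they hit.
non_chromosome_contigs() {
    # $1 = best-hit table (column 1 = contig, column 2 = subject ID)
    # $2 = chromosome ID exactly as it appears in column 2
    awk -F'\t' -v chr="$2" '$2 != chr { print $1 "\t" $2 }' "$1"
}

# Example call (ID and file names are assumptions):
# non_chromosome_contigs best_hits.tsv NC_016845.1 > plasmid_candidates.tsv
```

Contigs without any hit to the reference are absent from the table, so neither list will contain them; they may be worth a separate look, since they could come from plasmids the reference strain does not carry.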
Search for "how to extract sequences from fasta based on header" to find out how to then extract the corresponding sequences from your assembly FASTA file.
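For reference, one common approach is a short awk filter. This is a sketch, assuming a non-empty file with one contig ID per line and FASTA headers whose ID is the text up to the first space; `extract_by_id` and the file names are placeholders.

```shell
# Hypothetical helper: print the FASTA records whose IDs are listed in a file.
extract_by_id() {
    # $1 = file with one contig ID per line (must not be empty)
    # $2 = assembly FASTA file
    awk '
        NR == FNR { if ($1 != "") want[$1] = 1; next }    # first file: remember wanted IDs
        /^>/ { keep = (substr($1, 2) in want) }           # header: is this ID wanted?
        keep                                              # print lines of wanted records
    ' "$1" "$2"
}

# Example call (placeholder file names):
# extract_by_id chromosomal_ids.txt contigs.fasta > chromosomal_contigs.fasta
```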