Is there a way to separate plasmid contigs from Ion Torrent WGS data?
5.8 years ago
Optimist ▴ 190

I have a FASTQ file and a FASTA file for a Klebsiella pneumoniae genome, and I would like to separate the plasmid contigs from the WGS FASTA file, which contains 221 contigs.

Having the plasmid contigs on their own would help me characterize the plasmids and the resistance genes they carry using a circular representation, and would let me carry out further specialized downstream analysis.

Can someone suggest a tool or pipeline for this?

Thanks & Regards

Optimist

Assembly Plasmid WGS Ion torrent • 1.3k views
5.8 years ago
5heikki 11k

There may be plasmid-specific binning programs, and even metagenome binning programs such as MaxBin may be able to do what you want. However, since complete reference genomes exist for Klebsiella pneumoniae, perhaps the easiest way is to download one, e.g. this one. It contains seven sequences, one chromosome and six plasmids:

>NC_016845.1 Klebsiella pneumoniae subsp. pneumoniae HS11286 chromosome, complete genome
>NC_016838.1 Klebsiella pneumoniae subsp. pneumoniae HS11286 plasmid pKPHS1, complete sequence
>NC_016846.1 Klebsiella pneumoniae subsp. pneumoniae HS11286 plasmid pKPHS2, complete sequence
>NC_016839.1 Klebsiella pneumoniae subsp. pneumoniae HS11286 plasmid pKPHS3, complete sequence
>NC_016840.1 Klebsiella pneumoniae subsp. pneumoniae HS11286 plasmid pKPHS4, complete sequence
>NC_016847.1 Klebsiella pneumoniae subsp. pneumoniae HS11286 plasmid pKPHS5, complete sequence
>NC_016841.1 Klebsiella pneumoniae subsp. pneumoniae HS11286 plasmid pKPHS6, complete sequence

Extract the chromosome sequence from the fasta file into another fasta file, then blast your contigs against that new file. Contigs that produce long, high-identity alignments to the chromosome most likely represent non-plasmid DNA.
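If you want to do the extraction step by hand, here is a minimal sketch with awk. The function name extract_record and the file names in the example are my own placeholders, and it assumes the accession is the first whitespace-delimited token of the header, as in the listing above.

```shell
# Print the single FASTA record whose header ID equals the given accession.
# Usage: extract_record ACCESSION FASTA_FILE
# (extract_record is a hypothetical helper name, not part of any tool.)
extract_record() {
    awk -v acc="$1" '
        /^>/ { keep = ($1 == ">" acc) }   # on each header, check for an exact ID match
        keep                              # print header and sequence lines of the match
    ' "$2"
}

# Example (placeholder file names):
# extract_record NC_016845.1 theRefFile.fa > chromosome.fa
```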


Thank you for your answer.

I need to separate the chromosomal contigs and the plasmid contigs in my fasta file, exactly as they are separated in the HS11286 reference file you showed.

Could you please elaborate on the solution you suggested?

I have downloaded the Klebsiella pneumoniae reference genome file from NCBI.

Thanks & regards


Well, you don't really even need to extract any sequences from the file I linked. Just blast your contigs against it:

blastn -query yourContigs.fa -subject theRefFile.fa -outfmt 6 > blastResult.txt

As the output file will not be that big, you can even open it in Excel (fields are tab-separated). Studying just the first four columns ought to take you far (query, subject, percent identity and alignment length). However, if you want to be efficient and have a bash shell at hand, this outputs only the best hit for each of your contigs:
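If you would rather not eyeball those four columns, a rough filter along these lines may help. This is only a sketch: filter_hits is a made-up helper name, and the thresholds in the example (90% identity, 1000 bp) are arbitrary assumptions you should tune to your data.

```shell
# Keep BLAST tabular (-outfmt 6) hits with at least MIN_IDENTITY percent
# identity (column 3) and at least MIN_LENGTH alignment length (column 4),
# printing only query, subject, identity and length.
# Usage: filter_hits MIN_IDENTITY MIN_LENGTH BLAST_TABLE
filter_hits() {
    awk -F '\t' -v OFS='\t' -v id="$1" -v len="$2" \
        '$3 >= id && $4 >= len { print $1, $2, $3, $4 }' "$3"
}

# Example (thresholds are assumptions, not recommendations):
# filter_hits 90 1000 blastResult.txt
```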

export LANG=C LC_ALL=C; sort -t $'\t' -k1,1 -k12,12gr -k11,11g blastResult.txt | sort -t $'\t' -uk1,1 > blastResultBestHits

Then to show only the contigs where the best hit was against the reference genome chromosome:

awk 'BEGIN{FS="\t"}{if($2~/NC_016845/){print $1}}' blastResultBestHits

And contigs where the best hit was against something other than the reference genome chromosome:

awk 'BEGIN{FS="\t"}{if($2!~/NC_016845/){print $1}}' blastResultBestHits
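Note that BLAST tabular output only lists queries that produced at least one hit, so contigs with no hit against the reference show up in neither list. A sketch for listing them, assuming the contig ID is the first whitespace-delimited token of each fasta header (contigs_without_hits is a hypothetical helper name):

```shell
# List contig IDs from the assembly that never appear as a query (column 1)
# in the BLAST tabular output.
# Usage: contigs_without_hits ASSEMBLY_FASTA BLAST_TABLE
contigs_without_hits() {
    awk -F '\t' '
        FILENAME == ARGV[1] { hit[$1]; next }   # BLAST table: remember every query ID
        /^>/ {
            id = substr($0, 2)                  # drop the leading ">"
            sub(/[ \t].*/, "", id)              # keep only the first token of the header
            if (!(id in hit)) print id
        }
    ' "$2" "$1"
}

# Example (file names as used above):
# contigs_without_hits yourContigs.fa blastResultBestHits
```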

Search "how to extract sequences from fasta based on header" to find out how to then extract the corresponding contigs from your assembly fasta file.
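For reference, one of the many ways to do that extraction with plain awk is sketched below. It assumes you saved the contig IDs printed by one of the awk commands above into a text file, one ID per line; extract_by_ids and the file names are placeholders of mine.

```shell
# Print every FASTA record whose ID (first token after ">") is listed
# in IDS_FILE, one ID per line.
# Usage: extract_by_ids IDS_FILE FASTA_FILE
extract_by_ids() {
    awk '
        FILENAME == ARGV[1] { wanted[$1]; next }    # ID list: remember each ID
        /^>/ { keep = (substr($1, 2) in wanted) }   # header: is this record wanted?
        keep                                        # print lines of wanted records
    ' "$1" "$2"
}

# Example (placeholder file names):
# extract_by_ids plasmidContigIds.txt yourContigs.fa > plasmidContigs.fa
```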
