Hi,
I am somewhat new to using Biopython and have the following problem: I have a plasmid
with annotations (features), that I open with SeqIO.parse
. From this plasmid, I want to extract a certain region, together with the respective features. Hence I create a new SeqFeature
with the FeatureLocation
that I want to cut out of the plasmid. With this created feature
, I do
cutout = feature.extract(plasmid)
However, the function extract
only extracts features that are within the given range. It does not partially copy features and crops them. Is there a way, to do so? Or would I need to manually write a function to do so? Furthermore, it is unable to copy features with compound locations, for example features that wrap around the end and origin of the sequence.
Here is an example script
seq = "TA" + "A".join(["C"*9] * 10) + "AG"
test_plasmid = SeqRecord(
seq,
features=[
SeqFeature(FeatureLocation(0, len(seq)), type="source"),
SeqFeature(FeatureLocation(5, 15), type="CDS"),
SeqFeature(FeatureLocation(92, 99), type="CDS"),
SeqFeature(FeatureLocation(99, len(seq))+FeatureLocation(0, 5), type="CDS"), # compound feature wrapping around origin
SeqFeature(FeatureLocation(4, 96), type="misc_feature"),
SeqFeature(FeatureLocation(20, 50), type="gene"),
],
annotations={
"molecule_type": "ds-DNA",
"topology": "circular",
},
id="test",
name="test",
description="test",
)
feature = SeqFeature(FeatureLocation(90, len(seq))+FeatureLocation(0, 13))
cutout = feature.extract(test_plasmid)
print(test_plasmid .features)
print(cutout.features)
While the test_plasmid
has six features, only one (the third one) gets copied. I understand that cropping features is not done. But why is feature four not copied? It also lies within the range, although it is a compound location.
I am happy about any help!!!
Best,
Niklas
Am I right in thinking you're just trying to get features within a given range? Biopython supports this natively - see here for instance: Slicing Genbank File by 'gene_id' range . Pretty sure you're using SeqFeatures and
.extract
wrong above.Compound features will complicate things however, and I think you may have to write logic specifically to handle these. I can't say I've worked with these closely though so I would advise looking at the documentation.
You might also find this discussion about circular sequences and compound features helpful: https://github.com/biopython/biopython/issues/2158
This approach might be more useful: Extract genbank features inside a given range
Unless you are adamant about only isolating features with a specific range, I think the way to go might be to check if any part of any of the features you care about falls inside the range, and then pull out the feature in full. Its coords might terminate outside your specified initial range but maybe that's OK, I don't know.
You will also need to be mindful of the type of feature (e.g a
CDS
or asource
etc), IIRC, not all of them are extracted as features in the sense most would use it. Someone who has tested recently can correct me, but I think, for instance thesource
record is ignored unless you specifically try to retrieve it.Thank you for your quick reply.
I was using
.extract
, as I want to crop wrapped around the origin, which the native[start, end]
notation does not support. But the native slicing also does not support copying features that start or stop outside of the start..end range.However, I will look into the other posts also.