Better merging of 'staggered' overlapping entities #35

RichJackson · 2024-06-10T12:18:01Z

Original comments from @EFord36

For a string:

Acute lymphoblastic leukemia, Acute myeloid leukemia, Chronic myelogenous leukemia, Chronic lymphocytic leukemia

if we have synonyms for all the actual items in the list, but also for these staggered spans:

lymphoblastic leukemia, Acute
myeloid leukemia, Chronic
myelogenous leukemia, Chronic

The MergeOverlappingEntsStep currently says 'all of these are overlapping, I'm only picking one' and then picks the longest (everything else being equal).

Instead, we could do something cleverer: rather than ranking individual entities and taking the single top one, we would rank the candidate sets of mutually non-overlapping entities and take the set that covers the widest span of text within the total overlapping span.

My 'cleverer' suggestion above is a version of Weighted Single-Interval Scheduling Maximisation, which is easy to implement naively with O(n^2) complexity. Wikipedia has pseudocode for an O(n) implementation (though it does not keep track of the intervals chosen): https://en.wikipedia.org/wiki/Interval_scheduling#Weighted
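As a rough illustration of the idea, here is a minimal sketch of weighted interval scheduling that also recovers the chosen intervals. It is not kazu code: `Span` and `best_non_overlapping` are hypothetical names, the weight is assumed to be just the span length in characters, and end offsets are assumed to be exclusive. It sorts first, so it runs in O(n log n) overall.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Sequence


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    match: str


def best_non_overlapping(spans: Sequence[Span]) -> List[Span]:
    """Return the non-overlapping subset of spans covering the most characters."""
    ordered = sorted(spans, key=lambda s: s.end)
    ends = [s.end for s in ordered]
    n = len(ordered)
    # best[i]: best total weight using only the first i spans (ordered by end).
    best = [0] * (n + 1)
    # prev[i]: how many of the earlier spans end at or before ordered[i - 1].start.
    prev = [0] * (n + 1)
    for i, span in enumerate(ordered, start=1):
        prev[i] = bisect_right(ends, span.start, 0, i - 1)
        weight = span.end - span.start  # assumption: weight is the span length
        best[i] = max(best[i - 1], best[prev[i]] + weight)

    # Walk back through the table to recover which spans were chosen.
    chosen: List[Span] = []
    i = n
    while i > 0:
        span = ordered[i - 1]
        if best[prev[i]] + (span.end - span.start) > best[i - 1]:
            chosen.append(span)
            i = prev[i]
        else:
            i -= 1
    return chosen[::-1]


text = (
    "Acute lymphoblastic leukemia, Acute myeloid leukemia, "
    "Chronic myelogenous leukemia, Chronic lymphocytic leukemia"
)
list_items = text.split(", ")
staggered = [
    "lymphoblastic leukemia, Acute",
    "myeloid leukemia, Chronic",
    "myelogenous leukemia, Chronic",
]
candidates = [
    Span(text.find(m), text.find(m) + len(m), m) for m in list_items + staggered
]
selected = best_non_overlapping(candidates)
```

On the example string, this selection keeps the four actual list items and drops the three staggered spans. Replacing the length-based weight with something richer is where the weighting question comes in.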

One concern with this approach is coming up with the weights: we risk introducing a bunch of 'magic numbers' for how we weight entities with mappings vs. the length of the span vs. ent_class_preferred_order vs. the final entity class name choice.

For this work, we probably want to capture a test set of 'interesting merge cases'. We could collect these by running kazu over the full-text articles set I used for performance benchmarking, with an if clause added to MergeOverlappingEntsStep that writes out cases where it has to make a decision and the group isn't just a simple single set of mutually overlapping entities.
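The check for an 'interesting' case could be sketched roughly as below. This is an assumption about how the test might look, not existing kazu code: `is_interesting_group` is a hypothetical helper, and spans are assumed to be `(start, end)` tuples with an exclusive end. A group is flagged when at least one pair in it does not overlap, meaning the group is only connected through some third entity.

```python
from itertools import combinations
from typing import Sequence, Tuple

Offsets = Tuple[int, int]  # (start, end) with an exclusive end offset


def overlaps(a: Offsets, b: Offsets) -> bool:
    """True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def is_interesting_group(group: Sequence[Offsets]) -> bool:
    """True if the group is NOT a single set of mutually overlapping spans."""
    return any(not overlaps(a, b) for a, b in combinations(group, 2))


# 'headache' (0-8), 'dizziness' (9-18) and 'headache dizziness' (0-18)
staggered_group = [(0, 8), (9, 18), (0, 18)]
# Three spans that all overlap one another
simple_group = [(0, 5), (2, 8), (4, 10)]
```

Groups for which this returns True are the ones worth writing out to the test set.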

Another difficulty here is how we handle non-contiguous entities: we can probably keep handling them as we currently do for now, but it isn't especially principled.

Another case where this causes a bug:

'headache dizziness' in the current meddra model pack

‘Headache’ and ‘dizziness’ each get an exact match from explosion.
Transformers produces a ‘headache dizziness’ match that doesn’t get mapped, but it is not filtered out by the cleanup step, because the meddra model pack only filters out unmapped entities if they come from the explosion step.
The merge step calls group_entities_by_location and puts all three entities in the same group, from which it then chooses only one ‘preferred entity’. ‘Headache dizziness’ is dispreferred as it doesn’t have a mapping, but then ‘dizziness’ is chosen over ‘headache’, and ‘headache’ is dropped (even though the two don’t overlap).
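To make the grouping behaviour concrete, here is a small sketch of how grouping by connected overlaps ends up with all three entities in one group. `Ent` and `group_by_location` are hypothetical stand-ins written for this illustration. They are not kazu's actual `group_entities_by_location` implementation, and the offsets assume the text is exactly 'headache dizziness'.

```python
from typing import List, NamedTuple


class Ent(NamedTuple):
    match: str
    start: int    # inclusive character offset
    end: int      # exclusive character offset
    mapped: bool  # whether the entity has a mapping


def group_by_location(ents: List[Ent]) -> List[List[Ent]]:
    """Group entities whose spans are connected through a chain of overlaps."""
    groups: List[List[Ent]] = []
    group_end = -1
    for ent in sorted(ents, key=lambda e: (e.start, e.end)):
        if groups and ent.start < group_end:
            # Overlaps the running extent of the current group, so join it.
            groups[-1].append(ent)
            group_end = max(group_end, ent.end)
        else:
            groups.append([ent])
            group_end = ent.end
    return groups


ents = [
    Ent("headache", 0, 8, True),
    Ent("dizziness", 9, 18, True),
    Ent("headache dizziness", 0, 18, False),
]

# The unmapped 'headache dizziness' bridges the other two into one group.
all_groups = group_by_location(ents)

# Without the unmapped entity, the remaining two fall into separate groups.
mapped_groups = group_by_location([e for e in ents if e.mapped])
```

In this sketch, picking one entity per group from `all_groups` necessarily drops either 'headache' or 'dizziness', whereas regrouping after discarding the dispreferred unmapped entity would keep both. Whether regrouping is the right fix for kazu is an open question. It is shown here only to illustrate the failure mode.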
