Better merging of 'staggered' overlapping entities #35

Open
RichJackson opened this issue Jun 10, 2024 · 0 comments

Original comments from @EFord36

For a string:

Acute lymphoblastic leukemia, Acute myeloid leukemia, Chronic myelogenous leukemia, Chronic lymphocytic leukemia

if we have synonyms for all the actual items in the list, but also:

lymphoblastic leukemia, Acute
myeloid leukemia, Chronic
myelogenous leukemia, Chronic

The MergeOverlappingEntsStep currently says 'all of these are overlapping, I'm only picking one' and then picks the longest (everything else being equal).

Instead, we could do something cleverer: rather than sorting the individual entities by length and taking the top one, we would rank the candidate sets of non-overlapping entities and take the set that covers the widest span of text within the total overlapping span.

My 'cleverer' suggestion above is a version of Weighted Single-Interval Scheduling Maximisation, which is easy to implement naively with O(n^2) complexity. Wikipedia has pseudocode for an O(n) implementation (for intervals already sorted by finish time, and without keeping track of the intervals chosen): https://en.wikipedia.org/wiki/Interval_scheduling#Weighted
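To make the idea concrete, here is a minimal sketch of weighted interval scheduling that also records which intervals were chosen. It is not Kazu code: `Span` and `best_non_overlapping` are hypothetical names, offsets are assumed to be end-exclusive, and the weight is assumed to be plain span length, which sidesteps the question of how real weights should be chosen.

```python
from bisect import bisect_right
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for an entity: character offsets plus matched text."""
    start: int  # inclusive
    end: int    # exclusive
    label: str


def best_non_overlapping(spans: List[Span]) -> List[Span]:
    """Return the non-overlapping subset with the greatest total weight.

    Assumption: weight == span length. Runs in O(n log n) because of the
    sort and the binary searches.
    """
    ordered = sorted(spans, key=lambda s: s.end)
    ends = [s.end for s in ordered]
    n = len(ordered)
    best = [0] * (n + 1)       # best[i]: best total weight using the first i spans
    take = [False] * (n + 1)   # take[i]: whether span i is part of best[i]
    prev = [0] * (n + 1)       # prev[i]: how many earlier spans end before span i starts
    for i, span in enumerate(ordered, start=1):
        # Count the earlier spans whose end is <= this span's start.
        prev[i] = bisect_right(ends, span.start, 0, i - 1)
        with_span = best[prev[i]] + (span.end - span.start)
        if with_span > best[i - 1]:
            best[i], take[i] = with_span, True
        else:
            best[i] = best[i - 1]
    # Walk back through the table to recover the chosen spans.
    chosen = []
    i = n
    while i > 0:
        if take[i]:
            chosen.append(ordered[i - 1])
            i = prev[i]
        else:
            i -= 1
    chosen.reverse()
    return chosen


text = ("Acute lymphoblastic leukemia, Acute myeloid leukemia, "
        "Chronic myelogenous leukemia, Chronic lymphocytic leukemia")
phrases = [
    "Acute lymphoblastic leukemia", "lymphoblastic leukemia, Acute",
    "Acute myeloid leukemia", "myeloid leukemia, Chronic",
    "Chronic myelogenous leukemia", "myelogenous leukemia, Chronic",
    "Chronic lymphocytic leukemia",
]
spans = [Span(text.index(p), text.index(p) + len(p), p) for p in phrases]
chosen = best_non_overlapping(spans)
print([s.label for s in chosen])
```

On the leukemia string, with length as the weight, the four real list items (106 characters in total) beat every combination that includes a staggered synonym, so all four are kept instead of a single longest entity.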

One concern with this approach is coming up with the weights - we risk introducing a bunch of 'magic numbers' for how we weight entities with mappings vs. the length of the span vs. ent_class_preferred_order vs. the final entity class name choice.

For this work, we probably want to capture a test set of 'interesting merge cases'. We could get these by running kazu over the set of full-text articles I used for performance benchmarking, with an if clause added to MergeOverlappingEntsStep that writes out the cases where it has to make a decision and the group is not just a simple single set of mutually overlapping entities.
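As a rough illustration of what that if clause could test, the sketch below flags groups that are not a single set of mutually overlapping entities. The names `Span`, `is_mutually_overlapping` and `collect_interesting_cases` are hypothetical, and entities are reduced to end-exclusive `(start, end)` tuples; the real step works on Kazu entity objects and would need adapting.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def is_mutually_overlapping(spans: Sequence[Span]) -> bool:
    """True if every span overlaps every other span in the group.

    For intervals on a line, all pairs overlap exactly when the spans share
    a common point, i.e. the latest start is before the earliest end.
    """
    if not spans:
        return True
    return max(start for start, _ in spans) < min(end for _, end in spans)


def collect_interesting_cases(groups: Sequence[Sequence[Span]]) -> List[List[Span]]:
    """Keep only groups where choosing a single entity could drop a
    non-overlapping neighbour."""
    return [
        list(group)
        for group in groups
        if len(group) > 1 and not is_mutually_overlapping(group)
    ]


groups = [
    [(0, 8), (0, 18), (9, 18)],  # staggered: (0, 8) and (9, 18) do not overlap
    [(0, 5), (2, 5)],            # simple: everything overlaps everything
]
interesting = collect_interesting_cases(groups)
print(interesting)
```

Only the first group would be written out, since it is the one where the current 'pick one' behaviour can lose an entity.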

Another difficulty here is how we handle non-contiguous entities - we can probably keep handling them as we currently do for now, but it isn't especially principled.

Another case where this causes a bug:

'headache dizziness' in the current meddra model pack

  • ‘Headache’ and ‘dizziness’ each get an exact match from explosion
  • Transformers produces a ‘headache dizziness’ match that doesn’t get mapped, but it is not filtered out by the cleanup step, because the meddra model pack only filters out unmapped entities if they come from the explosion step
  • The merge step does group_entities_by_location and puts all three entities in the same group, from which it then chooses only one ‘preferred entity’. ‘Headache dizziness’ is dispreferred as it doesn’t have a mapping, but then ‘dizziness’ is chosen over ‘headache’, and ‘headache’ is dropped (even though the two don’t overlap)
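The grouping problem in this case can be reproduced with a small sketch. It assumes that grouping by location amounts to chaining together any spans that overlap; the real group_entities_by_location may be implemented differently, and `group_by_location` and `overlaps` below are hypothetical helpers using end-exclusive offsets.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end exclusive, producing step)


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def group_by_location(spans: List[Span]) -> List[List[Span]]:
    """Chain spans into groups: a span joins the current group if it starts
    before the furthest end seen so far in that group."""
    groups: List[List[Span]] = []
    current_end = -1
    for span in sorted(spans, key=lambda s: (s[0], s[1])):
        if groups and span[0] < current_end:
            groups[-1].append(span)
            current_end = max(current_end, span[1])
        else:
            groups.append([span])
            current_end = span[1]
    return groups


# Offsets within the string 'headache dizziness'
headache = (0, 8, "explosion")
dizziness = (9, 18, "explosion")
combined = (0, 18, "transformers")  # unmapped 'headache dizziness'

groups = group_by_location([headache, dizziness, combined])
print(len(groups), len(groups[0]))
print(overlaps(headache, dizziness))
```

All three spans land in one group because the unmapped ‘headache dizziness’ span bridges the other two, even though ‘headache’ and ‘dizziness’ never overlap each other. Picking one winner per group is what drops ‘headache’.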