Configurable preprocessing for queries #3193

Closed
lonvia opened this issue Sep 7, 2023 · 0 comments · Fixed by #3610

lonvia commented Sep 7, 2023

There have now been a few cases where it would be useful to apply additional processing to an incoming query before it is sent to the tokenizer. This would make it possible to add custom filters for nonsense queries and to experiment with NLP preprocessing, and it is needed for the splitting of Japanese queries as proposed in #3158.

This should work very much like the sanitizers used during import: the ICU tokenizer allows specifying a list of modules with preprocessing functions, which are run in sequence over the incoming query.

Configuration

The YAML configuration should look much like that of the sanitizers: the step key names the module to use and any further keys configure that module.

Example:

query-preprocessing:
  - step: clean-by-pattern
    pattern: \d+\.\d+\.\d+\.\d+
  - step: normalize
  - step: split-key-japanese-phrases

This would execute three preprocessing modules: clean_by_pattern, normalize and split_key_japanese_phrases. The normalize step would run the normalization rules over the query. This is currently hard-coded in the ICU tokenizer. However, conceptually it is just another preprocessing step, so we might as well make it explicit. It also means that users can choose whether a preprocessing step runs on the original input or on the normalized text. This might already be relevant for Japanese key splitting: normalization includes rules to change from simplified to traditional Chinese characters. This loses valuable information, because simplified Chinese characters are a clear sign that the input is not Japanese.
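As a rough sketch of how such a rule list might be interpreted, the snippet below maps each step key to a Python module name and separates out the remaining keys as that step's configuration. The `RULES` list stands in for the parsed YAML, and `split_rule` is a hypothetical helper, not an existing Nominatim function.

```python
from typing import Any, Dict, List, Tuple

# Stand-in for the parsed 'query-preprocessing' section (assumption:
# the YAML loader yields a list of dicts, one per step).
RULES: List[Dict[str, Any]] = [
    {'step': 'clean-by-pattern', 'pattern': r'\d+\.\d+\.\d+\.\d+'},
    {'step': 'normalize'},
    {'step': 'split-key-japanese-phrases'},
]


def split_rule(rule: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Return (module name, step configuration) for one rule.

    Hypothetical helper: dashes in the step name become underscores
    so that the result is a valid Python module name.
    """
    if 'step' not in rule:
        raise ValueError("Preprocessing rule is missing the 'step' key.")
    module_name = rule['step'].replace('-', '_')
    config = {key: value for key, value in rule.items() if key != 'step'}
    return module_name, config


steps = [split_rule(rule) for rule in RULES]
```

The order of `steps` follows the order in the YAML file, which is what lets a user decide whether a step sees the original or the normalized input.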

Preprocessing modules

The preprocessing modules should go into nominatim/tokenizer/query_preprocessing. Most of this should work exactly like the sanitizer, see base.py.

Each module needs to export a create function that creates the preprocessor:

def create(config: QueryConfig) -> Callable[[QueryInfo], None]: pass

QueryConfig can be an alias to dict for the moment. We might want to add additional convenience functions as in SanitizerConfig later.

QueryInfo should have a single field: a List[Phrase], which the preprocessor function may mutate. The indirection via a QueryInfo class allows us to later add more functionality to the preprocessing without breaking existing code.
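A minimal sketch of what one such module could look like, under several assumptions: `Phrase` here is a simplified stand-in for Nominatim's own type, the field name `phrases` is invented for illustration, and the behaviour chosen for the pattern step (dropping phrases that match the pattern entirely) is only one plausible reading of "clean by pattern".

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Phrase:
    """Simplified stand-in for Nominatim's Phrase (assumed fields)."""
    ptype: str
    text: str


# For the moment just an alias to dict, as suggested above.
QueryConfig = Dict[str, Any]


@dataclass
class QueryInfo:
    """Container handed to each preprocessor; 'phrases' is a hypothetical name."""
    phrases: List[Phrase]


def create(config: QueryConfig) -> Callable[[QueryInfo], None]:
    """Create a preprocessor that drops phrases fully matching 'pattern'."""
    pattern = re.compile(config['pattern'])

    def _process(query: QueryInfo) -> None:
        # Mutate the list in place instead of returning a new one.
        query.phrases[:] = [p for p in query.phrases
                            if not pattern.fullmatch(p.text)]

    return _process
```

Compiling the pattern inside `create` rather than inside `_process` means the expensive setup happens once, when the chain is built, and not on every query.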

Loading the preprocessors

It is important that the preprocessor chain is loaded only once and then cached. The setup function is the right place to do that. self.conn.get_cached_value makes sure that a setup function like _make_transliterator is only executed once. The equivalent code for setting up the Sanitizer chain is at https://github.com/osm-search/Nominatim/blob/master/nominatim/tokenizer/place_sanitizer.py#L28

The tricky part is getting the information from the yaml configuration. This needs access to the Configuration object, which is not available here. We should add this as a property to the SearchConnection class. It can be easily added from self.config when it is created here. Once this is done, something along the lines of self.conn.config.load_sub_configuration('icu_tokenizer.yaml', config='TOKENIZER_CONFIG')['query-preprocessing'] should do the trick.
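The load-once behaviour can be illustrated with the sketch below. Everything in it is a stand-in: `CachingConnection` only imitates the idea of a per-connection cache (the real `get_cached_value` may be asynchronous and have a different signature), the cache keys are arbitrary, and `REGISTRY` replaces the module import that a real implementation would presumably use to find each `create` function.

```python
from typing import Any, Callable, Dict, List, Tuple

Preprocessor = Callable[[List[str]], None]


class CachingConnection:
    """Hypothetical stand-in for a connection with a value cache."""

    def __init__(self) -> None:
        self._cache: Dict[Tuple[str, str], Any] = {}

    def get_cached_value(self, group: str, name: str,
                         factory: Callable[[], Any]) -> Any:
        # Run the factory only on the first request for this key.
        key = (group, name)
        if key not in self._cache:
            self._cache[key] = factory()
        return self._cache[key]


def create_lowercase(config: Dict[str, Any]) -> Preprocessor:
    """Toy preprocessor factory; phrases are plain strings here."""
    def _process(phrases: List[str]) -> None:
        phrases[:] = [p.lower() for p in phrases]
    return _process


# Stand-in for importing one module per step name.
REGISTRY = {'lowercase': create_lowercase}
build_count = 0


def make_chain() -> List[Preprocessor]:
    """Build the chain from (here hard-coded) rules and count the builds."""
    global build_count
    build_count += 1
    rules = [{'step': 'lowercase'}]
    return [REGISTRY[rule['step'].replace('-', '_')](
                {k: v for k, v in rule.items() if k != 'step'})
            for rule in rules]


conn = CachingConnection()
first = conn.get_cached_value('ICUTOK', 'preprocessors', make_chain)
second = conn.get_cached_value('ICUTOK', 'preprocessors', make_chain)
```

Both lookups return the same list object and `make_chain` runs only once, which is the property the setup function needs.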

Using the preprocessors

This is mostly done in PR 3158 already. The only difference would be that the list of functions is not hardcoded anymore and that the phrases are mutated inside a QueryInfo object instead of returning the mutated phrase from the function.
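To make the calling convention concrete, here is a small sketch of running a chain where each function mutates the query object and returns nothing. `QueryInfo` is again a simplified stand-in that holds plain strings, and both preprocessors are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QueryInfo:
    """Simplified stand-in: phrases are plain strings here."""
    phrases: List[str]


def split_on_comma(query: QueryInfo) -> None:
    """Illustrative step: split every phrase at commas."""
    query.phrases = [part for p in query.phrases for part in p.split(',')]


def strip_whitespace(query: QueryInfo) -> None:
    """Illustrative step: trim phrases and drop empty ones."""
    query.phrases = [p.strip() for p in query.phrases if p.strip()]


def run_chain(chain: List[Callable[[QueryInfo], None]],
              query: QueryInfo) -> None:
    # Each step sees the result of the previous one via the shared object.
    for step in chain:
        step(query)


query = QueryInfo(['Hauptstr 5 , Berlin'])
run_chain([split_on_comma, strip_whitespace], query)
```

Because the steps communicate only through the `QueryInfo` object, new fields can be added to it later without changing the signature of existing preprocessors.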
