We often need our text search to be agnostic of accent marks. Accent-insensitive search, also called diacritics-agnostic search, is where search results are the same for queries that may or may not contain Latin characters such as à, è, Ê, ñ, and ç. Diacritics are English letters with an accent to mark a difference in pronunciation. In recent years, words with diacritics have trickled into the mainstream English language, such as café or protégé. Well, touché! OpenSearch has the answer!
OpenSearch is a scalable, flexible, and extensible open-source software suite for your search workload. OpenSearch can be deployed in three different modes: the self-managed open-source OpenSearch, the managed Amazon OpenSearch Service, and Amazon OpenSearch Serverless. All three deployment modes are powered by Apache Lucene, and offer text analytics using the Lucene analyzers.
In this post, we demonstrate how to perform accent-insensitive search using OpenSearch to handle diacritics.
Lucene Analyzers are Java libraries that are used to analyze text while indexing and searching documents. These analyzers consist of tokenizers and filters. The tokenizers split the incoming text into one or more tokens, and the filters are used to transform the tokens by modifying or removing the unnecessary characters.
OpenSearch supports custom analyzers, which enable you to configure different combinations of tokenizers and filters. It can consist of character filters, tokenizers, and token filters. In order to enable our diacritic-insensitive search, we configure custom analyzers that use the ASCII folding token filter.
ASCIIFolding is a method used to covert alphabetic, numeric, and symbolic Unicode characters that aren’t in the first 127 ASCII characters (the Basic Latin Unicode block) into their ASCII equivalents, if one exists. For example, the filter changes “à” to “a”. This allows search engines to return results agnostic of the accent.
In this post, we configure accent-insensitive search using the ASCIIFolding filter supported in OpenSearch Service. We ingest a set of European movie names with diacritics and verify search results with and without the diacritics.
Create an index with a custom analyzer
We first create the index
asciifold_movies with custom analyzer
Ingest sample data
Next, we ingest sample data with Latin characters into the index
Query the index
Now we query the
asciifold_movies index for words with and without Latin characters.
Our first query uses an accented character:
Our second query uses a spelling of the same word without the accent mark:
In the preceding queries, the search terms “fête” and “fete” return the same results:
Similarly, try comparing results for “señora” and “senora” or “sorcière” and “sorciere.” The accent-insensitive results are due to the ASCIIFolding filter used with the custom analyzers.
Enable aggregations for fields with accents
Now that we have enabled accent-insensitive search, let’s look at how we can make aggregations work with accents.
Try the following query on the index:
We get the following response:
Create accent-insensitive aggregations using a normalizer
In the previous example, the aggregation returns two different buckets, one for “Jour de fête” and one for “Jour de fete.” We can enable aggregations to create one bucket for the field, regardless of the diacritics. This is achieved using the normalizer filter.
The normalizer supports a subset of character and token filters. Using just the defaults, the normalizer filter is a simple way to standardize Unicode text in a language-independent way for search, thereby standardizing different forms of the same character in Unicode and allowing diacritic-agnostic aggregations.
Let’s modify the index mapping to include the normalizer. Delete the previous index, then create a new index with the following mapping and ingest the same dataset:
After you ingest the same dataset, try the following query:
We get the following results:
Now we compare the results, and we can see the aggregations with term “Jour de fête” and “Jour de fete” are rolled up into one bucket with doc_count=2.
In this post, we showed how to enable accent-insensitive search and aggregations by designing the index mapping to do ASCII folding for search tokens and normalize the keyword field for aggregations. You can use the OpenSearch query DSL to implement a range of search features, providing a flexible foundation for structured and unstructured search applications. The Open Source OpenSearch community has also extended the product to enable support for natural language processing, machine learning algorithms, custom dictionaries, and a wide variety of other plugins.
About the Author
Aruna Govindaraju is an Amazon OpenSearch Specialist Solutions Architect and has worked with many commercial and open-source search engines. She is passionate about search, relevancy, and user experience. Her expertise with correlating end-user signals with search engine behavior has helped many customers improve their search experience. Her favorite pastime is hiking the New England trails and mountains.