Elasticsearch: a flexible search engine
What is Umbraco Examine?
Examine is an abstraction of Lucene.Net and is used by Umbraco to index and search for Umbraco content. The latest version of Examine uses Lucene.Net 3.0.3, which was released in 2012. The latest version of Lucene.Net is 4.8.
Working with Examine on more demanding websites raises a number of problems. For example, the .NET port of Lucene lags well behind the original Java version: Lucene.Net is at 4.8, while Java Lucene is at 8.2.0.
The latest version of Lucene.Net brings many performance improvements and better multilingual support, in particular for the CJK languages (Chinese, Japanese and Korean), where it offers morphological analysis rather than standard tokenization.
Examine is also a problematic provider when the website runs in a load-balanced, high-availability configuration: because Examine is file-based, you can run into file-locking problems.
For Umbraco v7 and v8 there is an Azure Blob Storage directory for Examine, but it is still experimental.
UPDATE: The Examine.AzureDirectory package compatible with v7 was released on February 11, 2020. The v8 package is still experimental.
One of the strengths of Examine is that it makes indexing and searching with Lucene much easier than working with raw Lucene. This ease, however, comes at the cost of tight coupling between Examine and Lucene. When Lucene is updated, Examine cannot simply pick up the new version; it requires a lot of changes in Examine itself.
We want the flexibility of Examine but we also want the latest features of the latest version of Lucene.
What is Elasticsearch and why is it used?
Elasticsearch is a high-performance search provider that supports replication and index rebuilding without downtime. It handles CJK well and supports many more languages. It was designed as a search engine for all types of data (text, numerical, geographical, structured and unstructured). Elastic also supports many different types of queries, from Lucene to SQL queries.
In addition, unlike Examine, Elastic provides great development tools, such as Kibana, that allow you to simulate queries, debug them and analyze indexes. Elastic is also designed for high availability, which means load balancing, replication, reindexing without downtime, and more.
The story behind the project
The idea of building an Examine Elastic provider came about when I saw a presentation about searching in Umbraco v7 at the 2018 Umbraco Poland Festival, where Ismail Mayat presented a POC of an indexer for v7 that used content tracking. After the presentation, I started looking for a better way to do it, without using external processes to index the content.
I found some useful sources. The first was a package called 'Umbraco.Elasticsearch', created by Phil Oyston, and the second an article called 'Elasticating Examine - an experimental Examine provider', written by Tristan Thompson. Neither solution satisfied me: the first built unnecessary logic around Umbraco and reimplemented the Examine provider, and the second was only for custom indexes and required some changes to work with Umbraco.
The V7 package
For Umbraco v7, I reviewed all the available Elastic packages and decided not to use them as a base for my project. In my opinion, their approach was not a good one: they reimplemented indexing, management and other features, whereas I only wanted to replace what Umbraco already had instead of keeping indexes in two places (the Umbraco Examine files and the Elastic instance).
At the time, the solution proposed by Tristan Thompson was closer to my idea, since it only created a translation layer between Examine and Elastic. I decided to continue my work based on what was already working in that experimental provider.
One of the first changes I made was to upgrade the Elastic version to 6.5 and start working on indexing all types of content, such as Media, Content and Members. At that point everything was working, and I decided to start replacing the internal index with Elastic. Here I found some small problems:
- Because only published versions of the content were indexed, search results did not always reflect the actual content, which also included unpublished items.
- The index could not show its health status in the Umbraco Backoffice.
- All document properties were moved to the properties object and, because of that operation, all properties had to use a prefix in the name of the index fields.
- If an implementation were based entirely on NEST, it would not be compatible with Umbraco Lucene queries.
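The prefix issue from the third point can be illustrated with a minimal sketch. It assumes a document shape in which system fields stay at the top level and every custom property is nested under a `properties` object, as described above. The helper names (`to_index_document`, `field_path`) and the system field names are hypothetical and are not taken from the package.

```python
# Hypothetical system fields kept at the top level of the indexed document.
SYSTEM_FIELDS = ("id", "nodeName")


def to_index_document(node_id, node_name, properties):
    """Nest all custom properties under a single `properties` object."""
    return {"id": node_id, "nodeName": node_name, "properties": dict(properties)}


def field_path(alias):
    """Return the field name a query must use for a given property alias."""
    # Custom properties live under `properties`, so they need the prefix.
    return alias if alias in SYSTEM_FIELDS else f"properties.{alias}"


doc = to_index_document(1056, "Home", {"title": "Welcome"})
print(doc)
print(field_path("title"))
print(field_path("nodeName"))
```

Any code that builds queries from plain property aliases has to apply a mapping like this, which is why the nesting affects every field name used in a search.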
I started by solving the problem with NEST/Lucene queries, where I decided to expose two ways of querying:
Snippet 1 Search Methods
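The original snippet is not available in this text. As a rough illustration only, and not the package's actual code, the two query styles can be sketched as plain Elasticsearch request bodies: one passes a raw Lucene query string through the `query_string` query, the other builds a structured `bool` query of the kind a client such as NEST generates. Python is used here purely for brevity; the function names and the field names (`nodeName`, `nodeTypeAlias`) are assumptions.

```python
import json


def lucene_style_query(raw_query, default_field="nodeName"):
    """Wrap a raw Lucene query string in Elasticsearch's `query_string` query."""
    return {
        "query": {
            "query_string": {
                "query": raw_query,
                "default_field": default_field,
            }
        }
    }


def structured_query(field, text, node_type):
    """Express a search as an explicit bool query with a match and a term filter."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: text}}],
                "filter": [{"term": {"nodeTypeAlias": node_type}}],
            }
        }
    }


# Both bodies can be sent to the same search endpoint of an index.
print(json.dumps(lucene_style_query("title:elastic AND body:search"), indent=2))
print(json.dumps(structured_query("title", "elastic", "blogPost"), indent=2))
```

Supporting the first form keeps existing Lucene-syntax queries usable, while the second gives access to the full query DSL.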
Why did I abandon that project?
- Umbraco v7 was not designed to support custom providers, and in that version you could only make them work in a hacky way, such as reimplementing search logic in the Backoffice.
- Any update to Umbraco could break package compatibility, since the package relied on Umbraco's core functionality, so any change there could break the provider.
- There was no abstraction in Examine that would make it easier to maintain the package with Umbraco v7.
- Implementing ISearchableTree was not something I would use to replace the Backoffice search providers, since all the search logic has to be reimplemented there, and I don't think this is a good way to do it. I think ISearchableTree should only be used for customized Backoffice search, not as a replacement for the basic one.
Umbraco V8
After leaving the project I talked to some people about how great it would be to have a better way to do this in Umbraco, and how I was even looking through the Umbraco source code to see how the hard-coded parts could be made more abstract and extensible.
At the time, I was talking to Shannon from Umbraco HQ, who suggested that a new version of Umbraco would include changes that would allow me to continue my project. I decided to base this on the Azure Search code, as it was an example of how to use the new abstraction layer in Umbraco.
Where is my configuration file?
Umbraco v8 changed the Examine configuration from configuration files to code.
Instead of the old-fashioned configuration files (which I prefer), we have a brilliant new programmatic way to change Examine settings.
There were some discussions about the pros and cons of that solution, but at the time I started working on the provider, there was still no documented way to change anything in the Examine configuration. I was following suggestions from people on GitHub, but several times I had to spend time reverse engineering how Umbraco handled dependency injection, composition and the configuration of Examine.
How to change indexes across components
Since the new version changed the way settings are handled in Umbraco, I had to figure out how to handle the switch from Examine to Elastic in the source code. First, I had to figure out how to disable the Examine component, which attaches the basic indexes. Since Umbraco was using Umbraco Index, which inherits from LuceneIndex, I also had to reimplement the Content Populators, which fill all the indexes based on ElasticSearchIndex with content.
Snippet 2 Register custom provider
Stay close to Umbraco and emulate events
Umbraco attaches some default events to the indexes, and to stay compatible with most of the Umbraco code I have to emulate the Lucene fields even though I'm not using them at all. Lucene fields are not convenient to work with because, unlike most objects, they don't carry the actual data type. Since you only have a string to work with, you end up converting every type you need to a string dynamically whenever someone reads it from the document dictionary:
Snippet 3 Emulating two methods from the field list
As shown in the third example, I'm emulating two methods from the current list of fields in Document from Lucene.Net.
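The idea of emulating string-only fields can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the class and method names (`StringField`, `EmulatedDocument`, `get_field`, `get_fields`) are hypothetical, loosely modeled on the field accessors of a Lucene document, and the string formats chosen for booleans and dates are arbitrary.

```python
from datetime import datetime, timezone


class StringField:
    """A Lucene-like field: a name plus a value that is always a string."""

    def __init__(self, name, value):
        self.name = name
        self.string_value = value


def to_string(value):
    """Convert a typed value to the string form a field would expose."""
    # bool is checked before the generic case because bool is a subclass of int.
    if isinstance(value, bool):
        return "1" if value else "0"
    if isinstance(value, datetime):
        return value.isoformat()
    return str(value)


class EmulatedDocument:
    """Wraps a typed source dictionary and exposes it as string fields."""

    def __init__(self, source):
        self._fields = [StringField(k, to_string(v)) for k, v in source.items()]

    def get_field(self, name):
        # Return the first field with that name, or None if it is missing.
        return next((f for f in self._fields if f.name == name), None)

    def get_fields(self):
        # Return a copy so callers cannot mutate the internal list.
        return list(self._fields)


doc = EmulatedDocument({
    "id": 1056,
    "published": True,
    "updated": datetime(2020, 2, 11, tzinfo=timezone.utc),
})
for field in doc.get_fields():
    print(field.name, "=", field.string_value)
```

The conversion happens once per field when the document is wrapped, so callers that expect Lucene-style string values never see the original types.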
Plans for the future
Since I want to work with Elastic on every possible project, I think that as part of future changes I will focus on offering the best developer experience, and I will try to reimplement everything possible to support NEST queries instead of Lucene queries within my providers.
I'm also planning to research the Umbraco source code and propose changes that allow developers to use better abstraction in the Umbraco core.
You can find the package via the following link.