Search Indexing Performance

Topics: Customizing Orchard
Jan 13, 2015 at 2:51 AM
I've got some performance-related questions around search indexing. Perhaps I can start by explaining my use-case:

I'd like to index our Orchard user data to improve search performance. We've added the Contrib.Profile module to our site, which allows additional fields to be attached to the User content type. For example, we've attached FirstName, LastName, and Phone fields. These aren't easily searchable from the ContentManager (outside of using an elaborate HQL query). So, creating a searchable index seemed like the quicker/simpler solution.

Configuring an index around this data is easy enough. On closer inspection, though, it appears that the Orchard indexing background task runs continuously: it is triggered once a minute by default. I'm concerned about the performance implications of rebuilding the index so frequently. Our site currently has on the order of tens of thousands of users. I understand that the indexing background task runs on a separate thread, but should I be concerned about it hammering our web server?

Also, if performance is an issue, is there a way to search these Fields using an HQL query?
Jan 14, 2015 at 2:57 AM
Indeed, the background task is executed every minute. The call chain is:
UpdateIndex() on each index (IndexingBackgroundTask.cs)
=> UpdateIndexBatch() (UpdateIndexScheduler.cs)
=> BatchIndex() (IndexingTaskExecutor.cs)
Once you have created and built an index, each time you publish or unpublish an item, Orchard adds a task to the "Orchard_Indexing_IndexingTaskRecord" table. Each row in this table holds a reference to the content item, an Action (0 = update, 1 = remove from the index), and the Id of the task.

Orchard also stores and reuses some data in the "YourIndex.settings.xml" file, notably the "LastIndexedId" value. On each update, Orchard can therefore query only the tasks that have not yet been executed (Id > LastIndexedId). So the background task only checks the database and finds nothing to do, until you publish or unpublish some items. Then it updates only those items.

This is done in the BatchIndex() method. In the code path executed when the indexing mode is "IndexingMode.Update", there is this line:
.Table.Where(x => x.Id > indexSettings.LastIndexedId)
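
To make the "only tasks newer than LastIndexedId" idea concrete, here is a minimal, self-contained sketch. It is not Orchard's code: TaskRecord, IndexSettingsSketch and PendingTasks are hypothetical stand-ins for the task table, the settings file and the query above, and the ordering by Id is my own assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a row of Orchard_Indexing_IndexingTaskRecord.
public class TaskRecord
{
    public int Id { get; set; }
    public int ContentItemId { get; set; }
    public int Action { get; set; } // 0 = update, 1 = remove from the index
}

// Hypothetical stand-in for the values kept in YourIndex.settings.xml.
public class IndexSettingsSketch
{
    public int LastIndexedId { get; set; }
}

public static class IncrementalIndexSketch
{
    // Returns only the tasks created after the last one already processed.
    public static List<TaskRecord> PendingTasks(
        IEnumerable<TaskRecord> table, IndexSettingsSketch settings)
    {
        return table.Where(x => x.Id > settings.LastIndexedId)
                    .OrderBy(x => x.Id) // assumption: process oldest first
                    .ToList();
    }

    public static void Main()
    {
        var table = new List<TaskRecord>
        {
            new TaskRecord { Id = 1, ContentItemId = 10, Action = 0 },
            new TaskRecord { Id = 2, ContentItemId = 11, Action = 0 },
            new TaskRecord { Id = 3, ContentItemId = 10, Action = 1 },
        };
        var settings = new IndexSettingsSketch { LastIndexedId = 2 };

        var pending = PendingTasks(table, settings);
        Console.WriteLine(pending.Count); // prints 1 (only task 3 is new)

        // After processing, advance the marker so the next run finds nothing.
        if (pending.Count > 0)
            settings.LastIndexedId = pending[pending.Count - 1].Id;
        Console.WriteLine(PendingTasks(table, settings).Count); // prints 0
    }
}
```

With nothing published since the last run, the query returns an empty list, which is why the once-a-minute check is cheap.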
If there are many items, some resources are freed (a new transaction is requested) every 50 items (ContentItemsPerLoop). You can see this batch size used in, for example, this line:
.Take(ContentItemsPerLoop) // = 50 items
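
As a rough illustration of the batching, the sketch below works through a backlog in chunks of 50. It only mimics the pattern: ProcessInBatches and the indexOne callback are hypothetical names, plain integers stand in for task records, and the comment marks where Orchard would request a new transaction.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BatchLoopSketch
{
    // Same batch size as quoted above for ContentItemsPerLoop.
    const int ContentItemsPerLoop = 50;

    // Processes all task ids greater than lastIndexedId in batches.
    // Returns the number of batches that were needed.
    public static int ProcessInBatches(
        IReadOnlyList<int> taskIds, ref int lastIndexedId, Action<int> indexOne)
    {
        int batches = 0;
        while (true)
        {
            int watermark = lastIndexedId; // lambdas cannot capture ref parameters
            var batch = taskIds.Where(id => id > watermark)
                               .OrderBy(id => id)
                               .Take(ContentItemsPerLoop)
                               .ToList();
            if (batch.Count == 0) break;

            foreach (var id in batch) indexOne(id);

            // Remember how far we got, then start the next batch.
            // (In Orchard this is roughly where a new transaction is requested.)
            lastIndexedId = batch[batch.Count - 1];
            batches++;
        }
        return batches;
    }

    public static void Main()
    {
        var ids = Enumerable.Range(1, 120).ToList();
        int last = 0;
        int indexed = 0;
        int batches = ProcessInBatches(ids, ref last, _ => indexed++);
        // prints "120 items in 3 batches, LastIndexedId = 120"
        Console.WriteLine($"{indexed} items in {batches} batches, LastIndexedId = {last}");
    }
}
```

Calling it again with the same ids and the updated marker does no work at all, which matches the "only if needed" behaviour.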
So, the index isn't rebuilt every minute. It is only updated every minute, and only if needed. The only concern is if you already have a lot of items before you start using your index, or after an import of many items, or if you rebuild your index from the dashboard. The first index update or rebuild can take a long time, but this happens only once. I don't know how long it can take; I have never indexed tens of thousands of content items.