Orchard Search module on Azure

Topics: Customizing Orchard, Installing Orchard
Apr 18, 2011 at 1:10 PM

Hi!

Does anyone know how to make the Orchard Search module work on Azure? It seems to store the indexes in App_Data, which will obviously fail on Azure.

Any help appreciated.

Thanks!

Coordinator
Apr 18, 2011 at 7:24 PM

It actually works on Azure because each instance has its own indexing background thread and its own index. This is extensible, though: one implementation could provide a cached, Blob-based Lucene Directory implementation. Microsoft Research did this a few years ago, and you can still find some articles about it on the net.
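Orchard doesn't ship such a Directory, but the caching idea could be sketched roughly like this in Python (the `BlobStore` class and the file-level API here are hypothetical stand-ins, not the Azure SDK or Lucene's actual abstraction):

```python
import os
import tempfile

class BlobStore:
    """Stand-in for a remote blob container (hypothetical; a real
    implementation would call the Azure Storage APIs)."""
    def __init__(self):
        self._blobs = {}

    def upload(self, name, data):
        self._blobs[name] = data

    def download(self, name):
        return self._blobs[name]


class CachedBlobDirectory:
    """Sketch of a Lucene-style Directory that mirrors index files to a
    shared store while serving reads from a local cache."""
    def __init__(self, store, cache_dir):
        self.store = store
        self.cache_dir = cache_dir

    def write_file(self, name, data):
        # Write locally first, then push to the shared store.
        with open(os.path.join(self.cache_dir, name), "wb") as f:
            f.write(data)
        self.store.upload(name, data)

    def read_file(self, name):
        path = os.path.join(self.cache_dir, name)
        if not os.path.exists(path):
            # Cache miss: pull the file from the shared store once.
            data = self.store.download(name)
            with open(path, "wb") as f:
                f.write(data)
        with open(path, "rb") as f:
            return f.read()


# Instance A writes a segment file; instance B sees it through the store.
store = BlobStore()
node_a = CachedBlobDirectory(store, tempfile.mkdtemp())
node_a.write_file("_0.cfs", b"segment data")
node_b = CachedBlobDirectory(store, tempfile.mkdtemp())
print(node_b.read_file("_0.cfs"))
```

Each instance would read through its own cache, while writes funnel to the shared store, which is what makes the index visible across instances.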

Dec 2, 2011 at 2:52 PM

Hello:

So it sounds like Orchard is using Lucene to create a local index per web instance? Just wanted to clarify that. So, if I spin up 2 instances (which I am about to do), does Orchard attach a writable Azure drive (from the local web instance file system) and create the full index per instance?

Yeah, some form of single, shared implementation needs to happen.

Lucene has the limitation that only one IndexWriter can have write access to an index at a time. Meaning, a pool of Lucene indexers will not scale on Azure.

Azure has the limitation that a CloudDrive can only be mounted as writable by one Web/Worker Role instance at a time; all other instances get read-only access.

This is how I have set up Lucene on Azure in the past, but as you can see it doesn't scale well. I ended up having to split the index into multiple smaller indexes to keep performance up.
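For reference, Lucene enforces the single-writer rule with an exclusive lock file (`write.lock`) in the index directory. A minimal Python sketch of that pattern, illustrative only and not Lucene's actual implementation:

```python
import os
import tempfile

class IndexWriteLock:
    """Sketch of Lucene-style single-writer enforcement: an exclusive
    lock file makes a second writer fail fast instead of corrupting
    the index. (Illustrative only; Lucene's real locking is richer.)"""
    def __init__(self, index_dir):
        self.lock_path = os.path.join(index_dir, "write.lock")
        self.held = False

    def acquire(self):
        try:
            # O_EXCL makes creation atomic: it fails if the file exists.
            fd = os.open(self.lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            self.held = True
            return True
        except FileExistsError:
            return False

    def release(self):
        if self.held:
            os.remove(self.lock_path)
            self.held = False


index_dir = tempfile.mkdtemp()
first, second = IndexWriteLock(index_dir), IndexWriteLock(index_dir)
print(first.acquire(), second.acquire())  # only the first writer wins
```

This is why pooling writers against one shared index doesn't work: the second writer is simply locked out until the first releases.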

I also set up Solr on Azure in a test bed. Though Solr runs on Tomcat/Apache, it still has the same limitation as Lucene: only one Web/Worker Role instance can have open write access to the underlying Lucene index.

I may take a stab at this, to see if there have been any updates on the Solr/Lucene front in the past two years (since I last used them).

Coordinator
Dec 2, 2011 at 6:40 PM

"Lucene has the limitation that only one IndexWriter can have write access to an index at a time. Meaning, a pool of Lucene indexers will not scale on Azure."

That's OK: each instance has only one writer per index, and one index per tenant.
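A rough Python sketch of that arrangement, with a hypothetical `open_writer` factory standing in for opening a real Lucene IndexWriter:

```python
import threading

class TenantIndexWriters:
    """Sketch of 'one index per tenant, one writer per index': writers
    are created lazily and shared, so one process never opens two
    writers against the same tenant's index. The open_writer factory
    is hypothetical (it would open a real Lucene IndexWriter)."""
    def __init__(self, open_writer):
        self._open_writer = open_writer
        self._writers = {}
        self._lock = threading.Lock()

    def writer_for(self, tenant):
        with self._lock:
            if tenant not in self._writers:
                self._writers[tenant] = self._open_writer(tenant)
            return self._writers[tenant]


opened = []
writers = TenantIndexWriters(lambda t: opened.append(t) or "writer:" + t)
w1 = writers.writer_for("default")
w2 = writers.writer_for("default")
print(w1 is w2, opened)  # the writer is reused, opened only once
```

Because each instance writes only to its own local per-tenant index, Lucene's single-writer rule is never violated.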

"Azure has the limitation that a CloudDrive can only be mounted as writable by one Web/Worker Role instance at a time; all other instances get read-only access."

We are not using CloudDrive but the local file system, and if an instance gets recycled or a new instance is spun up, the index is recovered by the background task on that instance.
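A minimal Python sketch of that recovery path, with `ContentStore` as a hypothetical stand-in for the shared database (Orchard's real schema and rebuild logic differ):

```python
class ContentStore:
    """Stand-in for the shared SQL database of content items
    (hypothetical; Orchard's real schema differs)."""
    def __init__(self):
        self.items = {}


def recover_index(local_index, store):
    """Sketch of the recovery described above: a fresh or recycled
    instance starts with an empty local index, and its background
    task rebuilds it from the shared content store."""
    if not local_index:  # nothing on local disk yet
        local_index.update(store.items)
    return local_index


store = ContentStore()
store.items = {1: "About page", 2: "Blog post"}
fresh_instance = recover_index({}, store)
print(sorted(fresh_instance))
```

The point is that the local disk is treated as a disposable cache: losing it only costs a rebuild, because the database remains the source of truth.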

Dec 3, 2011 at 2:14 AM
sebastienros wrote:

"Azure has the limitation that a CloudDrive can only be mounted as writable by one Web/Worker Role instance at a time; all other instances get read-only access."

We are not using CloudDrive but the local file system, and if an instance gets recycled or a new instance is spun up, the index is recovered by the background task on that instance.


So each web role instance has its own copy of the Lucene index, is that what you are saying?

Just need you to confirm this. That would seem to mean that the indexes can easily get out of sync if I have, say, 5 web instances running against a single SQL Server on Azure.

Coordinator
Dec 3, 2011 at 5:27 AM

They are out of sync, and they are not. When you create some content, the index is updated by a background task, so technically the index is out of sync even on the instance where you created the content. And that delay is the same as on the other nodes: on average about 30 seconds.

Dec 4, 2011 at 6:12 AM
sebastienros wrote:

They are out of sync, and they are not. When you create some content, the index is updated by a background task, so technically the index is out of sync even on the instance where you created the content. And that delay is the same as on the other nodes: on average about 30 seconds.


Perhaps you can point me to the Azure implementation of this, or to the class that handles it.

I understand that a background task (IScheduledTask, I'm assuming) updates the Lucene index that is local to that instance.

But how are the other web role instances notified that a piece of data has been updated, and that their background task, local to that web role instance, needs to reindex that one item?

That's what I see missing here. I don't see how an item can be marked dirty in the DB for all web role instances to pick up and reindex. Maybe if there were a timestamp on each entity indicating it has been modified. But querying for all pieces of content and comparing each timestamp against what Lucene has seems quite excessive, and I don't see anything like that going on anyway.

So, how are the other nodes / web role instances notified that a piece of content has been updated?

Again, feel free to point me to some code.  I just don't see it.

Coordinator
Dec 4, 2011 at 6:41 AM

There is a table that keeps track of any content that needs to be indexed, and each instance keeps track of the latest entry it has indexed. Look at the Orchard.Indexing module; it's the same on-premises and on Azure.
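A rough Python sketch of that pattern (the names and schema here are hypothetical; the real logic lives in the Orchard.Indexing module):

```python
import itertools

class IndexingTaskTable:
    """Sketch of the shared tracking table: every content change
    appends a row with an ascending task id (hypothetical schema;
    see the Orchard.Indexing module for the real one)."""
    def __init__(self):
        self._ids = itertools.count(1)
        self.rows = []

    def record_change(self, content_id):
        self.rows.append((next(self._ids), content_id))


class InstanceIndexer:
    """Each instance remembers the last task id it indexed; each
    background sweep processes only the rows beyond that cursor."""
    def __init__(self, table):
        self.table = table
        self.last_task_id = 0
        self.indexed = set()

    def sweep(self):
        for task_id, content_id in self.table.rows:
            if task_id > self.last_task_id:
                self.indexed.add(content_id)
                self.last_task_id = task_id


table = IndexingTaskTable()
node_a, node_b = InstanceIndexer(table), InstanceIndexer(table)
table.record_change(42)
node_a.sweep()                   # node A catches up first
table.record_change(43)
node_a.sweep(); node_b.sweep()   # both converge on the same index
print(node_a.indexed == node_b.indexed == {42, 43})
```

No instance needs to be notified directly: because the table is in the shared database and each instance keeps its own cursor, every node converges on its next background sweep.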