Loading content items takes too long

Topics: General
Dec 7, 2013 at 7:41 PM
Edited Dec 7, 2013 at 7:47 PM
Hello,

I have to periodically load/update information from an external system.

I created a ScheduledTask; the idea is to run it once per day, read information for a set of products (approximately 2000), and then create/update the corresponding ContentItems.

What I have is something like:
var products = _externalSource.GetProducts();

var cont = 0;
foreach(var prod in products)
{
    //check if exists
    var p = _contentManager.Query<ProductPart, ProductRecord>(VersionOptions.Latest)
                        .Where(r => r.Ref == prod.Ref)
                        .List()
                        .FirstOrDefault();
    if (p == null)
    {
        //create if it doesn't exist
        ContentItem item = _contentManager.New("Product");
        _contentManager.Create(item);
        p = item.As<ProductPart>(); 
    }

    //update
    p.Ref = prod.Ref;
    p.Price = prod.Price;
    //more properties ...

    //publish
    _contentManager.Publish(p.ContentItem);


    if (++cont%100 == 0) 
    {
        //commit and begin new transaction
        _transactionManager.RequireNew();
    }
}
Everything works, but it takes too long: for example, the task takes approximately 25 minutes for 2000 products.

Why does it take so long? What am I doing wrong? Is there a better way to do this kind of thing?

Thanks in advance.
Regards.
Developer
Dec 7, 2013 at 9:34 PM
Although 25 minutes seems a bit too much for 2000 items even with this implementation, there are multiple important points where it could be improved:
  • The whole thing would be much better done in batches (of e.g. 100). Although it's not obvious, background tasks are not designed as "workers" that can run indefinitely.
  • You're opening a new transaction for every single update, which is a huge overhead. I don't think this is justified (or rather: it's only justified if you deal with 2000 items at once, but that is a bad idea in itself, as I mentioned). If you do the work in smaller batches you don't have to worry about transactions either.
  • The first query is issued for every item separately, resulting in 2000 selects to begin with. You could rewrite this to select all the items at once and then e.g. put them into a dictionary for lookup in the loop.
  • In the sample you're only updating ProductPart, which already has its corresponding record loaded because of the query, so no +n queries there. If you are also updating other parts, however, you should Join() those in or include them in .WithQueryHints().
  • I hope Ref is indexed. If it can only contain unique values, and ProductRecord is a ContentPartRecord rather than a ContentPartVersionRecord, the index can be unique too (see the migration sketch below).
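For that last point, a rough sketch of what the index could look like in your module's existing migrations class (assuming ProductRecord maps to a table created in an earlier migration step; the method and index names here are just examples):
public int UpdateFrom1()
{
    // Index the Ref column so the per-item lookups above don't have to scan the table.
    SchemaBuilder.AlterTable("ProductRecord", table =>
        table.CreateIndex("IDX_ProductRecord_Ref", "Ref"));

    return 2;
}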
Dec 8, 2013 at 5:27 PM
Edited Dec 8, 2013 at 6:07 PM
Thanks for your answer Piedone.

I don't understand when you say "You're opening a new transaction for every single update": in the above example I'm creating a new transaction every 100 iterations (or at least I think so). Is this correct?

I followed your advice and got better performance, but it is still not enough. What I have right now is something like:
const int kMaxCommit = 100;
var externalProducts = _externalSource.GetProducts();

var localProducts = _contentManager.Query<ProductPart, ProductRecord>(VersionOptions.Latest)
                            .List()
                            .ToDictionary(p => p.Ref);

var count = 0;
foreach(var prod in externalProducts)
{
    ProductPart p = null;
    
    //check if exists
    if (!localProducts.TryGetValue(prod.Ref, out p))
    {
        //create if it doesn't exist
        ContentItem item = _contentManager.New("Product");
        _contentManager.Create(item);
        
        p = item.As<ProductPart>(); 
    }

    bool hasChanged = false;
    
    //update (if necessary)
    if (p.Price != prod.Price)
    {
        p.Price = prod.Price;
        hasChanged = true;
    }
    
    if (p.Description != prod.Description)
    {
        p.Description = prod.Description;
        hasChanged = true;
    }
        
    //... update more properties ...

    if (hasChanged)
    {
        //publish
        _contentManager.Publish(p.ContentItem);
    }

    if (++count % kMaxCommit == 0) 
    {
        //commit and begin new transaction
        _transactionManager.RequireNew();
    }
}
The execution time of this code is variable: from a few seconds (in the best case, when all products already exist and none has changed) to 12 minutes in the worst case scenario (the initial load). As you can see, performance has improved, but it is still too slow.

I guess the bottleneck is in:
ContentItem item = _contentManager.New("Product");
_contentManager.Create(item);
How can I minimize database accesses? Is it possible to create multiple content items at once?

Thank you.
Regards.
Developer
Dec 8, 2013 at 6:53 PM
Yeah you're right about the transaction, I forgot about the if clause.

Now your code does 2001 DB queries, which is not a small number, but 12 minutes still seems too long for this. Have you tried some profiling to check where the bottleneck is?
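If you don't have a profiler at hand, even some rough Stopwatch timing around the suspect calls would tell a lot. Just a sketch of what I mean, mirroring your loop above (the phase split is arbitrary, and Logger is assumed to be Orchard's injected ILogger):
var createTime = new System.Diagnostics.Stopwatch();
var publishTime = new System.Diagnostics.Stopwatch();

foreach (var prod in externalProducts)
{
    // ... existing lookup/update code ...

    createTime.Start();
    ContentItem item = _contentManager.New("Product");
    _contentManager.Create(item);
    createTime.Stop();

    // ...

    publishTime.Start();
    _contentManager.Publish(item);
    publishTime.Stop();
}

// Log the accumulated time per phase once at the end of the run.
Logger.Information("Create: {0} ms, Publish: {1} ms",
    createTime.ElapsedMilliseconds, publishTime.ElapsedMilliseconds);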

But most probably it's indeed the Create() call. I don't think there is anything to optimize here purely through Orchard's API, but I guess there is a way in NHibernate (e.g. preventing the automatic flushing of changes - Orchard relies on auto-flush - and flushing them explicitly instead). So I'd advise you to check whether there is anything you can do to optimize this directly through NHibernate (using a custom, stateless session would be one option, but I'm not sure how you could wire that together with ContentManager and co).
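Just to illustrate the flushing idea (an untested sketch; it assumes you inject Orchard's ISessionLocator as _sessionLocator, and that turning off auto-flush doesn't break anything else happening during the request):
// Grab the NHibernate session Orchard uses for content records and switch it
// from auto-flush to explicit flushing for the duration of the import.
var session = _sessionLocator.For(typeof(Orchard.ContentManagement.Records.ContentItemRecord));
session.FlushMode = NHibernate.FlushMode.Commit;

// ... create/update the items as before ...

// Then push the accumulated changes to the database yourself at each batch boundary.
session.Flush();
_transactionManager.RequireNew();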

Out of curiosity, what happens if you change the transaction IsolationLevel to ReadUncommitted in SessionLocator?

BTW if your content item is not draftable you don't have to call Publish(). Even if it were draftable, Publish() would only be needed if you fetched draft versions with VersionOptions.DraftRequired.
Dec 9, 2013 at 8:41 AM
Analyzing the execution in more detail, I saw that at the beginning of the process performance was not bad and the first items were processed quickly, but with each iteration performance slowly began to fall.

By chance I happened to try adding a "_contentManager.Clear();" call and everything changed completely. Now the process takes just a few seconds even in the worst case scenario!

This is the code:
const int kMaxCommit = 100;
var externalProducts = _externalSource.GetProducts();

var localProducts = _contentManager.Query<ProductPart, ProductRecord>(VersionOptions.Latest)
                            .List()
                            .ToDictionary(p => p.Ref);

var count = 0;
foreach(var prod in externalProducts)
{
    ProductPart p = null;
    
    //check if exists
    if (!localProducts.TryGetValue(prod.Ref, out p))
    {
        //create if it doesn't exist
        ContentItem item = _contentManager.New("Product");
        _contentManager.Create(item);
        
        p = item.As<ProductPart>(); 
    }

    bool hasChanged = false;
    
    //update (if necessary)
    if (p.Price != prod.Price)
    {
        p.Price = prod.Price;
        hasChanged = true;
    }
    
    if (p.Description != prod.Description)
    {
        p.Description = prod.Description;
        hasChanged = true;
    }
        
    //... update more properties ...

    if (hasChanged)
    {
        //publish
        _contentManager.Publish(p.ContentItem);
    }

    if (++count % kMaxCommit == 0) 
    {
        //commit and begin new transaction
        _transactionManager.RequireNew();
        
        //clear content manager
        _contentManager.Clear();
    }
}
Thanks for your help.
Regards.
Developer
Dec 9, 2013 at 11:23 AM
Nice find! I have no idea why this helps: it's basically an NHibernate session.Clear(). Maybe NHibernate itself begins to choke on the inserts or something.
Coordinator
Dec 9, 2013 at 2:53 PM
This is the same code we have in the Import feature. NHibernate keeps track of all the loaded objects, which takes a lot of memory if you process lots of items and also slows down the process, because every time a content item is loaded it first has to be looked up in this collection. When we import unique items it makes sense to clear the session. I can also assume that running all these loads and writes in a single transaction might slow down the database too, as it has to isolate them.

Things to analyze to understand the impact of these factors:
  • test with and without a new Transaction
  • test with and without clearing the session
  • test with and without updating to ReadUncommitted
If you try these scenarios, I am sure we could derive a best practice from them.
Dec 11, 2013 at 7:04 PM
Edited Dec 11, 2013 at 7:06 PM
Hi, apologies for the delay. I've done some tests with various combinations; here are the results:

All tests were done for the worst case scenario (the initial load).

Test 1, with:
_transactionManager.RequireNew();
_contentManager.Clear();
It takes 169 seconds.


Test 2, with:
//_transactionManager.RequireNew();
_contentManager.Clear();
It takes 147 seconds (a good time but, naturally, the site is inaccessible during that time).

Test 3, with:
_transactionManager.RequireNew();
//_contentManager.Clear();
It takes 700 seconds.


Test 4, with:
_transactionManager.RequireNew(IsolationLevel.ReadCommitted);
//_contentManager.Clear();
It takes 688 seconds.

Test 5, with:
_transactionManager.RequireNew(IsolationLevel.ReadCommitted);
_contentManager.Clear();
It takes 165 seconds.

Regards.
Developer
Dec 19, 2013 at 5:38 PM
Tests 1 and 5 are in effect the same, as RequireNew() by default requires a new transaction with IsolationLevel.ReadCommitted.
Mar 16, 2014 at 3:33 PM
Thanks, SegmentationFault, for all the tests. I just ran into the same problem when importing 17,000 content items.

Did anyone test what batch size is best? Is 100 a good number?
Coordinator
Mar 17, 2014 at 7:37 PM
I am using batches of 50. A website with around 35K content items is imported in 30 minutes on my machine. It's long, but it works, and that is fine with me (speaking as a user here).
Mar 18, 2014 at 9:49 PM
I realize I am late to the conversation. I had high hopes of using the native Import/Export, but it has been bogging down. I have a small module that reads the XML and pushes it through the native Import/Export process, but the solution is taking about one minute for every 50 content items (nothing even close to what is being achieved here).

So I am wondering whether I need to abandon my hopes for the Import/Export module and create a custom solution every time I have a client who needs a lot of data moved. Possibly I am missing something and there is a better way to batch the Import/Export data. Right now memory just continues to grow, so the worker process exhausts it and needs to be terminated.

Love some tips, thanks in advance.
Coordinator
Mar 19, 2014 at 12:28 AM
Edited Mar 19, 2014 at 12:28 AM
@jao28, can you confirm you have applied a BatchSize="100" property on your <Data /> step?

Something like:
<Data BatchSize="50">
    <Blog Id="/alias=" Status="Published">
...
Mar 19, 2014 at 3:58 AM
@sebastienros I have used a BatchSize of "50" and it does not seem to have made any impact (other than, I think, that if I run it longer the memory won't build up now). It takes anywhere from 1 to 3 minutes to run a batch of 50 items. I am wondering if it is because of the number of identities I am resolving. Here is the XML markup for a single imported item; the Provider Id is a random identifier, so it will never match up with existing content:
<Provider Id="/Identifier=6b607ec7593d46df91391487134b9776" Status="Published">
  <InputField.MedicalSchool Value="" />
  <EnumerationField.Type Value="Person" />
  <NumericField.NationalProviderIdentifier Value="111111111" />
  <TaxonomyField.Region Terms="/alias=region\/bucks-county,/alias=region\/bucks-county\/langhorne" />
  <TaxonomyField.Service Terms="/alias=service\/physician-assistants-advanced-practice-nursing-providers,/alias=service\/physician-assistants-advanced-practice-nursing-providers\/nurse-practitioner,/alias=service\/physician-assistants-advanced-practice-nursing-providers\/nurse-practitioner\/family" />
  <TitlePart Title="Some Physicians Name" />
  <AutoroutePart Alias="" UseCustomPattern="false" />
  <CommonPart Owner="/User.UserName=manager" CreatedUtc="2014-03-17T15:48:02Z" PublishedUtc="2014-03-17T15:48:02Z" ModifiedUtc="2014-03-17T15:48:02Z" Container="" />
  <BodyPart Text="" />
  <MetaPart Keywords="Keywords here" Description="Description here." />
  <CustomCss />
  <ContactPart FirstName="First" MiddleName="Middle" LastName="Last" Suffix="" Title="CRNP" Gender="Female" />
  <AddressPart Address1="Address 1" Address2="Address 2" City="Langhorne" County="Bucks County" State="PA" Country="US" PostalCode="19047" GeocodeUpdateNeeded="False" Latitude="40.17555" Longitude="-74.9226" GeocodeProvider="Mapquest" GeocodeQualityCode="A5XBX" GeocodeStaticMapPath="http://open.mapquestapi.com/staticmap/v4/getmap?key=Fmjtd|luubn9u2l9,72=o5-907w00&amp;type=map&amp;size=225,160&amp;pois=purple-1,40.174555,-74.92266,0,0|&amp;center=40.1745538,-74.9226665&amp;zoom=12&amp;rand=933425963" />
</Provider>
As you can see, the content type has a number of parts, many of which use identities for lookup. This means the import process needs to resolve all of these identities (typically aliases). Could that alone be enough to drag the whole process down, or might there be something else I should be looking for?

Finally, I have always run the process synchronously (one XML file with 50 content items at a time) but I am going to try and change my script to run asynchronously. Perhaps by pushing the XML to the site faster and keeping the queue loaded up the overall times will be minimized.

I gladly welcome feedback.
Mar 19, 2014 at 4:55 PM
Edited Mar 19, 2014 at 5:05 PM
I have been analyzing the actual import process to see where the slowest links are, and these are my observations (time estimates are rough, taken from observation rather than precise start/stop timing in code):
  • By using a batch of 100 I am able to import the 100 in 5 minutes (which is one content item as displayed above imported every 3 seconds)
  • About 0.5 seconds are used on DefaultContentManager --> Import --> line var item = importContentSession.Get(identity, VersionOptions.Latest, XmlConvert.DecodeName(element.Name.LocalName));
    • This appears to be because of the time finding the identity (if it exists and creating a new item)
  • After this, it appears that the majority of the time is spent in Publish(item) (as in around 2 seconds)
    • This seems to have the majority tied up in handler.Published events
    • Of the time spent in the Published events, about 1.5 seconds seems to be spent in the TermPartsHandler --> RecalculateCount method
    • Unfortunately, this makes good sense as I am taxonomy heavy in my content item so every additional item I add will cause it to need to count more items
    • In addition, most of my taxonomy fields have multiple terms selected so it needs to perform the count 4 or 5 times plus working up the hierarchy when it is part of a contained element (sometimes multiple levels)
I am not sure there is a great way to improve on the above process. Even if I didn't use the native Import/Export, it appears I would have the same challenges with the handlers (I would save a little time on resolving the identity, as I could pass in the actual content item id / taxonomy term ids, but that doesn't seem to be that significant a slowdown). If I override the RecalculateCount part of the handler, I would then need to figure out how to recalculate the count on all the terms at the end of the import.

As always, thoughts and comments welcome.
Mar 19, 2014 at 5:21 PM
UPDATE: because I never like throwing out numbers that are too rough, I went ahead and disabled the RecalculateCount method on the taxonomy terms, and here are the results:
  • By using a batch of 100 I am able to import the 100 in 2 minutes (which is one content item as displayed above imported every 1.2 seconds)
So that is a dramatic improvement (for cases when you are importing taxonomy-intensive content items). This also seems to be consistent with @SegmentationFault's findings above with batches of 100 items.

For my particular import this will take me down from 5 days to 2 days, but then I will need to devise a RecalculateCountOnAll method, which is an inconvenience for keeping the process simple.

@sebastienros this is still not close to 35,000 content items imported in 30 minutes as you mentioned (that would be 1 content item every 0.05 seconds). Are those the actual numbers you were seeing? I realize I have the debugger attached, so things might speed up somewhat without it, but I don't see how I can ever get to that speed.
Coordinator
Mar 19, 2014 at 10:17 PM
I confirm, those are actual numbers. In my current project I need to import 1000 tenants, each with between hundreds and thousands of content items.

BTW, I have improved the BlogPost count and also the Comments count in 1.x. I assume the same refactoring can be applied to the Terms count, which would improve your overall import performance. The idea is that instead of counting in the driver, it adds a message to a list which is processed at the end of the unit of work, so that the count actually runs only once. Can you see about applying the same logic? It should be mostly copy/pasting.
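Roughly, the same idea applied to terms could look something like this (only a sketch, all names made up; in the real refactoring the Flush() would be hooked to the end of the unit of work instead of being called by hand at the batch boundary):
// Hypothetical per-request service that batches term recounts instead of
// recounting in the handler for every single published item.
public interface ITermCountBatcher : IDependency
{
    void MarkDirty(int termId);   // call this where RecalculateCount is called today
    void Flush();                 // recounts each dirty term exactly once
}

public class TermCountBatcher : ITermCountBatcher
{
    private readonly HashSet<int> _dirtyTermIds = new HashSet<int>();

    public void MarkDirty(int termId)
    {
        // Duplicates collapse automatically, so publishing 100 items under the
        // same term schedules only one recount.
        _dirtyTermIds.Add(termId);
    }

    public void Flush()
    {
        foreach (var termId in _dirtyTermIds)
        {
            // Recalculate the count for this term once, reusing the logic that
            // currently lives in TermPartsHandler.RecalculateCount.
        }
        _dirtyTermIds.Clear();
    }
}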

Also use release mode (this should improve things significantly), and maybe add some DB indexes on the columns that are used for identity resolution.
Jun 10, 2014 at 2:13 AM
I'd be interested to know whether you have tried a large import with the Workflow module disabled.

I found that the workflow content activities were causing me performance issues when creating large numbers of content items in a single request.

I implemented a change to the WorkflowManager class which improved performance. Discussion here: https://orchard.codeplex.com/discussions/547870
Jun 10, 2014 at 3:33 AM
Wow, Sebastien, I missed your last comments (so I have not yet tried the refactoring that was done on the Comments count, but I am going to try to find some time for it).

@mjy78, no, I never did try a large import with the Workflow module disabled. I checked through your thread and it seems interesting; I will keep my eyes open when I do this again. It did not come up when I did my profiling, but that doesn't mean it wasn't also a culprit. Thanks for mentioning it!