Unable to persist any changes

Topics: Administration
May 27, 2013 at 11:59 PM
Edited May 28, 2013 at 12:06 AM
Hi,

We have a production site that has stopped persisting changes. The problem started with the creation of a layer. The admin page reported 2 notifications in succession - "layer created" and "layer not found: XX". The net result was the layer was never created (or perhaps created then the transaction rolled back?). Then we noticed that content edits were no longer being accepted. The authors would edit the pages and publish... only to find the new content was not saved. No errors are presented to the authors.

Logs are too obscure to really determine where the problem lies (obscure for someone not an orchard developer). I can find the localization callout for "layer not found" but the preceding information isn't clear.

What should I examine? It seems like there is a bad reference that is basically forcing all subsequent updates to choke.

This is 1.6.1 running on Azure with sqlce.

Any advice would be appreciated in solving this problem. We've been blocked for a couple of days and simply no closer to finding an answer.

Thanks,
-Maurice
Coordinator
May 28, 2013 at 1:50 AM
The logs may not help you, but they would really help us assess the situation.
May 28, 2013 at 7:26 AM
Thanks for the reply. Is there a particular setting (aside from logging priority = All) or starting point that would help? For example, log from cold start, log from warm starting point outlining failed layer creation only?

Also, what is the best way to get the logs to you? Should I just post a link to a skydrive folder or is there another format we can use? (newbie questions, I know - first time working with an issue via codeplex).
Coordinator
May 28, 2013 at 7:58 AM
Just one stack trace for a request that should have persisted something should be enough. Just paste it here.
May 29, 2013 at 10:03 PM

The log for the failure condition was over 51k lines. I’ve uploaded it to a skydrive folder rather than swamp this tread.

The log represents all activity captured during a failed attempt to create layer “test0529-1”. The system churned for about 2 minutes, then returned to the widget page. No errors were presented although the log does indicate layer 26 was not found (line 29341) and UI did not include the new layer as a option in the dropdown.

Let me know if anything else is needed,

-Maurice

Coordinator
May 31, 2013 at 7:05 AM
Looks like your database is corrupted. You might want to go into the tables and manually delete all records related to that non-existing layer.
May 31, 2013 at 8:21 AM
Edited May 31, 2013 at 8:30 AM
That was my first inclination. I’ve previously examined Orchard_Widgets_LayerPartRecord, _ContentItem, and _ContentItemVersion and verified there are no rows referencing the missing layer id. However, if there are other tables where the layer is referenced, I’m simply unaware (although I have tried to find such references). Can you provide some guidance on which other tables I need to examine? Or provide the sql query to locate the rows? Also, what could have caused this condition? This seems like a pretty serious issue that should be tracked down so that it can’t occur again.
May 31, 2013 at 5:26 PM
I'm starting to believe this is an application timing issue, not a data issue.

We took a backup of the "damaged" db. Then we ran a few tests against 2 machines - the original Azure shared instance and a beefy reserved instance. Right out of the gate, the beefy instance experienced no issues creating new layers. Given the near immediate successful response, this was a clear indication the db was in a good state.

The tests against the shared instance are producing random results. Out of seven tests (all resetting the db before a run), 5 tests failed to create layers and 2 tests were successful. It seems that the slower system characteristics have a drastic affect on how the application responds.

If the database starting point is the same in all instances and the database is corrupted, the machine characteristics should not influence the final result - layer creation should fail consistently. However, we can randomly create new layers, which indicates the database is in a good state and we're left at examining how the application reacts to the queries it generates. Right?

In looking at the logs, I can see the call where the layer is created and where the layer is subsequently not found. However there are no errors that indicate an update failed or a rollback had been started. It's as though the layer is created and somewhere between the creation time and first time it's called up, it just disappears.

I'm more than willing to further examine the sql database, but I need help (as noted in my previous reply - I'm learning the schema as fast as possible). If there is an application level timing issue (as I believe there is), what can of data can I collect to help isolate where the app is failing to lock/wait for the right state?
Coordinator
May 31, 2013 at 7:45 PM
So you're using SQL CE on Azure. How many instances do you have running?
May 31, 2013 at 7:51 PM

A single instance for all tests.

Coordinator
May 31, 2013 at 8:40 PM
So your azure web site is configured to use just the smallest size, with one instance? At this point, if you have repro steps, I would file a bug.
Coordinator
May 31, 2013 at 8:41 PM
Upgrading to SQL Server may also help, maybe.
Jun 1, 2013 at 8:27 AM
Edited Jun 1, 2013 at 8:28 AM
In development, when the log are too verbose, I obtain this kind of latency so I change ALL by Warn.

If it's an error with the SQL you should obtain some informations by add these line.
 <logger name="NHibernate.SQL">
    <priority value="WARN" />
  </logger>-->
http://weblogs.asp.net/bleroy/archive/2013/04/03/logging-sql-queries-in-orchard.aspx
Jun 2, 2013 at 8:55 PM
Bernard: the repro steps are pretty basic. the problem is that I don't have a way to repro a slow machine which in turns impacts the application. the company i'm trying to help wants a small site for nearly 2 dozen pages and for all intents in purpose they are happy with the Shared instance. they wanted to step up from generic asp.net pages over to a nicer system to facilitate content edits. we realize that Azure's performance is highly erratic - and that's clearly a symptom of a being on a shared service. Unfortunately, Orchard's ability to work reliably is heavily tied to the performance. Some days, the service moves along smoothly and Orchard has no problems doing what it needs to do; however, if resources are tight, persistence is an issue. It's almost like Orchard can't get what it needs quick enough and instead of waiting for the task to complete, it just moves along until it errors downstream rather than blocking/waiting then failing.

Alexandre: I posted the full log with verbose logging so that someone could hopefully dig through the file and pick out why Orchard loses content. I am fully aware that verbose logging on a limited system will only further exacerbate problems. However, even with verbose logging turned on, the logs don't show indicate why the error condition appears. I have gone back and selectively turned off a lot of the extraneous loggers in hopes of recording the right data during the tests. I have looked through all the SQL entries and nothing is jumping out as an error condition.


This is a puzzle and if I had a clear environment repro, I would gladly post it. The user repro is at the top of this discussion. This appears to be an application issue related to low or slow resources. Today, June 2nd, as of 20 minutes ago we visibly noticed the site is slow (Azure shared instance is apparently bogged down) and any change we try to post is just lost. Orchard thinks it's done the right thing but it's very clearly not. If the logs show "Layer not found" and but Orchard does not present that information to the user, that tells me that Orchard is missing something. Again, years of tracing mysterious bugs tells me this is not a SQL issue but rather an app layer issue.

Thanks for all the help... I'll keep digging. For the coders in the house, I recommend examining the code to see how it would react on a system with low resources.
Developer
Jun 2, 2013 at 8:57 PM
Welcome back Bernard.
Coordinator
Jun 3, 2013 at 8:04 AM
I was going to say exactly that. Welcome back, Bernard, we've been missing you...