Pass-By-Reference Problem When Using Websphere Application Server

This has been kicking around, nearly finished, for months. It's not going to get any better, or shorter, so it's long past time I put it out there.

It’s also just long; and technical. So feel free to ignore. I won’t be offended.

I rarely write about programming or other technical issues here, but I probably should do so more often. Certainly in a case like this.

I often think about the many, many problems that I’ve had help with from strangers on the internet; people who have taken the time to write blog posts, answers to questions on forums, or technology tutorials. My job would barely be possible at times without the web. Of course, we didn’t have it back when I started in 1987; but we didn’t do such complex things, with so many different languages and technologies.

Anyway, all these kind strangers have helped me, and I rarely find myself in a position to give anything back to the community. So since I recently hit a problem that no-one else seems to have had, it’s really my duty to describe it, and my solution, in the hope that it might be of use to someone down the line.

If you’re looking for the solution to the problem with pass-by reference on WAS, and don’t want to read the story of how I got there, you can jump straight to The New Bug

Background

We develop our main app using a fairly standard n-tier architecture using JEE: web front end using JSPs and Struts; EJBs; a multiplicity of database platforms accessed using Spring. All fairly standard stuff, whose purpose is to move financial messages around.

A lot of this was originally developed when I wasn’t around (I was seconded to another department) and by contractors and others who are no longer with us. So I take no responsibility for the stupidities that exist in codebase. Or rather, I accept no blame. I do, in fact, have responsibility for it; for keeping it going and developing it onwards now.¹

One of the bad choices that was made by the original developers of this version of the product, was that they should cache the results of database queries. The users can define various criteria by which they want to select a set of messages to view; those get translated into SQL, which our Java code executes using JDBC. Again, all standard stuff. JDBC was designed for exactly that kind of thing. Databases exist solely to do that kind of thing.

So the wise and sensible developers decided that performance would be a problem if a query returned many rows from the database. They decided that transferring the rows to the browser and allowing the user to scroll through them would be impossible. So they designed a caching mechanism.

Thing is, JDBC has that kind of caching built right in. And furthermore, they (our developers) included a limit: a maximum number of rows to return, which could be set to 50, 100, 500, or 1000. Pretty reasonable, since any query that returned over 200 or so rows is likely to be less than useful, anyway.

But they still built that caching mechanism.

The Mechanism

That’s all right, though, I hear you say. Cache the rows server-side in memory, return a subset to the user as they page through them. It sounds fine.

True enough. Except they didn’t cache them in memory. Oh no. That would have been too sensible. And might have caused performance problems (I’m sure they thought, if they even considered they matter). No, they cached them elsewhere. Where? In the database.

Yes, they introduced another table; a shadow table; an almost-identical duplicate of the Messages table, called MessageQueryResults. Executing a query then consisted of selecting the required rows and writing them into this table, keyed by the HTTP session ID; and then re-querying this results table to get a page worth of results.

So, to recap, then: to improve performance (without first determining that there was actually a problem), they replaced a simple database read with a read, a set of writes, and another set of reads.

That was bound to perform better, right?

The Failure

It wasn’t performance that brought this flimsy edifice crashing down, though.² No, it actually ran quite successfully for several years. Three things brought about its end: Microsoft, multiplicity, and me.³

Microsoft’s part was through their database platform, SQL Server. Between one version and another they changed something about their storage mechanism, so that you could no longer rely on rows on a table being in the sequence in which they were written to the table. The thing is, you’re not supposed to be able to rely on that, according to DB theory. That was another flaw in the “design” above; it relied on the shadow table’s rows being returned in the same sequence they were written in. On Oracle and DB2 that worked; and it did on SQL Server too, until (if memory serves) the 2005 version. This meant that clients on that platform who had large queries couldn’t rely on them being displayed in the right order.

Oh dear.

Hacks were applied to sort this out. Pun intended: sorting is pretty much what they did. Not a fix, but a workaround at best.

The multiplicity part was that the same mechanism was used to query another table; and then a third. And there was a fourth on the horizon. Each new table meant a new shadow table which had to be maintained in parallel – and whose creation and upgrade scripts had to be maintained across three database platforms. A maintenance nightmare.

Then there was me. I had known about the problem for some time, of course – I had done an estimate for fixing it – but there was never time to fix it. It was a big task, quite intrusive, and showing no easily-provable customer benefit. Yes, I know ease of maintenance, by making life easier for developers, is an implicit customer benefit; but try selling that to management, when there are customers crying out for new features.

But in the project that was to introduce the fourth table (or seventh and eighth, you might say) I was in a position to say, “we fix this first, or it all goes to hell”.

The user story was written. I got my estimate out of hibernation (and increased it, of course). And then I did the fix. It was a great joy. That in-memory caching mechanism I mentioned above? I did that. If I’d been designing it from scratch I would almost certainly have relied on JDBC’s internal caching, at least until it proved problematic. But under the circumstances, when the code relied on there being a cache, it was going to be much less disruptive to retain one. I just replaced the stupid one with a more sensible one.

Inevitably, though, I introduced a new bug.

The New Bug

This is where I stop telling a story and start explaining the problem and solution.

Introducing the new message caching mechanism, which replaced the MessageQueryResults table, inadvertently caused a problem when we set a WAS server to pass-by-reference mode.

This mode is recommended when the different tiers of the application (web, EJB) are running in the same JVM. This is normally the case in our test environments, and frequently the case in client systems. Enabling this mode removes the need for objects to be copied as they are passed through the tiers, and can improve performance dramatically in such environments.

What Went Wrong

Changing the caching mechanism caused no problem as long as pass-by-reference was off. As soon as it was turned on, we noticed that taking certain actions, such as deleting a message, failed.

The failure was at a point in the code where the a value such as an amount was being retrieved from the Map that formed the new cache. The failure was that the retrieved Object was being cast to a Number, but what was in the Map entry was actually a String.

This Map comes, by a fairly complex set of steps, from the new cache, and before that from the database itself, of course.

Now, since the Amount column on the DB is numeric, and the Map in question is originally populated via Spring from the DB, obviously the value was a numeric one originally. This suggested that the value must have been changed, and that gave us the first clue to tracking down the cause of the problem, and coming up with a solution.

The Cause

It seemed likely – and running debug, it was shown to be so – that the numeric value that was retrieved from the database and stored in the Map was being replaced by the edited value which is built for displaying. In other words, the object now contained a String holding digits, a decimal point, and probably commas.

Why it Changed When We Switched on Pass-By-Reference

When the data was being passed by value, a new Map, complete with its contents, was being passed from the EJB layer to the webapp. The webapp then updated values in that Map, editing them for display purposes. But it was only changing its own copy; it had no effect on the version stored back in the EJB layer. So when the same Map was retrieved again, so that the action could be performed, a new copy was received by the webapp. No problem.

But when pass-by-reference is on, no copy is made. The webapp receives a reference to the actual Map that is stored in the cache back in the EJB layer. So when it updates an entry in that Map, it updates the very object that is stored in the cache (note that the put method of the Map interface will update the stored value if it receives a key that it already holds).

And then when the Map and its entry are retrieved again for the action to be performed, it is the updated (and now wrong) version that is retrieved.

Why it Changed When We Changed the Caching Mechanism

And yet both possible passing settings were available before we changed the caching mechanism. Why did we not get this problem when using pass-by-reference with the old caching mechanism?

The answer to that is that the old mechanism cached the query results in the database itself, in the MessageQueryResults table. Each time a set of results was requested by the webapp, the EJB layer went back to this temporary table and populated the Map that it returned to the webapp. So the amount value would always have been set up freshly from the numeric Amount column, which ensured that it was an object of type Number.

The Fix

I tried making copies of the Maps and Lists used, at various points in the process, including using ImmutableLists and ImmutableMaps from Google’s Guava library, in an attempt to prevent the value object of interest from being updated. However, it wasn’t possible to make them immutable deeply enough (and would probably have caused other problems if it had been). That was largely because the principal Map is created and populated by Spring, so we don’t have much control over it.

One solution – and probably the proper one – would have been to copy the entries from the Map at the point they are read and processed in the webapp. This would have meant that the edited, String, version of amount would be a different, new object, and would not have been updated in the Map that came from the cache.

However, the vast complexity of the class where this would have had to happen made this seem like a very difficult and dangerous approach, especially at this late point in the project.

An alternative solution was suggested by one of my colleagues. It was to accept the fact that the amount value might be a String containing a numeric value with commas and decimal point, and to parse the numeric value out of it.

This allowed us to cater for both numeric and string values, and it worked with either form of passing semantics. But it felt like a hack, and I was sure it would come back to bite us.

Fixing the Fix, a Little Later

It did. The trigger this time was paging through the list of results; when you returned to a page you had already seen, you ended up with an object of the wrong kind coming out of the Map. If memory serves it was a String where it should have been a Date.

It was clearly another result of the data being edited for display and updated in-place in the Map. There are too many possible places in in the relevant method to rely on finding them all, so I returned to the “probably the proper” solution mentioned above. I changed the relevant method such that it now returns a copy of the List containing the required subset of the query results. This is less straightforward than might be hoped, because copying a List, including by the clone method of the implementing class, for example, tends to do a “shallow copy”, which means that you get a new List instance, but containing references to the same objects.

I wrote a method called copyList, which iterates over a List and makes “deep” copies of a few expected types of object. We may have to extend this method to handle other types, but I don’t expect that at the moment.

Also Worth Noting

There is a warning about this on IBM’s Best Practice: Using pass by reference for EJBs and servlets if in same JVM page, but it’s one of those typical contrived kind of examples that probably wouldn’t really alert you to the possibility of something like my experience.

Notes

To set the pass-by-reference mode on or off, take the following steps in the WAS administrative console (this is WAS 6, it’s probably different at other releases).

Go to Servers -> Application servers -> <server -name>

Expand Container Services; click on ORB Service; check/uncheck “Pass by reference”

Not just me, I should note. ↩︎
Or more folding down, slowly, over years. ↩︎
I love a bit of alliteration, don’t you? ↩︎