This has been kicking around, nearly finished, for months. It’s not going to get any better, or shorter, so it’s long past time I put it out there.
It’s also just long; and technical. So feel free to ignore. I won’t be offended.
I rarely write about programming or other technical issues here, but I probably should do so more often. Certainly in a case like this.
I often think about the many, many problems that I’ve had help with from strangers on the internet; people who have taken the time to write blog posts, answers to questions on forums, or technology tutorials. My job would barely be possible at times without the web. Of course, we didn’t have it back when I started in 1987; but we didn’t do such complex things, with so many different languages and technologies.
Anyway, all these kind strangers have helped me, and I rarely find myself in a position to give anything back to the community. So since I recently hit a problem that no-one else seems to have had, it’s really my duty to describe it, and my solution, in the hope that it might be of use to someone down the line.
If you’re looking for the solution to the problem with pass-by-reference on WAS, and don’t want to read the story of how I got there, you can jump straight to [The New Bug](#newbug).
We develop our main app using a fairly standard [n-tier](http://en.wikipedia.org/wiki/Multitier_architecture) architecture using [JEE](http://java.sun.com/j2ee/overview.html): web front end using JSPs and Struts; EJBs; a multiplicity of database platforms accessed using Spring. All fairly standard stuff, whose purpose is to move financial messages around.
A lot of this was originally developed when I wasn’t around (I was seconded to another department) and by contractors and others who are no longer with us. So I take no responsibility for the stupidities that exist in the codebase. Or rather, I accept no blame. I do, in fact, have responsibility for it: for keeping it going and developing it onwards now.
One of the bad choices that was made by the original developers of this version of the product, was that they should cache the results of database queries. The users can define various criteria by which they want to select a set of messages to view; those get translated into SQL, which our Java code executes using JDBC. Again, all standard stuff. JDBC was designed for exactly that kind of thing. Databases exist solely to do that kind of thing.
So the wise and sensible developers decided that performance would be a problem if a query returned many rows from the database. They decided that transferring the rows to the browser and allowing the user to scroll through them would be impossible. So they designed a caching mechanism.
Thing is, JDBC has that kind of caching built right in. And furthermore, they (our developers) included a limit: a maximum number of rows to return, which could be set to 50, 100, 500, or 1000. Pretty reasonable, since any query that returns over 200 or so rows is likely to be less than useful anyway.
But they still built that caching mechanism.
## The Mechanism
That’s all right, though, I hear you say. Cache the rows server-side in memory, return a subset to the user as they page through them. It sounds fine.
True enough. Except they didn’t cache them in memory. Oh no. That would have been too sensible. And might have caused performance problems (I’m sure they thought, if they even considered the matter). No, they cached them elsewhere. Where? In the database.
Yes, they introduced another table; a shadow table; an almost-identical duplicate of the `Messages` table, called `MessageQueryResults`. Executing a query then consisted of selecting the required rows and writing them into this table, keyed by the HTTP session ID; and then re-querying this results table to get a page’s worth of results.
So, to recap, then: to improve performance ([without first determining that there was actually a problem](http://c2.com/cgi/wiki?PrematureOptimization)), they replaced a simple database read with a read, a set of writes, and another set of reads.
That was _bound_ to perform better, right?
## The Failure
It wasn’t performance that brought this flimsy edifice crashing down, though. No, it actually ran quite successfully for several years. Three things brought about its end: Microsoft, multiplicity, and me.
Microsoft’s part was through their database platform, SQL Server. Between one version and another they changed something about their storage mechanism, so that you could no longer rely on rows on a table being in the sequence in which they were written to the table. The thing is, you’re not _supposed_ to be able to rely on that, according to DB theory. That was another flaw in the “design” above; it relied on the shadow table’s rows being returned in the same sequence they were written in. On Oracle and DB2 that worked; and it did on SQL Server too, until (if memory serves) the 2005 version. This meant that clients on that platform who had large queries couldn’t rely on them being displayed in the right order.
Hacks were applied to sort this out. Pun intended: sorting is pretty much what they did. Not a fix, but a workaround at best.
The multiplicity part was that the same mechanism was used to query another table; and then a third. And there was a fourth on the horizon. Each new table meant a new shadow table which had to be maintained in parallel — and whose creation and upgrade scripts had to be maintained across three database platforms. A maintenance nightmare.
Then there was me. I had known about the problem for some time, of course — I had done an estimate for fixing it — but there was never time to fix it. It was a big task, quite intrusive, and showing no easily-provable customer benefit. Yes, I know ease of maintenance, by making life easier for developers, is an implicit customer benefit; but try selling that to management, when there are customers crying out for new features.
But in the project that was to introduce the fourth table (or seventh and eighth, you might say) I was in a position to say, “we fix this first, or it all goes to hell”.
The user story was written. I got my estimate out of hibernation (and increased it, of course). And then I did the fix. It was a great joy. That in-memory caching mechanism I mentioned above? I did that. If I’d been designing it from scratch I would almost certainly have relied on JDBC’s internal caching, at least until it proved problematic. But under the circumstances, when the code relied on there being a cache, it was going to be much less disruptive to retain one. I just replaced the stupid one with a more sensible one.
Inevitably, though, I introduced a new bug.
## The New Bug
This is where I stop telling a story and start explaining the problem and solution.
Introducing the new message caching mechanism, which replaced the `MessageQueryResults` table, inadvertently caused a problem when we set a WAS server to pass-by-reference mode.
This mode is recommended when the different tiers of the application (web, EJB) are running in the same JVM. This is normally the case in our test environments, and frequently the case in client systems. Enabling this mode removes the need for objects to be copied as they are passed through the tiers, and can improve performance dramatically in such environments.
## What Went Wrong
Changing the caching mechanism caused no problem as long as pass-by-reference was off. As soon as it was turned on, we noticed that taking certain actions, such as deleting a message, failed.
The failure was at a point in the code where a value such as an amount was being retrieved from the `Map` that formed the new cache: the retrieved `Object` was being cast to a `Number`, but what was in the `Map` entry was actually a `String`.
This `Map` comes, by a fairly complex set of steps, from the new cache, and before that from the database itself, of course.
Now, since the `Amount` column on the DB is numeric, and the `Map` in question is originally populated via Spring from the DB, obviously the value was a numeric one originally. This suggested that the value must have been changed, and that gave us the first clue to tracking down the cause of the problem, and coming up with a solution.
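The failure mode is easy to reproduce in isolation. Here is a minimal sketch (the key name and values are made up for illustration, not taken from our real schema):

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;

public class CastFailureDemo {
    // Returns the value cast to Number, just as the failing code did.
    static Number amountFrom(Map<String, Object> row) {
        return (Number) row.get("Amount"); // throws ClassCastException if a String snuck in
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("Amount", BigDecimal.valueOf(1234.56)); // as originally loaded from the DB
        System.out.println(amountFrom(row));            // fine: prints 1234.56

        // Somewhere along the way the value is replaced by its display form...
        row.put("Amount", "1,234.56");
        try {
            amountFrom(row);                            // ...and the same cast now blows up
        } catch (ClassCastException e) {
            System.out.println("cast failed: entry is now a "
                    + row.get("Amount").getClass().getSimpleName());
        }
    }
}
```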
## The Cause
It seemed likely — and running debug, it was shown to be so — that the numeric value that was retrieved from the database and stored in the `Map` was being replaced by the edited value which is built for displaying. In other words, the object now contained a `String` holding digits, a decimal point, and probably commas.
## Why it Changed When We Switched on Pass-By-Reference
When the data was being passed by value, a new `Map`, complete with its contents, was being passed from the EJB layer to the webapp. The webapp then updated values in that `Map`, editing them for display purposes. But it was only changing its own copy; it had no effect on the version stored back in the EJB layer. So when the same `Map` was retrieved again, so that the action could be performed, a new copy was received by the webapp. No problem.
But when pass-by-reference is on, no copy is made. The webapp receives a reference to the actual `Map` that is stored in the cache back in the EJB layer. So when it updates an entry in that `Map`, it updates the very object that is stored in the cache (note that the `put` method of the `Map` interface will update the stored value if it receives a key that it already holds).
And then when the `Map` and its entry are retrieved again for the action to be performed, it is the updated (and now wrong) version that is retrieved.
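The difference between the two passing semantics can be sketched like this. The cache here is just a static `Map` standing in for the real EJB-layer cache, and the "copy" is a plain `HashMap` copy rather than the serialization WAS actually performs, but the aliasing behaviour is the same:

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;

public class AliasingDemo {
    // Stand-in for the EJB-layer cache; the real cache is rather more involved.
    private static final Map<String, Object> cachedRow = new HashMap<>();

    // Pass-by-reference: the caller gets the cached object itself.
    static Map<String, Object> getRowByReference() {
        return cachedRow;
    }

    // Pass-by-value semantics: the caller gets a copy, roughly as WAS
    // would produce by serializing the Map across the remote interface.
    static Map<String, Object> getRowByValue() {
        return new HashMap<>(cachedRow);
    }

    public static void main(String[] args) {
        cachedRow.put("Amount", BigDecimal.valueOf(1234.56));

        // Webapp edits its own copy for display: the cache is untouched.
        Map<String, Object> copy = getRowByValue();
        copy.put("Amount", "1,234.56");
        System.out.println(cachedRow.get("Amount") instanceof Number); // true

        // Webapp edits the shared reference: the cache itself is corrupted.
        Map<String, Object> ref = getRowByReference();
        ref.put("Amount", "1,234.56");
        System.out.println(cachedRow.get("Amount") instanceof Number); // false
    }
}
```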
## Why it Changed When We Changed the Caching Mechanism
And yet both possible passing settings were available before we changed the caching mechanism. Why did we not get this problem when using pass-by-reference with the old caching mechanism?
The answer to that is that the old mechanism cached the query results in the database itself, in the `MessageQueryResults` table. Each time a set of results was requested by the webapp, the EJB layer went back to this temporary table and populated the `Map` that it returned to the webapp. So the amount value would always have been set up freshly from the numeric `Amount` column, which ensured that it was an object of type `Number`.
## The Fix
I tried making copies of the `Map`s and `List`s used, at various points in the process, including using `ImmutableList`s and `ImmutableMap`s from [Google’s Guava library](http://code.google.com/p/guava-libraries/), in an attempt to prevent the value object of interest from being updated. However, it wasn’t possible to make them immutable deeply enough (and would probably have caused other problems if it had been). That was largely because the principal `Map` is created and populated by Spring, so we don’t have much control over it.
One solution — and probably the proper one — would have been to copy the entries from the `Map` at the point they are read and processed in the webapp. This would have meant that the edited, `String`, version of amount would be a different, new object, and would not have been updated in the `Map` that came from the cache.
However, the vast complexity of the class where this would have had to happen made this seem like a very difficult and dangerous approach, especially at this late point in the project.
An alternative solution was suggested by one of my colleagues. It was to accept the fact that the amount value might be a `String` containing a numeric value with commas and decimal point, and to parse the numeric value out of it.
This allowed us to cater for both numeric and string values, and it worked with either form of passing semantics. But it felt like a hack, and I was sure it would come back to bite us.
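A sketch of that workaround might look like the following. The method name is my own, and the real code has to deal with whatever locale the display editing used; I pin `Locale.US` here purely for illustration:

```java
import java.math.BigDecimal;
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

public class AmountParser {
    // Tolerate both the raw numeric value and its edited display
    // form ("1,234.56"). Method name is illustrative.
    static BigDecimal toAmount(Object value) {
        if (value instanceof Number) {
            return new BigDecimal(value.toString());
        }
        try {
            // NumberFormat understands grouping commas; a real fix
            // would use the same locale the display editing used.
            Number parsed = NumberFormat.getNumberInstance(Locale.US).parse(value.toString());
            return new BigDecimal(parsed.toString());
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a numeric value: " + value, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toAmount(new BigDecimal("1234.56"))); // 1234.56
        System.out.println(toAmount("1,234.56"));                // 1234.56
    }
}
```

It works, but as I say: it caters for the symptom, not the cause.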
## Fixing the Fix, a Little Later
It did. The trigger this time was paging through the list of results; when you returned to a page you had already seen, you ended up with an object of the wrong kind coming out of the `Map`. If memory serves it was a `String` where it should have been a `Date`.
It was clearly another result of the data being edited for display and updated in-place in the `Map`. There are too many possible places in the relevant method to rely on finding them all, so I returned to the “probably the proper” solution mentioned above. I changed the relevant method such that it now returns a copy of the `List` containing the required subset of the query results. This is less straightforward than might be hoped, because copying a `List` (by the `clone` method of the implementing class, for example) tends to do a “shallow copy”, which means that you get a new `List` instance, but containing references to the same objects.
I wrote a method called `copyList`, which iterates over a `List` and makes “deep” copies of a few expected types of object. We may have to extend this method to handle other types, but I don’t expect that at the moment.
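For illustration, a `copyList` along these lines. The types handled here (rows as `Map`s holding `Date`s and immutable values) are assumptions made for the sketch; the real method handles whatever our cache actually holds:

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ListCopier {
    // Deep-copies a list of row Maps, so the caller can edit its copy
    // without touching the cached originals.
    static List<Map<String, Object>> copyList(List<Map<String, Object>> rows) {
        List<Map<String, Object>> result = new ArrayList<>(rows.size());
        for (Map<String, Object> row : rows) {
            Map<String, Object> copy = new HashMap<>();
            for (Map.Entry<String, Object> e : row.entrySet()) {
                copy.put(e.getKey(), copyValue(e.getValue()));
            }
            result.add(copy);
        }
        return result;
    }

    static Object copyValue(Object value) {
        if (value instanceof Date) {
            return new Date(((Date) value).getTime()); // Dates are mutable: copy them
        }
        // Strings and Numbers are immutable, so sharing the reference is safe.
        return value;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("Amount", java.math.BigDecimal.valueOf(10));
        List<Map<String, Object>> cached = new ArrayList<>();
        cached.add(row);

        List<Map<String, Object>> copy = copyList(cached);
        copy.get(0).put("Amount", "10.00");    // webapp edits its copy for display
        System.out.println(row.get("Amount")); // prints 10: the cached row is untouched
    }
}
```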
## Also Worth Noting
There is a warning about this on IBM’s [Best Practice: Using pass by reference for EJBs and servlets if in same JVM](http://www.ibm.com/developerworks/websphere/library/bestpractices/ejbs_and_servlets_using_pass_by_reference.html) page, but it’s one of those typical contrived kind of examples that probably wouldn’t really alert you to the possibility of something like my experience.
To set the pass-by-reference mode on or off, take the following steps in the WAS administrative console (this is WAS 6; it’s probably different at other releases).
Go to `Servers -> Application servers -> `
Expand `Container Services`; click on `ORB Service`; check/uncheck `Pass by reference`