When Prepared Statements Go Bad
August 28th, 2009 by Brett HendersonWhen writing Java applications that interact with a relational database, it is typical to use the JDBC PreparedStatement class to execute queries and updates. There are several key reasons for using this class instead of the simpler Statement class:
- Simplified use of variables. The use of bind variables avoids having to escape the variables being added to the query. This avoids dealing with formatting variables date formats, escaping string terminators, and eliminates SQL injection attacks.
- Query Plan Re-Use. The removal of dynamic variables from the query string allows the database to cache the resultant query plan. The database engine will see two queries using the same query string and can re-use the previously calculated query plan.
- Re-Use of Prepared Statements. A prepared statement can be re-used for many operations against the database. This may be used multiple times within a single transaction, or re-used across many transactions using client-side Prepared Statement caching.
There are exceptions to every rule however, and there are cases where Prepared Statements do not provide the optimal solution. This article discusses a recent scenario in which using Prepared Statement functionality had a serious detrimental impact on performance, and discusses the options available for resolving the issue.
Problem
The Open Street Map project is an Internet community that aims to become the “Wikipedia of Maps”. The project is building a vector data set containing features such as roads, geographical boundaries, coastlines, buildings, points of interest, and in fact any feature which exists in the physical world. It has many parallels to Google Maps, but the data is freely available in vector form. This geo-spatial data set is a technology enabler allowing all kinds of new and innovative tools and applications to be built.
The central Open Street Map database is approximately 1 terabyte in size, and contains many hundreds of millions of rows. Statistics showing the growth and activity over time can be found at the following URL:
http://wiki.openstreetmap.org/wiki/Stats
A complete snapshot of the dataset without history can be downloaded in compressed XML form and is currently approximately 6.6 gigabytes expanding to approximately 125 gigabytes when uncompressed. This snapshot known as the “planet” file provides the basis for consumers of the data to build their own databases and applications to use the data. Due to the size of the planet file, it is not feasible to download too often and therefore consumers cannot get access to new data in real-time. A new planet file is made available weekly at the following URL:
http://planet.openstreetmap.org
To solve the real-time problem of access to data, a tool called Osmosis is used to extract regular changesets from the database and post them on a public web server for download. These changesets are available in day, hour, and minute intervals. Osmosis is written in Java, and uses JDBC to access the production PostgreSQL relational database. Osmosis queries changes from 8 major tables, and several supporting tables. It extracts these changes based on timestamp. It queries for all data created within a particular time interval, and writes the changes to a new changeset file.
Using these change files, it is possible for consumers to keep their own databases synchronised with the production database. Using the minute changesets, it is possible to keep a replica database within approximately 6 minutes of the primary database. This near real-time feed eliminates the need for most applications to directly access the production database drastically improving performance for applications and allowing the production database to scale to a much larger user base.
Recent enhancements to Osmosis resulted in many of the queries to be re-written. Unfortunately, the performance of these queries was unacceptable. The queries need to execute in such a way that their execution time is proportional to the amount of data being extracted. The new queries were taking several hours to run regardless of the time interval being queried. This is bad when extracting day changesets, and completely unworkable when extracting minute changesets.
Diagnosis
The database contains the following primary entity types:
- Nodes – Represent a single point on a map. Defined in terms of a latitude and longitude (eg. a point of interest).
- Ways – Represent a linear path between several nodes (eg. a road).
- Relations – A logical grouping of several nodes, ways and/or relations.
Each of those tables has a corresponding tag table which allows name/value pairs to be attached to describe the feature. For example a road may have the tags highway=primary and name=<street name> attached to it.
There are several other key tables in the database, the key ones being:
- Changesets – All edits are grouped into changesets. Each node/way/relation is related to a changeset.
- Users – All edits are performed by a registered user. Each changeset is related to a user.
The database contains many other tables, but not important to this discussion. For example, ways have a join table to the node table and relations have a generic join table to nodes, ways and relations.
The problematic query is shown below:
SELECT e.id, e.version, e.timestamp, e.visible, u.data_public,
u.id AS user_id, u.display_name, e.changeset_id, e.latitude, e.longitude
FROM nodes e
INNER JOIN (
SELECT id, version FROM
nodes WHERE timestamp > ? AND timestamp <= ?
) t ON e.id = t.id AND e.version = t.version
INNER JOIN changesets c ON e.changeset_id = c.id
INNER JOIN users u ON c.user_id = u.id
The query could be simplified, but the sub-select is included to improve code re-use. The sub-select can be replaced with a different query that has different selection criteria without affecting the remainder of the query.
A test date range of 1 second in duration was used for analysis. If a 1 second query can be made to run quickly, larger duration queries typically follow suit with their duration approximately proportional to data size. The sample date range was from “2009-06-08 00:00:00” to “2009-06-08 00:00:01”. This resulted in the following query:
SELECT e.id, e.version, e.timestamp, e.visible, u.data_public,
u.id AS user_id, u.display_name, e.changeset_id, e.latitude, e.longitude
FROM nodes e
INNER JOIN (
SELECT id, version FROM
nodes WHERE timestamp > '2009-06-08 00:00:00' AND timestamp <= '2009-06-08 00:00:01'
) t ON e.id = t.id AND e.version = t.version
INNER JOIN changesets c ON e.changeset_id = c.id
INNER JOIN users u ON c.user_id = u.id
This query was evaluated using the EXPLAIN command which produced the following query plan.

Query plan with 1 second interval
This query plan is reasonable and doesn’t explain the poor performance. The large node and medium changeset tables are joined via nested loops, and the small user table is joined via a hash join. This query took less than 1 second to execute. It was apparent that the Java implementation was using a different query plan but not immediately obvious why.
The Java implementation was run again and was still executing several minutes later without result. The database process was examined and was causing heavy disk IO so it wasn’t blocked on a database lock.
Full JDBC tracing was enabled to determine if the JDBC driver was doing anything unusual such as messing with isolation levels. This trace can be enabled in the PostgreSQL driver by adding the following parameter to the connection URL:
loglevel=2
Examination of the trace showed the code performing a standard connection to the database and beginning a transaction as expected, then performing a prepare/bind/execute of the query, reading the query results, then committing the transaction and closing the connection. Nothing unusual was found in the trace.
The only difference that could be found between the Java initiated query and a console query was the use of bind variables. The Java code was modified to avoid the use of bind variables and directly pass the dates within the query. This was tested and found to produce efficient queries.
Some further investigation was done into PostgreSQL query planning and it was found that prepared statements can produce non-optimal query plans due to the fact that a query plan is created prior to the bind variables being available. In our case the time interval is typically small and selects a small subset of the data in the node table. The query planner doesn’t know this and assumes that a much larger dataset will be selected. As a result it resorts to full table scans which perform horribly when a small time interval is being selected. This is mentioned at the following URL:
http://jdbc.postgresql.org/documentation/83/server-prepare.html
This behaviour was verified at the console by preparing a statement and retrieving the query plan as follows:
PREPARE mystatement (timestamp, timestamp) AS
SELECT e.id, e.version, e.timestamp, e.visible, u.data_public,
u.id AS user_id, u.display_name, e.changeset_id, e.latitude, e.longitude
FROM nodes e
INNER JOIN (
SELECT id, version FROM
nodes WHERE timestamp > $1 AND timestamp <= $2
) t ON e.id = t.id AND e.version = t.version
INNER JOIN changesets c ON e.changeset_id = c.id
INNER JOIN users u ON c.user_id = u.id;
EXPLAIN EXECUTE mystatement('2009-06-08 00:00:00', '2009-06-08 00:00:01');
This produced the following plan:

Query plan with bind variables
This plan is using a full table scan on the node table which is completely inappropriate for returning 27 rows from a table containing in excess of 500 million rows. The database server has 32GB of RAM, so smaller tables such as the users and even changeset tables can be scanned without major problems, but the main node/way/relation tables and their child tables must use existing indexes.
Resolution
At this point we knew that prepared statements were causing non-optimal query plans but had to find a way to resolve it. We could modify the queries to avoid bind variables, but this opens up a can of worms with having to format dates correctly and deal with potential timezone issues, bind variables produce much more reliable code.
After some investigation, two options were pursued.
Option 1
PostgreSQL provides a number of parameters that can be set on a per connection basis to affect how the query plan is built. These parameters are boolean parameters that are true by default. They can be selectively set to false. There are many of these parameters, but several key ones are listed below:
- enable_seqscan – If false, the query planner will avoid sequential scans where possible.
- enable_mergejoin – If false, the query planner will not use merge joins.
- enable_hashjoin – If false, the query planner will not use hash joins.
These parameters were experimented with and it was found that all three had to be set to false in order to make the Java initiated queries perform efficiently. The PostgreSQL documentation specifies that these parameters should be avoided if possible, however they do provide a quick simple way of brute forcing the query planner into giving the desired behaviour.
Option 2
The problem was occurring due to the use of server side prepared statements. This prevents the query planner from getting access to the actual variables that will be used. In this case it is incorrectly assuming that a much larger dataset is going to be returned than is actually the case. The Osmosis application has been designed to retrieve all data in a small fixed number of queries meaning that there are no gains to be had from re-using prepared statement query plans, instead it is much more important to get the best query plan possible.
The PostgreSQL JDBC drivers provide the option of using client side prepared statements. This is discussed at the following URL:
http://jdbc.postgresql.org/documentation/83/server-prepare.html
At the time of writing, this approach is still being investigated but if it can be made to work will provide a cleaner solution than Option 1.
Conclusion
It is often assumed that JDBC Prepared Statements are a better approach than dynamic SQL from both a performance and functional point of view. While they are undoubtedly the right approach to create clean bug free code, this investigation showed that they can be severely detrimental in scenarios where an optimal query plan is critical to meeting performance requirements. In an application where the number of queries is low and the data sizes are large, the overheads of creating query plans from scratch are minimal and the impact of a sub-optimal query plan is expensive.
Prepared Statements should still be the preferred method used for building SQL queries where possible. They are likely to provide the best performance in a high volume online transactional application with large numbers of small queries due to their ability to be cached both at the client and server side. However when performance issues arise, careful attention should be paid to the impact of late-binding of dynamic variables. If bind variables are causing sub-optimal query plans to be generated, measures must be taken to provide manual hints to the query planner or find a way of disabling server side prepared statements.
As with most optimisations, normal Prepared Statement functionality should not be deviated from unless performance issues can be measured and attributed to them. This is another non-obvious issue that a performance engineer needs to be mindful of and identify when it occurs.


