Archive for August, 2009

When Prepared Statements Go Bad

August 28th, 2009 by Brett Henderson

When writing Java applications that interact with a relational database, it is typical to use the JDBC PreparedStatement class to execute queries and updates.  There are several key reasons for using this class instead of the simpler Statement class:

  • Simplified use of variables.  The use of bind variables avoids having to escape the variables being added to the query.  This avoids dealing with formatting variables date formats, escaping string terminators, and eliminates SQL injection attacks.
  • Query Plan Re-Use.  The removal of dynamic variables from the query string allows the database to cache the resultant query plan.  The database engine will see two queries using the same query string and can re-use the previously calculated query plan.
  • Re-Use of Prepared Statements.  A prepared statement can be re-used for many operations against the database.  This may be used multiple times within a single transaction, or re-used across many transactions using client-side Prepared Statement caching.

There are exceptions to every rule however, and there are cases where Prepared Statements do not provide the optimal solution.  This article discusses a recent scenario in which using Prepared Statement functionality had a serious detrimental impact on performance, and discusses the options available for resolving the issue.

Problem

The Open Street Map project is an Internet community that aims to become the “Wikipedia of Maps”.  The project is building a vector data set containing features such as roads, geographical boundaries, coastlines, buildings, points of interest, and in fact any feature which exists in the physical world.  It has many parallels to Google Maps, but the data is freely available in vector form.  This geo-spatial data set is a technology enabler allowing all kinds of new and innovative tools and applications to be built.

The central Open Street Map database is approximately 1 terabyte in size, and contains many hundreds of millions of rows.  Statistics showing the growth and activity over time can be found at the following URL:

http://wiki.openstreetmap.org/wiki/Stats

A complete snapshot of the dataset without history can be downloaded in compressed XML form and is currently approximately 6.6 gigabytes expanding to approximately 125 gigabytes when uncompressed.  This snapshot known as the “planet” file provides the basis for consumers of the data to build their own databases and applications to use the data.  Due to the size of the planet file, it is not feasible to download too often and therefore consumers cannot get access to new data in real-time.  A new planet file is made available weekly at the following URL:

http://planet.openstreetmap.org

To solve the real-time problem of access to data, a tool called Osmosis is used to extract regular changesets from the database and post them on a public web server for download.  These changesets are available in day, hour, and minute intervals.  Osmosis is written in Java, and uses JDBC to access the production PostgreSQL relational database.  Osmosis queries changes from 8 major tables, and several supporting tables.  It extracts these changes based on timestamp.  It queries for all data created within a particular time interval, and writes the changes to a new changeset file.

Using these change files, it is possible for consumers to keep their own databases synchronised with the production database.  Using the minute changesets, it is possible to keep a replica database within approximately 6 minutes of the primary database.  This near real-time feed eliminates the need for most applications to directly access the production database drastically improving performance for applications and allowing the production database to scale to a much larger user base.

Recent enhancements to Osmosis resulted in many of the queries to be re-written.  Unfortunately, the performance of these queries was unacceptable.  The queries need to execute in such a way that their execution time is proportional to the amount of data being extracted.  The new queries were taking several hours to run regardless of the time interval being queried.  This is bad when extracting day changesets, and completely unworkable when extracting minute changesets.

Diagnosis

The database contains the following primary entity types:

  • Nodes – Represent a single point on a map.  Defined in terms of a latitude and longitude (eg. a point of interest).
  • Ways – Represent a linear path between several nodes (eg. a road).
  • Relations – A logical grouping of several nodes, ways and/or relations.

Each of those tables has a corresponding tag table which allows name/value pairs to be attached to describe the feature.  For example a road may have the tags highway=primary and name=<street name> attached to it.

There are several other key tables in the database, the key ones being:

  • Changesets – All edits are grouped into changesets.  Each node/way/relation is related to a changeset.
  • Users – All edits are performed by a registered user.  Each changeset is related to a user.

The database contains many other tables, but not important to this discussion.  For example, ways have a join table to the node table and relations have a generic join table to nodes, ways and relations.

The problematic query is shown below:

SELECT e.id, e.version, e.timestamp, e.visible, u.data_public,
u.id AS user_id, u.display_name, e.changeset_id, e.latitude, e.longitude
FROM nodes e
INNER JOIN (
SELECT id, version FROM
nodes WHERE timestamp > ? AND timestamp <= ?
) t ON e.id = t.id AND e.version = t.version
INNER JOIN changesets c ON e.changeset_id = c.id
INNER JOIN users u ON c.user_id = u.id

The query could be simplified, but the sub-select is included to improve code re-use.  The sub-select can be replaced with a different query that has different selection criteria without affecting the remainder of the query.

A test date range of 1 second in duration was used for analysis.  If a 1 second query can be made to run quickly, larger duration queries typically follow suit with their duration approximately proportional to data size.  The sample date range was from “2009-06-08 00:00:00” to “2009-06-08 00:00:01”.  This resulted in the following query:

SELECT e.id, e.version, e.timestamp, e.visible, u.data_public,
u.id AS user_id, u.display_name, e.changeset_id, e.latitude, e.longitude
FROM nodes e
INNER JOIN (
SELECT id, version FROM
nodes WHERE timestamp > '2009-06-08 00:00:00' AND timestamp <= '2009-06-08 00:00:01'
) t ON e.id = t.id AND e.version = t.version
INNER JOIN changesets c ON e.changeset_id = c.id
INNER JOIN users u ON c.user_id = u.id

This query was evaluated using the EXPLAIN command which produced the following query plan.


Query plan with 1 second interval

Query plan with 1 second interval

This query plan is reasonable and doesn’t explain the poor performance. The large node and medium changeset tables are joined via nested loops, and the small user table is joined via a hash join.  This query took less than 1 second to execute. It was apparent that the Java implementation was using a different query plan but not immediately obvious why.

The Java implementation was run again and was still executing several minutes later without result. The database process was examined and was causing heavy disk IO so it wasn’t blocked on a database lock.

Full JDBC tracing was enabled to determine if the JDBC driver was doing anything unusual such as messing with isolation levels. This trace can be enabled in the PostgreSQL driver by adding the following parameter to the connection URL:

loglevel=2

Examination of the trace showed the code performing a standard connection to the database and beginning a transaction as expected, then performing a prepare/bind/execute of the query, reading the query results, then committing the transaction and closing the connection. Nothing unusual was found in the trace.

The only difference that could be found between the Java initiated query and a console query was the use of bind variables. The Java code was modified to avoid the use of bind variables and directly pass the dates within the query. This was tested and found to produce efficient queries.

Some further investigation was done into PostgreSQL query planning and it was found that prepared statements can produce non-optimal query plans due to the fact that a query plan is created prior to the bind variables being available. In our case the time interval is typically small and selects a small subset of the data in the node table. The query planner doesn’t know this and assumes that a much larger dataset will be selected. As a result it resorts to full table scans which perform horribly when a small time interval is being selected. This is mentioned at the following URL:

http://jdbc.postgresql.org/documentation/83/server-prepare.html

This behaviour was verified at the console by preparing a statement and retrieving the query plan as follows:

PREPARE mystatement (timestamp, timestamp) AS
SELECT e.id, e.version, e.timestamp, e.visible, u.data_public,
u.id AS user_id, u.display_name, e.changeset_id, e.latitude, e.longitude
FROM nodes e
INNER JOIN (
SELECT id, version FROM
nodes WHERE timestamp > $1 AND timestamp <= $2
) t ON e.id = t.id AND e.version = t.version
INNER JOIN changesets c ON e.changeset_id = c.id
INNER JOIN users u ON c.user_id = u.id;
EXPLAIN EXECUTE mystatement('2009-06-08 00:00:00', '2009-06-08 00:00:01');

This produced the following plan:


Query plan with bind variables

Query plan with bind variables

This plan is using a full table scan on the node table which is completely inappropriate for returning 27 rows from a table containing in excess of 500 million rows.  The database server has 32GB of RAM, so smaller tables such as the users and even changeset tables can be scanned without major problems, but the main node/way/relation tables and their child tables must use existing indexes.

Resolution

At this point we knew that prepared statements were causing non-optimal query plans but had to find a way to resolve it. We could modify the queries to avoid bind variables, but this opens up a can of worms with having to format dates correctly and deal with potential timezone issues, bind variables produce much more reliable code.

After some investigation, two options were pursued.

Option 1

PostgreSQL provides a number of parameters that can be set on a per connection basis to affect how the query plan is built. These parameters are boolean parameters that are true by default. They can be selectively set to false. There are many of these parameters, but several key ones are listed below:

  • enable_seqscan – If false, the query planner will avoid sequential scans where possible.
  • enable_mergejoin – If false, the query planner will not use merge joins.
  • enable_hashjoin – If false, the query planner will not use hash joins.

These parameters were experimented with and it was found that all three had to be set to false in order to make the Java initiated queries perform efficiently. The PostgreSQL documentation specifies that these parameters should be avoided if possible, however they do provide a quick simple way of brute forcing the query planner into giving the desired behaviour.

Option 2

The problem was occurring due to the use of server side prepared statements. This prevents the query planner from getting access to the actual variables that will be used. In this case it is incorrectly assuming that a much larger dataset is going to be returned than is actually the case. The Osmosis application has been designed to retrieve all data in a small fixed number of queries meaning that there are no gains to be had from re-using prepared statement query plans, instead it is much more important to get the best query plan possible.

The PostgreSQL JDBC drivers provide the option of using client side prepared statements. This is discussed at the following URL:

http://jdbc.postgresql.org/documentation/83/server-prepare.html

At the time of writing, this approach is still being investigated but if it can be made to work will provide a cleaner solution than Option 1.

Conclusion

It is often assumed that JDBC Prepared Statements are a better approach than dynamic SQL from both a performance and functional point of view. While they are undoubtedly the right approach to create clean bug free code, this investigation showed that they can be severely detrimental in scenarios where an optimal query plan is critical to meeting performance requirements. In an application where the number of queries is low and the data sizes are large, the overheads of creating query plans from scratch are minimal and the impact of a sub-optimal query plan is expensive.

Prepared Statements should still be the preferred method used for building SQL queries where possible. They are likely to provide the best performance in a high volume online transactional application with large numbers of small queries due to their ability to be cached both at the client and server side. However when performance issues arise, careful attention should be paid to the impact of late-binding of dynamic variables. If bind variables are causing sub-optimal query plans to be generated, measures must be taken to provide manual hints to the query planner or find a way of disabling server side prepared statements.

As with most optimisations, normal Prepared Statement functionality should not be deviated from unless performance issues can be measured and attributed to them. This is another non-obvious issue that a performance engineer needs to be mindful of and identify when it occurs.

Build Pipeline - Software quality through repeatability

August 16th, 2009 by Oscar Huseyin

What are the hallmarks of a quality software development team?  What attributes stand out from software which gives it the feeling of quality?  These questions and many more, lead to the seemingly elusive quality answer.  Certainly, answers to these quality questions have found themselves entrenched in today’s modern software development life cycle process, however, lm continually presented with quality challenges in all projects which l have been fortunate to be a part of.

Throughout my career, l have seen many quality processes and patterns, all with great intentions, both succeed and fail.  Some processes have been too strict and focussed on impractical code coverage targets, whilst others were focused on minimal unit testing but strong focus on integration testing.  Some programs embraced automation, others rejected it.  Coupled with the heterogeneous and diverse business domains, it can be very difficult to select and accept quality patterns as each of them present their own unique challenges to the quality problem.

One thing that l have proven irrefutable is that repeatability is a key enabler of quality.  Repeatability is, after all, the critical component of the Scientific Method which has roots from the past times of Galileo.  Repeatability reinforces the strength of observation, and observation allows identification of patterns, facilitating classification, behavioural analysis and many other exploitable scientific methods of analysis. Repeatability in software development is also a critical dimension in quality assurance. Any process that is automated is highly repeatable and precise, leaving no room for error. This is why computers were created in the first place! However small a task, automation can add significant value to the overall quality dimensions of the system.

On a macro scale, teams within a program will have a number of concerns which they provide services for. For example, the environments or infrastructure teams will build and maintain all environments during the project. They will ensure deployments are executed to the target environments and availability targets of the environments under-management are maintained. The environments team could benefit from an automated deployment system, which is developed to ensure consistency in configuration and deployments. Again, repeatability in deployments increases the quality of the deployments and manifests into higher application availability times and lowers defects relating to environment misconfigurations.

Until now, l’ve described and illustrated that repeatability can increase the quality of deployments if facilitated through an automation framework. This approach is not only relevant for the environments team, and a more holistic approach to automation is required to ensure all facets of the SDLC benefit from the quality enablers of repeatability. This is the task of the Build Pipeline.

The build pipeline has cross-cutting concerns which define the processes for all technology teams in the SDLC. Primarily focussing on development, build engineering and environments teams, the build pipeline formalises and provides clarity into the software development and release processes of the program. Each of the development, build engineering and environments teams will assume custodianship of a portion of the build pipeline to enable the end-to-end artefact creation and assurance process.

Figure 1 - The Build Pipeline

The build pipeline begins with the developers local build. Here, the developer will execute unit testing for the purposes of check-in. Once verified, the process is repeated on the Continuous Integration server, where an additional deployment and integration test cycle is executed. Once verified, the build engineering team will deploy the application into a deployment test environment and reverify. Finally, the environment team will take the, now highly verified software artifacts, and fan-out the deployments into all downstream environments, albeit, System Integration Testing, User Acceptance Testing, even, Production. One key point to mention here, the verification processes in the build pipeline will serve as a quality contract between the teams. For example, the build engineering teams will assert the applications units and integration’s are tested (through unit and integration testing performed in both development and continuous integration) and fail the pipeline operation if this contract is broken.

Realising the build pipeline requires diligence, commitment, collaboration and attention to detail. Once employed, the build pipeline has a proven quality yield that reduces cost through shorter testing cycles and reduced defect numbers. Furthermore, consistent and repeatable deployments through automation will promote the consistency and fault-free environment configurations. Sure, there are upfront costs associated in establishing the build pipeline, however, ignoring the need for one and assuming quality is achievable through our novel methods of the past, have time and time again, proven to be a false economy.