Or follow us on social media:
Floe is under active development and will enter Beta soon. We’d love for you to help us shape the world's best Lakehouse SQL compute engine.

Surviving Legacy Code: Renegotiating Old Contracts

by·

In our first article, we talked about what it means to care for legacy code. This is what it looks like in practice. How do you go about changing existing, sometimes old, designs? What trouble can you expect behind those harmless lines of code?

Our query engine had lived for years with a 64K limit on rows and on some individual values, especially strings. For a long time, this was perfectly adequate, but with the work we were doing around JSON and geospatial (you may like this article on geo joins), this was starting to get a bit too limiting. Indeed, many of the use cases unlocked from these newly supported types required huge initial input data. Countries borders at a certain precision are only accurate with a high number of points. JSON blobs from other tools are often directly inserted in databases, then processed through an ETL pipeline. So we set out to raise it to 256K. Seems simple enough, right? Well, let's see how this ended up impacting a good part of our stack.

We did not have these limits in place for arbitrary reasons: they were tied to a compact format, with lengths encoded in 2-byte integers, and that simplicity had paid for itself over the years. To move beyond 64K, we had to enter the less elegant world of 3-byte lengths. We were not completely unprepared. Some of the people who had written the existing system were in the room, sketching the first landmarks for us: data structures that would need extra care, places where we could afford a more direct implementation, and improvements that might ease a future move to even higher limits.

Welcome to the Jungle

The road ahead seemed visible enough, so I started with the obvious work. This was not a one-line patch, but it still looked like the kind of refactoring you can push through with enough patience and discipline. The code was sprinkled with assumptions tied to the old limit: 2 for the number of bytes, 16 for the number of bits, 64K, uint16_t, Short. Not too hard to handle, just replacing hard-coded values with well-defined constants.

Other places in the code had more subtle dependencies to these constants. Some storage structures, notably our string dictionaries, depended on 64K memory pages for internal reasons. Some temporary value allocations relied on a simple bump allocator for speed and global memory usage, backed by memory pages that were now too small to hold the larger strings we wanted to support. None of these problems were especially difficult in isolation, but together they started to draw a picture of a system with more hidden roots than we assumed we had. And notably, they started to take more and more time. While we had a not-so-strict deadline, we were starting to stretch our estimates: between the refactoring tasks, the added validation we wanted to provide and the bug investigations each time a new issue presented itself, a couple of months were needed to get through everything.

Backward compatibility: when you have to remember you still have clients

There was another constraint, and it mattered as much as the new limit itself: we could not stop reading the data we already had. This is still a production-grade database, rows written with the old 2-byte format could not simply become inaccessible after an upgrade.

We did consider migration. In theory, we could have rewritten old data into the new format as part of the upgrade process. In practice, the volumes were too large for that to happen smoothly. Any solution that left the database starting up while the rewrite continued in the background brought its own set of headaches (like, "Do you start accepting queries on tables that have not been converted?"). So we went for the simpler solution: supporting both formats. It meant we added some extra logic, which in turn meant a possible performance cost, but it was a cost we could explain, and we already had a way to mitigate it: users could rebuild their data (a simple CTAS would rewrite the data in the new format), in chunks if needed, on their own schedule.

Legacy compatibility did add the kind of logic I usually dream of deleting. Here it was also a boon: it forced us to abandon a whole family of hard-coded assumptions. Once old and new formats had to coexist, the limit could no longer be treated as a simple constant, and it had to become a runtime configuration. This part made the reasons we needed to do the refactoring more direct: We could not just do a search and replace, but had to start sharing the values for the different limits.

That change propagated further than we first expected. It obviously touched the execution engine and its storage layer, but even our client tools, used to import and export data from the database to other systems, were impacted: we had optimizations in there built around these fixed values and now had to fetch them instead. And that is without talking about the backward compatibility that we did want to preserve: old client tools should still work, new client tools should still be able to talk to old systems.

The Empire Strikes Back

Finally, we got there. The engine could handle larger values, and legacy data still worked as expected. However, benchmarks were showing queries running slower than before, without using any of the larger strings in their dataset.

To understand the regression, I need to explain how data moves through our execution engine, once we get in our row format. When there is some work to be done, a node in our query tree will:

  1. start by allocating a Row Packet - essentially just a contiguous memory area of fixed size - to store output rows
  2. for each row to output a. reserve enough space for the largest row it believes it may produce (this avoids extra allocations and branching while computing the row) b. compute values and store them in the row we reserved c. once the row is complete, the allocation is shrunk to its actual size
  3. if the next row does not fit in the current Row Packet, a new one is allocated and the existing packet is pushed up the tree.

This works well, but we can get in some degenerate case if the estimate for the maximum row size cannot be precise. With variable-length data notably, that estimate can fall back to the maximum string size if we cannot do anything smarter. And once we moved from 64K to 256K, that estimate could become 256K as well, which happened to be the size of a whole Row Packet. For the affected nodes, we would reserve one full packet for a row, write relatively small data into it, shrink the row to what was actually needed, but not leave enough room left for a second row. For a 256K packet, we could have as low as a few thousand bytes of data! Queries were getting slower because they were moving almost empty Row Packets around rather than actual data.

Row Packets are one of the backbone structures of our execution engine. Because they are fixed-size, they also feed into memory estimation for the whole query. Changing them meant revisiting not just row storage, but also how we reason about memory and how we schedule concurrent queries so that they all have the memory resources they need.

It felt reasonable to assume that queries working on larger values would need more memory, but existing small queries should remain small, or at least close to it. That meant we could not simply bump the Row Packet size, and we had to make another design change: query nodes needed to ask for a Row Packet size based on the maximum row size they could produce. While we had to make the size dynamic, we did not want to introduce too many combinations - which would have made memory estimation harder to intuitively think about and might have introduced more complexity that what we wanted to deal with at that moment, so we settled on two sizes, 256KB and 1MB. 256KB being the old size, we could preserve the existing behavior for most queries. 1MB packets allowed us to store at least 3 large (256K) rows, which meant at least 3/4th of the row packets would be filled, giving us back the performance we lost. This was still a significant refactoring, because packet size had been treated as fixed in many places, but here again this falls in the same kind of changes: remove constants, add runtime parameters, share the parameters to each module which needed them.

Leaving Markers Behind

The implementation itself followed a fairly simple discipline, even if the change did not. We tried to keep one source of truth for the axioms we wanted to enforce, derive the related constants from it, and replace hard-coded values with something explicit. Where the programming language type system could encode an assumption, we leaned on that, but since most are not that complete, and we had modules in different languages, we also used assertions extensively.


Before:

void storeOutOfLine(const uint8_t *const out, RowAllocator &alloc, const uint16_t length, const char *data) {
    uint8_t *const outOfLine = (uint8_t *)alloc.allocateVariable(length + sizeof(uint16_t));
    *(uint16_t*)outOfLine    = length;
    memcpy(outOfLine + 2, data, length);
    *(intptr_t *)out = (intptr_t)((uint8_t *)outOfLine - out);
}

After:

const uint32_t MAX_STRING_SIZE = 256000;
const uint32_t STRING_LENGTH_SIZE_IN_BYTES = 3;
static_assert(MAX_STRING_SIZE <= (1 << (STRING_LENGTH_SIZE_IN_BYTES * BITS_PER_BYTE))); // The biggest string should always have a length we can represent
typedef struct uint24_t { ... } uint24_t;
typedef uint24_t strsize_t;
static_assert(sizeof(strsize_t) == STRING_LENGTH_SIZE_IN_BYTES);  // We need to make sure we can always store the biggest string we can produce
static_assert(MAX_STRING_SIZE <= MAX_ROW_SIZE);                   // A single value should always be able to fit in a single row.

...

void storeOutOfLine(const uint8_t *const out, RowAllocator &alloc, const strsize_t length, const char *data) {
    uint8_t *const outOfLine = (uint8_t *)alloc.allocateVariable(length + sizeof(length));
    memcpy(outOfLine, &length, sizeof(length));
    memcpy(outOfLine + sizeof(length), data, length);
    *(intptr_t *)out = (intptr_t)((uint8_t *)outOfLine - out);
}

For this kind of change, I find types and assertions more useful than comments scattered through the code. The changes are too broad for local commentary to provide enough context, so apart from a design document, using the compiler as a way to enforce assumptions felt the more solid approach. One issue with assertions is that it is still hard to understand what are the consequences if we want to change the assumption, and that is where comments can still help. If this part of the system needs to change again later, the ideal outcome is that the next engineer can follow those failing assumptions as a trail instead of rediscovering everything from scratch.

Limits shape design

Most of the individual code changes were rather mundane, the interesting part for me was the journey: each time one detail seemed settled, another hidden design choice appeared a little further down the path. After making the mechanical change from 64K to 256K in the code, the code broke because we designed our data structures with that limit in mind for our memory allocations. It broke because client tools had been thought and implemented given specific datum sizes. It broke because we had assumed we could always somewhat fill our row packets rather than moving one row at a time.

Touching limits in the code usually means touching limits in the design: different parts of the system depend on assumptions that were understood when the design was created, or added organically as new features came in. Raising a limit is renegotiating an old contract without tearing everything down and starting over. Worth keeping in mind, both when inheriting legacy code and when deciding which constraints of our own may one day harden into architecture.

Follow us here at Floe using the button above for updates on how we are building a data engine for modern cloud.

Author
Jordi
Follow us