Splitgraph has been acquired by EDB! Read the blog post.

Table partitioning and time travel queries: Seafowl case study

Nov 29, 2022 · By Peter Neumark

READING TIME: 8 min

Extending Seafowl with WebAssembly

How can Seafowl users extend the database's builtin capabilities in a safe, portable and efficient way? The answer is WebAssembly, read on to learn how!

Do we really need User-Defined Functions?

User-defined functions (UDFs) are bits of application-specific logic which can be invoked within database queries, extending the builtin capabilities of the database. You might be wondering why not keep application-specific logic in the application layer? Are user-defined functions (UDFs) the relic of a past when databases were designed to be directly queried by end-users?

Far from it! Placing domain-specific logic in the database is often the best (sometimes only) way to implement functionality such as:

Security policies such as PostgreSQL's RLS feature, which rely on UDFs to enforce application-specific security policies.
Custom aggregation functions similar to MIN, MAX and AVG which must run within the database to avoid shipping the entire set of input rows to the client to calculate output.
General, application-independent domain specific capabilities such as working with JSON fields, calculating geographical distances or full text search, extending the core DB's capabilities without having to modify the database itself.
Finally, if raw and cleaned data live in the same database, it makes sense to use the database to transform that data with tools like dbt instead of passing it through a separate process -- and UDFs might be needed to implement some special functionality for that. ¹

UDFs are an essential building block supported by the majority of databases one way or another, and in this regard Seafowl is no exception. What makes Seafowl special is its commitment to running "at the edge": as close to the user as possible, which means pretty much anywhere. But if Seafowl is to run anywhere, UDFs referenced by Seafowl queries should be just as portable. How can Seafowl offer UDF authors a platform which makes this possible?

Platforms

Like any program, a UDF targets a specific platform. This platform should be hardware-independent so UDFs keep working if the database is moved to a different machine, but also efficient enough to process large datasets in acceptable time.

Mature databases often have a custom language for defining UDFs such as PostgreSQL's PL/PGSQL or PL/SQL. For simple functions just beyond the capabilities of an SQL statement, these languages offer an ideal solution. For more complex functionality or code with lots of dependencies, many developers choose a general programming language, preferably one used in the application layer.

PostgreSQL and Redis both support authoring UDFs in Lua, for example. As a high-level language, Lua abstracts away the lower layers of the stack such as the CPU's instruction set or the flavor of the C standard library. Instead, the platform becomes the Lua virtual machine and standard library. Lua is designed to be embedded in the host process, allowing scripts to efficiently invoke native functions exposed by the host. It isn't the only scripting language embedded in databases. PostgreSQL for example supports many scripting languages including javascript and python. While the source of the UDF may be portable, Lua, Javascript, Python, Ruby rely on native modules typically written in C or C++ for performance-intensive code such as crypto libraries. These modules are co-located with the database and must be recompiled for each architecture supported by the database, limiting the portability of UDFs using these functions.

Custom logic written in compiled languages such as C++, Rust or Java is more difficult to invoke from a native database process, requiring either running the UDF in a separate process or compiling against the C ABI of the database and executing it with full access to the database. The security implications of the second option rule out this approach for UDFs.

Inter-process communication (IPC) with a separate JVM process provides a portable solution supported by Oracle and IBM DB2, but introduces a performance penalty for each UDF invocation along with additional corner cases arising from the need to synchronize a JVM process with the database. For example, should updating the Java bytecode for a UDF impact running transactions? If not, how many versions of the UDF must be kept in memory and how should the database keep track of which is the current version in each transaction? When is it safe to remove classes upon which a previous version of the UDF depends? If a transaction defining a UDF is rolled back, which classes' bytecode needs to be ejected from the JVM? The list goes on and on, and it's mostly independent of Java, stemming instead from the marriage of a transactional database and another process which dynamically loads UDF code.

To summarize, there's hardly a silver bullet: each solution for providing UDFs has its pros and cons. Worse, none of the solutions considered so far are truly portable in the sense that they can be used in the browser, on dedicated servers and desktops without serious limitations.

Enter WebAssembly

Predating Seafowl and Splitgraph by decades, developers of large native applications faced a similar conundrum. As new devices and platforms appeared, customers wanted to run their applications in previously unsupported environments. Porting existing applications to new platforms was the traditional remedy, a slow and expensive process.

In the 90s, Java tried to address this problem, promising developers they could "write once, run everywhere". As Java eventually found success in the very different niches of backend services and Android devices, its early focus on "running everywhere" understandably changed to working best where it's used.

With the explosion of web apps, the browser appeared as yet another platform for applications to support, albeit a highly unusual one: instead of a CPU instruction set, developers were to target a high level language, JavaScript. Garbage collection came with the package, but integers were out (JavaScript has only floating point numbers). Being so different from other platforms, porting existing applications to the web seemed more difficult than a complete rewrite in many cases.

Even green field projects struggled to find a stack sufficiently portable and efficient to ensure the application's long-term growth in desktop, mobile and browser environments.

As the performance of computers in general and JavaScript engines in particular improved, it became feasible to emulate a CPU in JavaScript. A single C/C++ codebase could now be compiled to run natively on desktops or mobile devices, but also run in the browser with emscripten, which targeted the javascript-powered CPU emulator. It was hacky, but it worked and the world started to take notice.

With growing demand for efficiently running compiled code in the browser, Chrome, Firefox and Safari started to include new virtual machine called WebAssembly (WASM), very different from the high-level javascript VM:

It wasn't optimized for a specific programming language like V8 or the JVM.
It doesn't support sophisticated just-in-time compilation.
It doesn't even have a garbage collector.

Instead, the WebAssembly specification provides a limited set of low-level instructions and linear memory model. Thanks to the relative ease of implementation, today there are many high quality implementations, allowing code targeting WebAssembly to run anywhere from embedded devices to browsers, CDNs, and yes- in databases as well.

WebAssembly as a platform for UDFs

How does WebAssembly as a platform for implementing UDFs compare to scripting languages like Lua or more traditional compiled solutions like Java or C? Since WebAssembly is designed to be used in very diverse environments, it cannot afford to make many assumptions:

Memory is a sequence of bytes: there is no support for typed memory.
As mentioned earlier, there is no garbage collector.
WebAssembly code has no access to host process resources such as the filesystem, stdout or environment variables unless explicitly provided via functions that the WASM module can call at runtime.
Catering to multi-tenant scenarios, hosts have find-grained control over how much computational power they grant a WASM function.
The builtin types are limited to 32 and 64 bit floating point and integer numbers, bytes and booleans.

That last point presents a significant challenge when using WebAssembly for authoring UDFs. Many of the types supported by Seafowl such as TEXT and DECIMAL cannot be simply represented solely by scalar numeric values.

One option is to limit UDFs to consuming and returning types natively supported by WebAssembly (this is ScyllaDB's approach).

For Seafowl's UDFs, we wanted users to be able to work with any datatype including strings, decimal numbers and dates.

In WebAssembly, memory is just a sequence of bytes and pointers are just 32 bit integers, similar to an array index. Any object which can be serialized to a byte stream can be passed by the host to a WebAssembly function by first writing the serialized bytes to WASM memory, then invoking the function with a pointer to the freshly written memory address as its argument. Similarly, the WASM function can serialize it's output as a byte stream to which it returns a pointer.

Seafowl uses MessagePack for serialization - think of it like a faster binary version of JSON.

We're just getting started

Today, Seafowl supports creating UDFs in any programming language which compiles to WASM. Check out our Rust example UDF repository! Any Seafowl type may be used for input values and return type.

One of the biggest advantages of using UDFs is aggregated records within the database. As of 2022 December, Seafowl does not yet support user-defined aggregation functions, but this support is planned. When it does arrive, you can be sure it will also use WebAssembly!

There's an interesting discussion on this approach in this Hacker News thread.↩

Deciding if I'm urban with WebAssembly and Seafowl

Splitgraph has been acquired by EDB! Read the blog post.

Extending Seafowl with WebAssembly

How can Seafowl users extend the database's builtin capabilities in a safe, portable and efficient way? The answer is WebAssembly, read on to learn how!

Do we really need User-Defined Functions?

Platforms

Enter WebAssembly

WebAssembly as a platform for UDFs

We're just getting started

Product

Support

Company

Splitgraph

Splitgraph has been acquired by EDB! Read the blog post.

Extending Seafowl with WebAssembly

How can Seafowl users extend the database's builtin capabilities in a safe, portable and efficient way? The answer is WebAssembly, read on to learn how!

Do we really need User-Defined Functions?

Platforms

Enter WebAssembly

WebAssembly as a platform for UDFs

We're just getting started

(Ab)using CDNs for SQL queries

Seafowl: a database for analytics at the edge

Product

Support

Company

Community

Splitgraph