User-defined functions (UDFs) are bits of application-specific logic which can be invoked within database queries, extending the builtin capabilities of the database. You might be wondering why not keep application-specific logic in the application layer? Are user-defined functions (UDFs) the relic of a past when databases were designed to be directly queried by end-users?
Far from it! Placing domain-specific logic in the database is often the best (sometimes only) way to implement functionality such as:
MIN
, MAX
and AVG
which must run within the database to avoid shipping the entire set of input rows to the client to calculate output.UDFs are an essential building block supported by the majority of databases one way or another, and in this regard Seafowl is no exception. What makes Seafowl special is its commitment to running "at the edge": as close to the user as possible, which means pretty much anywhere. But if Seafowl is to run anywhere, UDFs referenced by Seafowl queries should be just as portable. How can Seafowl offer UDF authors a platform which makes this possible?
Like any program, a UDF targets a specific platform. This platform should be hardware-independent so UDFs keep working if the database is moved to a different machine, but also efficient enough to process large datasets in acceptable time.
Mature databases often have a custom language for defining UDFs such as PostgreSQL's PL/PGSQL or PL/SQL. For simple functions just beyond the capabilities of an SQL statement, these languages offer an ideal solution. For more complex functionality or code with lots of dependencies, many developers choose a general programming language, preferably one used in the application layer.
PostgreSQL and Redis both support authoring UDFs in Lua, for example. As a high-level language, Lua abstracts away the lower layers of the stack such as the CPU's instruction set or the flavor of the C standard library. Instead, the platform becomes the Lua virtual machine and standard library. Lua is designed to be embedded in the host process, allowing scripts to efficiently invoke native functions exposed by the host. It isn't the only scripting language embedded in databases. PostgreSQL for example supports many scripting languages including javascript and python. While the source of the UDF may be portable, Lua, Javascript, Python, Ruby rely on native modules typically written in C or C++ for performance-intensive code such as crypto libraries. These modules are co-located with the database and must be recompiled for each architecture supported by the database, limiting the portability of UDFs using these functions.
Custom logic written in compiled languages such as C++, Rust or Java is more difficult to invoke from a native database process, requiring either running the UDF in a separate process or compiling against the C ABI of the database and executing it with full access to the database. The security implications of the second option rule out this approach for UDFs.
Inter-process communication (IPC) with a separate JVM process provides a portable solution supported by Oracle and IBM DB2, but introduces a performance penalty for each UDF invocation along with additional corner cases arising from the need to synchronize a JVM process with the database. For example, should updating the Java bytecode for a UDF impact running transactions? If not, how many versions of the UDF must be kept in memory and how should the database keep track of which is the current version in each transaction? When is it safe to remove classes upon which a previous version of the UDF depends? If a transaction defining a UDF is rolled back, which classes' bytecode needs to be ejected from the JVM? The list goes on and on, and it's mostly independent of Java, stemming instead from the marriage of a transactional database and another process which dynamically loads UDF code.
To summarize, there's hardly a silver bullet: each solution for providing UDFs has its pros and cons. Worse, none of the solutions considered so far are truly portable in the sense that they can be used in the browser, on dedicated servers and desktops without serious limitations.
Predating Seafowl and Splitgraph by decades, developers of large native applications faced a similar conundrum. As new devices and platforms appeared, customers wanted to run their applications in previously unsupported environments. Porting existing applications to new platforms was the traditional remedy, a slow and expensive process.
In the 90s, Java tried to address this problem, promising developers they could "write once, run everywhere". As Java eventually found success in the very different niches of backend services and Android devices, its early focus on "running everywhere" understandably changed to working best where it's used.
With the explosion of web apps, the browser appeared as yet another platform for applications to support, albeit a highly unusual one: instead of a CPU instruction set, developers were to target a high level language, JavaScript. Garbage collection came with the package, but integers were out (JavaScript has only floating point numbers). Being so different from other platforms, porting existing applications to the web seemed more difficult than a complete rewrite in many cases.
Even green field projects struggled to find a stack sufficiently portable and efficient to ensure the application's long-term growth in desktop, mobile and browser environments.
As the performance of computers in general and JavaScript engines in particular improved, it became feasible to emulate a CPU in JavaScript. A single C/C++ codebase could now be compiled to run natively on desktops or mobile devices, but also run in the browser with emscripten, which targeted the javascript-powered CPU emulator. It was hacky, but it worked and the world started to take notice.
With growing demand for efficiently running compiled code in the browser, Chrome, Firefox and Safari started to include new virtual machine called WebAssembly (WASM), very different from the high-level javascript VM:
Instead, the WebAssembly specification provides a limited set of low-level instructions and linear memory model. Thanks to the relative ease of implementation, today there are many high quality implementations, allowing code targeting WebAssembly to run anywhere from embedded devices to browsers, CDNs, and yes- in databases as well.
How does WebAssembly as a platform for implementing UDFs compare to scripting languages like Lua or more traditional compiled solutions like Java or C? Since WebAssembly is designed to be used in very diverse environments, it cannot afford to make many assumptions:
stdout
or environment variables unless explicitly provided via functions that the WASM module can call at runtime.That last point presents a significant challenge when using WebAssembly for authoring UDFs. Many of the types supported by Seafowl such as TEXT
and DECIMAL
cannot be simply represented solely by scalar numeric values.
One option is to limit UDFs to consuming and returning types natively supported by WebAssembly (this is ScyllaDB's approach).
For Seafowl's UDFs, we wanted users to be able to work with any datatype including strings, decimal numbers and dates.
In WebAssembly, memory is just a sequence of bytes and pointers are just 32 bit integers, similar to an array index. Any object which can be serialized to a byte stream can be passed by the host to a WebAssembly function by first writing the serialized bytes to WASM memory, then invoking the function with a pointer to the freshly written memory address as its argument. Similarly, the WASM function can serialize it's output as a byte stream to which it returns a pointer.
Seafowl uses MessagePack for serialization - think of it like a faster binary version of JSON.
Today, Seafowl supports creating UDFs in any programming language which compiles to WASM. Check out our Rust example UDF repository! Any Seafowl type may be used for input values and return type.
One of the biggest advantages of using UDFs is aggregated records within the database. As of 2022 December, Seafowl does not yet support user-defined aggregation functions, but this support is planned. When it does arrive, you can be sure it will also use WebAssembly!