For the past few months, we've been hard at work on something interesting and it's finally ready for people to start experimenting with.
Seafowl is our new standalone project, inspired by the design of modern cloud data warehouses like Snowflake and BigQuery, as well as the lessons we learned over the past four years of building Splitgraph and the Splitgraph DDN.
It's a CDN and HTTP-cache friendly database designed for analytics at the edge: dashboards, notebooks, visualizations and other data-driven read-intensive Web applications that need to be responsive while operating on large datasets.
curl https://demo.seafowl.io/q/30dcf5da00f94e3fbd28cf98242f7de3b164a440cadfaa6f7de74369ffc2177d.csv \
-i -XGET -H "Content-Type: application/json" \
-d@- <<EOF
{"query": "SELECT country_of_production, COUNT(*) AS count \
FROM supply_chains \
GROUP BY 1 ORDER BY 2 DESC"}
EOF
HTTP/2 200
...
cf-cache-status: HIT
{"country_of_production":"BRAZIL","count":2386600}
{"country_of_production":"ARGENTINA","count":260293}
{"country_of_production":"INDONESIA","count":155751}
{"country_of_production":"ECUADOR","count":96842}
{"country_of_production":"PARAGUAY","count":29848}
...
Seafowl is written in Rust and relies on Apache DataFusion to plan and execute queries. For analytical queries, it's faster than PostgreSQL with the parquet_fdw extension. We have some benchmarks which aren't as in-depth as TPC-DS, but they do show that Seafowl makes a lot of seemingly heavy SQL queries feasible for powering interactive applications.
For SELECT queries, Seafowl supports a large subset of the PostgreSQL dialect. If there's something missing, you can write a user-defined function for Seafowl in anything that compiles to WebAssembly.
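To give a flavour of the dialect, here's a small sketch of a table-free query using a CTE, UNION ALL and a CASE expression. It assumes a Seafowl instance running locally with default settings (we assume the HTTP API on port 8080 and queries POSTed to the /q endpoint here); adjust the URL for your own deployment.

# A table-free query exercising a CTE, UNION ALL and CASE.
curl -i -XPOST http://localhost:8080/q \
  -H "Content-Type: application/json" \
  -d@- <<EOF
{"query": "WITH nums AS (SELECT 1 AS n UNION ALL SELECT 2 UNION ALL SELECT 3) \
  SELECT n, CASE WHEN n % 2 = 0 THEN 'even' ELSE 'odd' END AS parity \
  FROM nums"}
EOF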
You can write data to Seafowl with standard SQL statements, including CREATE TABLE AS (for future dbt support!); there's a sketch below.

You can also "bake" a dataset into a Docker image by bundling Seafowl together with its data. This is a nifty scaling strategy for smaller, read-only datasets (well, up to 1GB, which can be as big as the 55M-row subset of the New York Taxi dataset we use for our benchmarks).
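As for the SQL write path mentioned above, here's a rough sketch. It assumes the same locally running instance with the default HTTP endpoint (port 8080, POST /q); depending on your access-control configuration, write queries may also need an Authorization header. Plain CREATE TABLE and INSERT statements work alongside CREATE TABLE AS:

# Create a table and insert a couple of rows.
curl -i -XPOST http://localhost:8080/q \
  -H "Content-Type: application/json" \
  -d@- <<EOF
{"query": "CREATE TABLE country_counts (country VARCHAR, total BIGINT)"}
EOF

curl -i -XPOST http://localhost:8080/q \
  -H "Content-Type: application/json" \
  -d@- <<EOF
{"query": "INSERT INTO country_counts VALUES ('BRAZIL', 2386600), ('ARGENTINA', 260293)"}
EOF

# Derive a new table from an existing one with CREATE TABLE AS.
curl -i -XPOST http://localhost:8080/q \
  -H "Content-Type: application/json" \
  -d@- <<EOF
{"query": "CREATE TABLE big_producers AS SELECT * FROM country_counts WHERE total > 1000000"}
EOF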
"The edge" is a handwavy and marketing-y term, but it describes what Seafowl is all about well enough. You can deploy Seafowl to the edge or cache query results at the edge.
Seafowl ships as a single 50MB binary that starts in 10ms, making it easily deployable to modern serverless environments, anywhere in the world. Our own tutorial uses Fly.io as an example.
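As a sketch of what that looks like locally, assuming you've downloaded the seafowl binary for your platform from the GitHub releases page into the current directory:

chmod +x seafowl
./seafowl
# With the default configuration, the HTTP API used in the examples above
# is served on port 8080 (see the docs for seafowl.toml options).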
Seafowl's query execution API follows HTTP cache semantics. This means you can put Seafowl behind a CDN like Cloudflare or a cache like Varnish and have query results cached and delivered to your users in milliseconds.
Even without a dedicated cache, you can get the benefits of caching query results in your users' browsers (to see it in action, go to our demo, press F12, go to the Network tab and refresh the page).
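Incidentally, this is what the long hexadecimal string in the URL of the first example is about: roughly speaking, it's the hex-encoded SHA-256 digest of the query text, so the URL (and therefore the cache key) only changes when the query changes. Here's a sketch of building such a request by hand; treat the hashing rule as an assumption and check the docs for the exact canonicalization:

QUERY="SELECT country_of_production, COUNT(*) AS count FROM supply_chains GROUP BY 1 ORDER BY 2 DESC"
# Hash the query text to build the cacheable URL, then pass the query
# itself in the request body so the server can execute it.
# (On macOS, use: shasum -a 256)
HASH=$(printf '%s' "$QUERY" | sha256sum | cut -d' ' -f1)
curl -i -XGET "https://demo.seafowl.io/q/$HASH.csv" \
  -H "Content-Type: application/json" \
  -d "{\"query\": \"$QUERY\"}"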
Seafowl is now available to download and experiment with, though, like all early-stage projects, it likely has plenty of bugs and missing features. As we continue its development, we might break APIs or change storage formats in a backwards-incompatible way.
Feel free to check out our Observable demo, go through our quickstart or through a longer tutorial in which you'll deploy Seafowl to Fly.io, put it behind Cloudflare or Varnish and reproduce our Observable demo notebook.
When something breaks, please do raise a GitHub issue. And, if it doesn't (or doesn't break badly enough), give us a GitHub star!
There are many features we're planning for Seafowl. Where appropriate, we'll also aim to upstream these changes into DataFusion itself.
Seafowl currently supports a limited subset of PostgreSQL operators and functions. The most notable missing feature is the JSON data type for storage of arbitrary semi-structured objects, as well as helper functions like json_extract_path / json_agg. See the GitHub issue for more details.
You can currently query Parquet/CSV files over HTTP from Seafowl, but you can't query any other databases "live", like you could with Splitgraph. It's always been difficult for us to write PostgreSQL foreign data wrappers in C, since we'd have to do a lot of work to correctly push down various parts of the PostgreSQL query tree. With the extensibility of DataFusion and the memory safety of Rust, we expect manipulating the query plan to be much easier.
See the GitHub issue for more details.
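In the meantime, querying a remote file looks roughly like this. It's a sketch using DataFusion-style CREATE EXTERNAL TABLE against a local instance; the file URL is a placeholder, and depending on your version the external table may land in a separate staging schema:

# Register a remote Parquet file as an external table (placeholder URL),
# then query it. Use STORED AS CSV for CSV files.
curl -i -XPOST http://localhost:8080/q \
  -H "Content-Type: application/json" \
  -d@- <<EOF
{"query": "CREATE EXTERNAL TABLE tripdata \
  STORED AS PARQUET \
  LOCATION 'https://example.com/tripdata.parquet'"}
EOF

# If your version doesn't use the staging schema, query tripdata directly.
curl -i -XPOST http://localhost:8080/q \
  -H "Content-Type: application/json" \
  -d@- <<EOF
{"query": "SELECT COUNT(*) FROM staging.tripdata"}
EOF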
Splitgraph has always been mostly PostgreSQL-first, with great support for PostgreSQL clients (including ones that run complex introspection queries, like Metabase).
With Seafowl, we flipped this upside down: it's HTTP-first, with basic PostgreSQL wire protocol support as an afterthought. Perhaps it would be interesting to polish that more, add more information_schema tables and make Seafowl queryable by existing PostgreSQL clients, dashboarding and BI tools.
This would also open up more data sources, for example, using Airbyte to load data into Seafowl.
No open-source project announcement nowadays is complete without teasing a managed version of it.
In our case, we already have Splitgraph and we're currently working on a better UI for it, one that moves us from obscure concepts of "data images" and "repositories" to more familiar ones like "tables" and "databases". Over time, we're also planning on migrating Splitgraph's query execution engine from PostgreSQL to Seafowl.
With that in mind, we could provide a management interface for your Seafowl instance or, later on, a true multi-tenant pay-for-what-you-use analytical query execution API, powered by Seafowl.
In the words of a certain company, "we're just getting started".