Splitgraph has been acquired by EDB! Read the blog post.
Jun 24, 2020 · By Miles Richardson, Artjoms Iškovs
READING TIME: 6 min

Welcome to Splitgraph

Announcing Splitgraph, a data versioning and management system that allows you to work with data like you work with code.

Introduction

Splitgraph is a tool that makes working with data as easy as working with code. An oft-repeated trope is that data scientists spend 80% of their time cleaning data. This may not be strictly true, but certainly there is significant overhead for every deliverable. Software engineers have tools like Git and Docker that allow them to iterate and confidently deploy their code to production. For data scientists, the workflows in the experimentation phase are not nearly as productive. Too much effort is spent sourcing, cleaning and merging data together. And when it’s time to deploy a deliverable, it’s often impossible to keep it up to date without some sort of manual and brittle effort.

In a lot of ways, the domain of data science resembles system administration prior to the age of dev-ops and containerization. It used to be that every server was its own special snowflake, with brittle scripts holding it together and all sorts of failure modes when deploying new code. But “infrastructure-as-code” and technologies like Docker changed that for the better. We think that data science has similar problems. Whether you’re deploying ML models, generating reports, or building dashboards, maintaining them seems to require a lot of effort, and it seems like you’re repeating yourself every time you start a new project.

We’re building Splitgraph to address these problems. It’s inspired by Docker and Git, so its semantics and conventions should feel somewhat familiar. And it’s powered by Postgres, so it will work with nearly all your existing tools.

Where does 80% of the time go?

When starry eyed graduates and machine learning PhDs join the workforce as “data scientists,” they dream of spending all day training neural networks on GPU clusters, and applying Gaussian processes and support vector machines to their company’s databases to unlock new sources of business value. Turns out, there’s a lot more grueling, repetitive work to do before any of that can happen. Questions start to arise, like:

  • Which of these dozens of databases in the company warehouse has the data I need?
  • How can I get this data into the form I need it?
  • Where can I find some public data to augment this proprietary information?

When they finally manage to create something useful, a whole new set of questions arises:

  • Where can I deploy my version of this data?
  • How can my teammates pull my data and iterate on it?
  • When the upstream data sources change, how do I keep my deliverable up to date?

Sometimes, the data scientist can’t even answer these questions, and needs to pass it off to a data engineer instead, sort of like how developers used to need to send their code to a deployment team.

What do we like about software engineering workflows?

Software engineering has a lot of concepts that have made software more pleasant to build and maintain. Years of pain have led developers to coalesce around common workflows for versioning, building, packaging and deploying code:

  • Versioning and revision control allows for rapid experimentation and collaboration. Engineers can track and revert every change to code.
  • Package managers put millions of libraries at your fingertips, and integrating them into your project is as simple as running a command
  • Docker helps build your code into self-contained images in a common format that you can drop into any stack
  • With “infrastructure as code,” software engineers can control the deployment, provisioning and scaling process with repeatable, version-controlled definition files

Finally, good tools that implement all of this enhance existing abstractions without breaking them. You don’t need to adopt a new filesystem or use a specialized IDE to version your code with Git. You can build Docker images without changing your code to work in Docker or recompiling it to use some special Docker system calls. Git and Docker are good tools because they stay out of the way while still adding value.

Introducing Splitgraph

Splitgraph aims to bring these ideas and concepts into data science.

Splitgraph makes it easy to package datasets into self-contained data images that you can combine, query, extend and share with other Splitgraph instances.

Splitfiles give you a declarative language, inspired by Dockerfiles, for expressing data transformations in ordinary SQL familiar to any researcher or business analyst. You can reference other images, or even other databases, with a simple JOIN.

When you build data with Splitfiles, you get provenance tracking of the resulting data: it's possible to find out what sources went into every dataset and know when to rebuild it if the sources ever change. You can easily integrate Splitgraph into your existing CI pipelines, to keep your data up-to-date and stay on top of changes to upstream sources.

Splitgraph images are also version-controlled, and you can manipulate them with Git-like operations through a CLI. You can “check out” any image into a PostgreSQL schema and interact with it using any PostgreSQL client. Splitgraph will capture your changes to the data, and then you can commit them as delta-compressed changesets that you can package into new images.

Finally, Splitgraph isn't opinionated and doesn't break existing abstractions. To any existing PostgreSQL application, Splitgraph images are just another database. We have carefully designed Splitgraph to not break the abstraction of a PostgreSQL table and wire protocol, because doing otherwise would mean throwing away a vast existing ecosystem of applications, users, libraries and extensions. This means that a lot of tools that work with PostgreSQL work with Splitgraph out of the box.

We collect examples of the most interesting ones on the integrations page.

Conclusion

When we started building Splitgraph in 2018, there was nothing around that combined all these concepts into one product. Things are beginning to change and a lot of interesting tools are getting released to help data work. However, we still don't feel like any of them completely satisfy our vision. We list applications and services that are most similar to Splitgraph on our Frequently Asked Questions page.

Over the next few weeks, we will be publishing a series of posts that dive deeper into our philosophy, how Splitgraph works internally and some example use cases for Splitgraph.

In the meantime, Splitgraph is fully functional and ready to use. Head on to our documentation page for installation instructions, check out possible use cases or take a look at our integrations page to see how Splitgraph can help you.

Next Post
 
It took 10 minutes to add support for DataGrip to Splitgraph