Pushing datasets
Pushing data to other engines
Like Git, sgr
allows sharing datasets with other sgr
engines in a
decentralized fashion. The
two-engine example
showcases pushing data between two engines linked together with docker-compose.
Merging changes and pull requests
sgr
does not currently support merging changes and pull requests. Instead, it
assumes images are immutable and on push, it uploads all images that do not
exist on the remote engine. You can optionally overwrite image tags that already
exist, or push/pull only a single image.
We have found that pull requests aren't as useful for continuously changing datasets as they are for code. This is because most workflows involve ingesting data from outside sources (CSV files, scraped data, other databases) into the dataset, so it's better to apply any changes or patches to the source or intermediate dataset after it's already been ingested.
In this respect, sgr
is closer to Docker than to Git. There is no point in
submitting a pull request to a Docker image, as any results will be overwritten
the next time the image is built. Instead, developers propose changes to the
code that builds the image. Similarly, sgr
supports the workflow in which
developers submit pull requests to the Splitfile that builds the dataset,
allowing proposed changes to persist in future datasets.
You can approximate a workflow similar to pull requests by using tags and
forwarding a tag to the agreed-upon "master" version of the dataset. By default,
like Docker, sgr
includes a latest
tag which is a "floating" tag to the
image most recently pushed to its repository.