Clone vs checkout
You can clone a Splitgraph repository using the sgr clone command:
$ sgr clone some_namespace/some_repository
This looks for the repository in all remotes currently registered in SG_REPO_LOOKUP and clones it from the first remote that contains it.
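The "first remote that contains it" rule can be illustrated with a small Python sketch. This is a toy model and not sgr's actual implementation: the function find_remote, the remote names and the remote_contents mapping are all hypothetical, and the sketch assumes only that the lookup list is ordered.

```python
from typing import Dict, List, Optional, Set


def find_remote(repository: str, lookup: List[str],
                remote_contents: Dict[str, Set[str]]) -> Optional[str]:
    """Return the first remote in `lookup` that hosts `repository`, else None."""
    for remote in lookup:  # order matters: earlier remotes take priority
        if repository in remote_contents.get(remote, set()):
            return remote
    return None


# Hypothetical remotes and their contents, for illustration only.
remote_contents = {
    "remote_a": {"some_namespace/other_repository"},
    "remote_b": {"some_namespace/some_repository"},
    "remote_c": {"some_namespace/some_repository"},
}
lookup = ["remote_a", "remote_b", "remote_c"]

chosen = find_remote("some_namespace/some_repository", lookup, remote_contents)
print(chosen)
```

In this sketch both remote_b and remote_c host the repository, but remote_b is chosen because it comes earlier in the lookup list.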
By default, cloning a Splitgraph repository works differently from cloning a Git repository: sgr only clones the repository's metadata, which is lightweight compared to the actual data but still lets the user get an overview of the repository.
For example, consider the 2016 US Presidential Election precinct-level returns dataset:
# This only transfers a few KB of metadata.
$ sgr clone splitgraph/2016_election
Gathering remote metadata...
Fetched metadata for 1 image, 1 table, 20 objects and 1 tag.
# How big is the actual image?
$ sgr show splitgraph/2016_election:latest
Image splitgraph/2016_election:3835145ada3f07cad99087d1b1071122d58c48783cbfe4694c101d35651fba90
Created at 2019-10-10T15:51:41.122370
Size: 26.75 MiB
No parent (root image)
Tables:
precinct_results
sgr tries to be as lazy as possible and only downloads the actual data when a query or a checkout requires it. You can override this behavior by passing --download-all to sgr clone.
Checkout
On checkout of an image, sgr gathers all objects required by that image, downloads them and assembles them into tables in a process called "materialization".
$ sgr checkout splitgraph/2016_election:latest
Need to download 20 objects (26.75 MiB), cache occupancy: 492.69 MiB/10.00 GiB
Fetching 20 objects, total size 26.75 MiB
Getting download URLs from registry PostgresEngine data.splitgraph.com (5a87...@data.splitgraph.com:5432/sgregistry)...
100%|███████████| 20/20 [00:12<00:00, 1.66obj/s, object=o1ccf32547...]
Checked out splitgraph/2016_election:3835145ada3f.
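The difference between a metadata-only clone and a checkout can be sketched as a toy Python model. Nothing here is part of sgr: ToyRemote, ToyLocalRepo, the object IDs and the sizes are hypothetical. The sketch assumes that objects already in the local cache are not downloaded again, and it does not model the assembly of objects into tables.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ToyRemote:
    """Hypothetical stand-in for a remote: object ID -> size in bytes."""
    objects: Dict[str, int]


@dataclass
class ToyLocalRepo:
    """Hypothetical local state after a metadata-only clone."""
    image_objects: List[str]                      # metadata: objects the image needs
    cache: Set[str] = field(default_factory=set)  # objects actually downloaded

    def missing(self) -> List[str]:
        """Objects the image needs that are not in the local cache yet."""
        return [o for o in self.image_objects if o not in self.cache]

    def checkout(self, remote: ToyRemote) -> int:
        """Fetch only the missing objects and return the bytes transferred."""
        to_fetch = self.missing()
        transferred = sum(remote.objects[o] for o in to_fetch)
        self.cache.update(to_fetch)
        return transferred


remote = ToyRemote(objects={"o1": 100, "o2": 250, "o3": 50})
repo = ToyLocalRepo(image_objects=["o1", "o2", "o3"])  # clone: metadata only

first = repo.checkout(remote)   # first checkout downloads every object
second = repo.checkout(remote)  # second checkout finds all objects cached
print(first, second)
```

In the model, the clone step records only which objects the image needs, so no object data is transferred until checkout is called.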