Five minute demo
Introduction
This demo will show you how to use sgr
to:
- query a large remote public data image (dataset) with any of your favorite tools, without having to download it completely
- write a Splitfile that can build a derivative image from it
- publish your derivative image on Splitgraph.
Prerequisites
You need to register on Splitgraph to pull
and publish data. If you followed the
installation instructions, you already did this. You can also
use sgr cloud register
or
sgr cloud login
to register or log
in from the sgr
client.
If you do not wish to register on Splitgraph, you can still use sgr
in a
decentralized fashion, sharing data with other sgr
instances. See the
decentralized demo for an introduction that showcases
importing data from CSV files and tracking changes to images, using sgr
's
version control functionality.
First, make sure that your sgr
engine is initialized and test your connection
to Splitgraph. This demo is written from the point of view of the
splitgraph-demo
user, so replace splitgraph-demo
with your own username in
command invocations.
$ sgr init
Initializing engine PostgresEngine LOCAL (sgr@localhost:5432/splitgraph)...
Waiting for connection......
Database splitgraph already exists, skipping
Ensuring the metadata schema at splitgraph_meta exists...
Running splitgraph_meta--0.0.1.sql
Running splitgraph_meta--0.0.1--0.0.2.sql
Running splitgraph_meta--0.0.2--0.0.3.sql
Installing Splitgraph API functions...
Installing CStore management functions...
Installing the audit trigger...
Engine PostgresEngine LOCAL (sgr@localhost:5432/splitgraph) initialized.
$ sgr sql -r data.splitgraph.com "SELECT splitgraph_api.get_current_username()"
splitgraph-demo
Pulling a data image
The
2016 US Presidential Election precinct-level returns
dataset
(source)
has 2 million rows, takes up 500MB as a CSV file and only 25MB in Splitgraph.
It's a perfect example to showcase sgr
's features.
Let's inspect this image remotely without cloning it, using
sgr show
. The
sgr
client uses the same protocol to communicate
with the registry as with the local sgr
engine, so you can use the same
commands for both local and remote operations.
$ sgr show -r data.splitgraph.com splitgraph/2016_election:latest
Image splitgraph/2016_election:3835145ada3f07cad99087d1b1071122d58c48783cbfe4694c101d35651fba90
Created at 2019-10-10T15:51:41.122370
Size: 26.81 MiB
No parent (root image)
Tables:
precinct_results
Let's also inspect the precinct_results
table in this image. sgr
stores tables as
a composition of multiple
content-addressable, immutable objects.
$ sgr table -r data.splitgraph.com splitgraph/2016_election:latest precinct_results
Table splitgraph/2016_election:3835145ada3f07cad99087d1b1071122d58c48783cbfe4694c101d35651fba90/precinct_results
Size: 26.81 MiB
Rows: 1989234
Columns:
year (integer)
stage (character varying)
special (boolean)
state (character varying)
state_postal (character varying)
state_fips (bigint)
state_icpsr (bigint)
county_name (character varying)
county_fips (bigint)
county_ansi (bigint)
county_lat (numeric)
county_long (numeric)
jurisdiction (character varying)
precinct (character varying)
candidate (character varying)
candidate_normalized (character varying)
office (character varying)
district (character varying)
writein (boolean)
party (character varying)
mode (character varying)
votes (bigint)
candidate_opensecrets (character varying)
candidate_wikidata (character varying)
candidate_party (character varying)
candidate_last (character varying)
candidate_first (character varying)
candidate_middle (character varying)
candidate_full (character varying)
candidate_suffix (character varying)
candidate_nickname (character varying)
candidate_fec (character varying)
candidate_fec_name (character varying)
candidate_google (character varying)
candidate_govtrack (character varying)
candidate_icpsr (character varying)
candidate_maplight (character varying)
id (integer, PK)
Objects:
o960395135f32c2da7b4b9371da06b0a9d00925472df2ec1ef687acf6f24894
o07f95f9944d673d8cc05a5b5174253adc513d377ccd811928ae0fe6923e451
o40e4e46594064e402ccdb914ffb8c9f59267a67b33420cf704073729a6671b
o1ccf32547f73be8047c3de06760b32c1077c3bd0b0d8309575ebe9ab8eb86c
oaf3cca3210e1903f06872d0041097dd284423958352d5cd7b5209eceec6cbf
odc1546c25cac950eca4b551642305d548172b554643f046a16cd9da236c895
oe60d81ef239980ceb5869fd787d159f227b59b66ab6ceef18aed45cd62ae8f
o9e77e782f8bda76be6a3f4d29b82661e32d614f7d63230619e8635b9b2b512
of06f6cfb7bd95047bb49bcd3ab6b98da63c2605aa9659c064cee3bd708b0e4
o0671aabeb9d4ce362515624616a5a202503db99d7a83007bfdf63fd71b91af
...
Now let's actually clone the image. By
default, this will only download the image's metadata. sgr
will not transfer
any actual data until it's required, unless you explicitly ask for it.
$ sgr clone splitgraph/2016_election
Gathering remote metadata...
Fetched metadata for 1 image, 1 table, 20 objects and 1 tag.
Querying the image
To interact with this image, you could check it out, which turns the image into a schema with PostgreSQL tables.
However, imagine if this dataset were a few hundred gigabytes large. Should you really need to download all of it to query a few rows?
Splitgraph objects contain metadata that can help it to discard objects that are irrelevant to a query. The US election dataset is stored ordered by county FIPS code and so queries that filter on this column only need to download a couple of objects instead of the whole dataset.
Let's check out the image into a so called "layered querying" schema. This sets up a Postgres foreign data wrapper for this image that acts as a shim and presents itself as normal Postgres table to clients.
$ sgr checkout --layered splitgraph/2016_election:latest
Checked out splitgraph/2016_election:3835145ada3f.
We're interested in precinct-level election results for District of Columbia (FIPS code 11001).
$ sgr sql -s splitgraph/2016_election "EXPLAIN SELECT \
candidate_normalized, SUM(votes) FROM precinct_results \
WHERE county_fips=11001 GROUP BY candidate_normalized"
GroupAggregate (cost=5698135.18..5699554.45 rows=1 width=64)
Group Key: candidate_normalized
-> Sort (cost=5698135.18..5698608.27 rows=189234 width=30)
Sort Key: candidate_normalized
-> Foreign Scan on precinct_results (cost=20.00..5677020.00 rows=189234 width=30)
Filter: (county_fips = 11001)
Multicorn: Objects removed by filter: 18
Multicorn: Scan through 2 object(s) (2.56 MiB)
JIT:
Functions: 7
...
Here, we can see that this query will only download two objects. Let's actually run it: behind the scenes, this will lazily download required data.
$ sgr sql -s splitgraph/2016_election "SELECT \
candidate_normalized, SUM(votes) FROM precinct_results \
WHERE county_fips=11001 GROUP BY candidate_normalized"
clinton 282830
in 6551
johnson 4906
stein 4258
trump 12723
sgr sql
is a shorthand to run an SQL query
against the current engine. However,
any PostgreSQL client will work with
sgr
, allowing you to work with sgr
using your favorite tools, including
DataGrip,
pgAdmin or
other clients like pgcli or DBeaver.
You can use sgr config -n
to print out a libpq connection string to the
current engine.
$ sgr config -n
postgresql://sgr:supersecure@localhost:5432/splitgraph
Using Splitfiles to build derivative images
Splitfiles are similar to Dockerfiles and are the easiest way to build Splitgraph images. They offer Dockerfile-like caching, provenance tracking, fast rebuilds, joins between datasets and full SQL support.
Let's create a Splitfile that summarizes the vote counts for every candidate in every state.
$ cat <<EOF > votes_by_state.splitfile
FROM splitgraph/2016_election IMPORT {
SELECT candidate_normalized, state_postal,
SUM(votes) AS total_votes
FROM precinct_results
GROUP BY candidate_normalized,
state_postal
} AS votes_by_state
EOF
Build this image. Even though we have it
pulled locally, this isn't necessary to build it: sgr
downloads required image
fragments automatically.
$ sgr build votes_by_state.splitfile
Executing Splitfile votes_by_state.splitfile with arguments {}
Step 1/1 : FROM splitgraph/2016_election IMPORT { SELECT candidate...
Resolving repository splitgraph/2016_election
Gathering remote metadata...
No image/object metadata to pull.
Importing 1 table from splitgraph/2016_election:3835145ada3f into votes_by_state
Processing table sg_tmp_32eba453c2f1dd1a13ca58ae3b901a47
---> ee8dc4bb7c47
Successfully built votes_by_state:ee8dc4bb7c47.
Take a look at the new image and run a query against it.
$ sgr table votes_by_state:latest votes_by_state
Table votes_by_state:ee8dc4bb7c4766cc30593b77e0b6b3dbb2863c06a4ed2274127d558901abeb2f/votes_by_state
Size: 6.38 KiB
Rows: 793
Columns:
candidate_normalized (character varying)
state_postal (character varying)
total_votes (numeric)
Objects:
o91e4939e520a19c02ca876c7c5248dcf96f9c3fa55552c6ed8359caea842e9
$ sgr sql -s votes_by_state "SELECT * FROM votes_by_state LIMIT 10"
FL 72993
IN 603
MD 22131
ND 6397
PA 39004
RI 9439
WY 6904
above NY 1
abraham NY 0
adams MD 44
sgr
caches the outputs of Splitfiles, in the same way that Docker caches
Dockerfile layers. It does not rebuild the image if the source data has not
changed.
$ sgr build votes_by_state.splitfile
Executing Splitfile votes_by_state.splitfile with arguments {}
Step 1/1 : FROM splitgraph/2016_election IMPORT { SELECT candidate...
Resolving repository splitgraph/2016_election
Gathering remote metadata...
No image/object metadata to pull.
---> Using cache
---> ee8dc4bb7c47
Successfully built votes_by_state:ee8dc4bb7c47.
Splitfiles also support joins between Splitgraph images. For an example, or to learn more about Splitfiles, check out the Splitfile reference or some sample Splitfiles.
Publishing a dataset on Splitgraph
Let's push your image to the Splitgraph registry, making it available publicly.
By default, sgr
pushes the image to a repository with the same name in your
own namespace. In the case of this demo user, that means sgr
pushes the
freshly built dataset to
splitgraph-demo/votes_by_state
.
$ sgr push votes_by_state
Pushing votes_by_state to splitgraph-demo/votes_by_state on remote data.splitgraph.com
Gathering remote metadata...
No objects to upload.
Uploaded metadata for 2 images, 1 table, 0 objects and 0 tags.
Setting upstream for votes_by_state to splitgraph-demo/votes_by_state.
$ sgr show -r data.splitgraph.com splitgraph-demo/votes_by_state:latest
Image splitgraph-demo/votes_by_state:ee8dc4bb7c4766cc30593b77e0b6b3dbb2863c06a4ed2274127d558901abeb2f
Importing 1 table from splitgraph/2016_election
Created at 2020-05-22T18:07:44.828748
Size: 6.38 KiB
Parent: 0000000000000000000000000000000000000000000000000000000000000000
Tables:
votes_by_state
sgr
lets you upload a README and a description for your dataset directly from
the CLI. Create a README and a metadata file.
$ cat <<EOF > dataset-readme.md
## votes_by_state
This dataset summarizes votes by state in the 2016 US Presidential Election.
This is also my first dataset (woohoo!)
EOF
$ cat <<EOF > splitgraph.yml
readme: dataset-readme.md
description: US 2016 Election votes summary by state.
EOF
Upload the metadata to Splitgraph (replace splitgraph-demo
with your
username).
$ sgr cloud metadata splitgraph-demo/votes_by_state
README updated for repository splitgraph-demo/votes_by_state.
Description updated for repository splitgraph-demo/votes_by_state.
The dataset is now available on
https://www.splitgraph.com/YOUR_USERNAME/votes_by_state.
sgr
also uploads the image's
provenance to Splitgraph,
allowing anyone to see how the image was built and sourced. In addition, the
2016 Election dataset dependents
page will now list the new image as a dependent.
You also have an OpenAPI-compatible REST API generated for the dataset, provided by PostgREST.
Querying the dataset through the Splitgraph Data Delivery Network
Splitgraph Data Delivery Network is a public SQL endpoint
that lets any client query any dataset that is hosted on or proxied by
Splitgraph. You can run SQL against your data immediately with
sgr cloud sql
:
$ sgr cloud sql 'SELECT * FROM "YOUR_USERNAME/votes_by_state".votes_by_state'
You can also run sgr cloud sql
with no parameters to get a libpq connection
string that you can pass to other PostgreSQL clients. For example:
$ pgcli $(sgr cloud sql)