Using Splitfiles to build datasets
Although you can create new images using sgr commit, there is a better way:
Splitfiles. Splitfiles are a declarative way to define new datasets inspired by
Dockerfiles. The same ideas hold here: every command defines a new image with a
deterministic hash (a combination of the hash of the previous image and the
specifics of the command). If the hash that's about to be produced by a command
already exists, sgr checks out the existing image instead, avoiding the actual data
transformation.
This means that you can not only easily produce derivative datasets, but also keep them up to date without extra effort.
For more information on the available Splitfile commands, see the Splitfile command reference.
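For example, a minimal Splitfile with one import and one SQL step could look like this (some_namespace/some_repo, some_table and derived are hypothetical names, used purely for illustration):
# Import a table from an upstream repository (hypothetical names)
FROM some_namespace/some_repo IMPORT some_table AS source
# Derive a new table from the imported one
SQL {
    CREATE TABLE derived AS
        SELECT * FROM source
}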
Example
In this demo, we'll run an SQL JOIN query on the two repositories that we produced to match up the keys from the first table with the keys from the second table.
sgr comes with a generator for this example Splitfile:
$ sgr example splitfile example/repo_1 example/repo_2 | tee example.splitfile
# Import the table from the latest image in the first repository
FROM example/repo_1 IMPORT demo AS table_1
# Import the table from a certain image (passed in as a parameter) in the second repository
FROM example/repo_2:${IMAGE_2} IMPORT demo AS table_2
# Create a join table
SQL {
CREATE TABLE result AS
SELECT table_1.key, table_1.value AS value_1,
table_2.value AS value_2
FROM table_1 JOIN table_2
ON table_1.key = table_2.key
}
This file consists of three commands:
- Import the demo table from the first repository into the new image and call it table_1. One thing to note is that importing doesn't actually consume extra space in the image: the new image is simply linked to the original object representing the first demo table. Since we didn't specify an explicit image or tag, the latest image in example/repo_1 will be used (see the snippet after this list for pinning the import to a specific image).
- Do the same with the example/repo_2 repository, but this time use a parameter, IMAGE_2, for the image. We will substitute this parameter at Splitfile execution time to choose the exact image we want to use.
- Finally, run an SQL statement against the newly built image with the two imported tables. In this case, we create a new table (called result) and put the results of the JOIN query in it.
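If we wanted to pin the first import to a specific image instead of the latest one, we could reference a tag or image hash explicitly, the same way the second command does with its parameter (the some_tag tag here is hypothetical):
FROM example/repo_1:some_tag IMPORT demo AS table_1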
Let's run this Splitfile using the original data in the example/repo_2
repository:
$ sgr build example.splitfile -o example/output --args IMAGE_2 original_data
Executing Splitfile example.splitfile with arguments {'IMAGE_2': 'original_data'}
Step 1/3 : FROM example/repo_1 IMPORT demo AS table_1
Resolving repository example/repo_1
Importing 1 table from example/repo_1:103cb2da2da0 into example/output
---> 27aeabd37f2c
Step 2/3 : FROM example/repo_2:original_data IMPORT demo AS table_2
Resolving repository example/repo_2
Importing 1 table from example/repo_2:97900a827962 into example/output
---> ee4dbe8a8167
Step 3/3 : SQL CREATE TABLE result AS SELECT table_1.key, table_1.va...
Executing SQL...
Committing example/output...
Processing table result
Computing table partitions
Indexing the partition key
Storing and indexing the table
100%|██████████████████| 1/1 [00:00<00:00, 10.72objs/s]
---> 4fd86d48715d
If you have ever seen Docker build a Dockerfile, this output will look familiar. Let's inspect the repository that we've just created:
$ sgr log example/output
H-> 4fd86d48715d2a3e250810fbdaa43d21ed0a894cd00de9bc118d9191e6294741 2020-04-06 10:23:47.448137 CREATE TABLE result...
ee4dbe8a816776012878cea4e593a6b564f98bf4b15f4255f2e6bf2f51a6f113 2020-04-06 10:23:47.042485 Importing 1 table from example/repo_2
27aeabd37f2ce3ef3b196b1fab541ae4aef379b0ae00b8271eab4297db38180c 2020-04-06 10:23:46.612629 Importing 1 table from example/repo_1
0000000000000000000000000000000000000000000000000000000000000000 2020-04-06 10:23:46.304362
$ sgr show example/output:latest
Image example/output:4fd86d48715d2a3e250810fbdaa43d21ed0a894cd00de9bc118d9191e6294741
CREATE TABLE result AS SELECT table_1.key, table_1.value AS value_1,...
Created at 2020-04-06T10:23:47.448137
Size: 2.61 KiB
Parent: ee4dbe8a816776012878cea4e593a6b564f98bf4b15f4255f2e6bf2f51a6f113
Tables:
result
table_1
table_2
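Since the image was built from a Splitfile, sgr also records its provenance. As a sketch (assuming your sgr version ships the provenance command), you can list the source images that example/output was built from:
$ sgr provenance example/output:latest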
Let's also look at the actual data we produced. Since we ran the Splitfile
against the original version of the example/repo_2
repository, the join table
should include all keys from 0 to 9:
$ sgr sql --schema example/output "SELECT * FROM result ORDER BY key"
0 5feceb66ffc86f38d952786c6d696c79c2dbc239dd4e91b46729d73a27fb57e9 5feceb66ffc86f38d952786c6d696c79c2dbc239dd4e91b46729d73a27fb57e9
1 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b
2 d4735e3a265e16eee03f59718b9b5d03019c07d8b6c51f90da3a666eec13ab35 d4735e3a265e16eee03f59718b9b5d03019c07d8b6c51f90da3a666eec13ab35
3 4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce 4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce
4 4b227777d4dd1fc61c6f884f48641d02b4d121d3fd328cb08b5531fcacdabf8a 4b227777d4dd1fc61c6f884f48641d02b4d121d3fd328cb08b5531fcacdabf8a
5 ef2d127de37b942baad06145e54b0c619a1f22327b2ebbcfbec78f5564afe39d ef2d127de37b942baad06145e54b0c619a1f22327b2ebbcfbec78f5564afe39d
6 e7f6c011776e8db7cd330b54174fd76f7d0216b612387a5ffcfb81e6f0919683 e7f6c011776e8db7cd330b54174fd76f7d0216b612387a5ffcfb81e6f0919683
7 7902699be42c8a8e46fbbb4501726517e86b22c56a189f7625a6da49081b2451 7902699be42c8a8e46fbbb4501726517e86b22c56a189f7625a6da49081b2451
8 2c624232cdd221771294dfbb310aca000a0df6ac8b66b696d90ef06fdefb64a3 2c624232cdd221771294dfbb310aca000a0df6ac8b66b696d90ef06fdefb64a3
9 19581e27de7ced00ff1ce50b2047e7a567c76b1cbaebabe5ef03f7c3017bb5b7 19581e27de7ced00ff1ce50b2047e7a567c76b1cbaebabe5ef03f7c3017bb5b7
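As a quick sanity check (the query below is purely illustrative), we can also aggregate over the join table; with all ten keys matched, this should return 10:
$ sgr sql --schema example/output "SELECT count(*) FROM result"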
Finally, since every image produced at execution time has a deterministic hash, rerunning the same Splitfile won't actually perform any computation:
$ sgr build example.splitfile -o example/output --args IMAGE_2 original_data
Executing Splitfile example.splitfile with arguments {'IMAGE_2': 'original_data'}
Step 1/3 : FROM example/repo_1 IMPORT demo AS table_1
Resolving repository example/repo_1
---> Using cache
---> 27aeabd37f2c
Step 2/3 : FROM example/repo_2:original_data IMPORT demo AS table_2
Resolving repository example/repo_2
---> Using cache
---> ee4dbe8a8167
Step 3/3 : SQL CREATE TABLE result AS SELECT table_1.key, table_1.va...
---> Using cache
---> 4fd86d48715d
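This caching is also what keeps derivative datasets cheap to update: rebuilding against a different version of an input only re-executes the steps whose inputs actually changed. For example, assuming example/repo_2 carried another tag such as updated_data (hypothetical here), we could point the same Splitfile at it:
$ sgr build example.splitfile -o example/output --args IMAGE_2 updated_data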
Sample Splitfiles
RDU weather data
This is a self-contained example that doesn't pull datasets from Splitgraph: it uses the RDU weather dataset from data.gov and demonstrates building and rebuilding Splitfiles.
2016 US Election
This example uses a Splitfile to build a dataset that joins several datasets together:
- US Census demographic data (source)
- Census tracts designated as Qualified Opportunity Zones (source)
- 2016 US Presidential Election precinct-level returns (source)
It then runs a quick linear regression against the resultant dataset to see if there is a correlation between the voting patterns in a given county and the fraction of QOZ-qualified Census tracts in that county.
Other Splitfiles
There are multiple standalone Splitfiles on GitHub that you can run against various datasets on Splitgraph.