Tracking dataset provenance
There's one other advantage to using Splitfiles instead of building images manually: provenance tracking. When you create an image with a Splitfile, the metadata of the image includes the specifics of the actual command that created it.
This means that we can output a list of images and repositories that were used in an image's creation.
Example
$ sgr provenance example/output:latest
example/output:4fd86d48715d2a3e250810fbdaa43d21ed0a894cd00de9bc118d9191e6294741 depends on:
example/repo_2:97900a8279627ec22b355d1211dd824f78f5840b6a0e30e1d285f4e5c6120016
example/repo_1:103cb2da2da000f3dce9512e3cc67434d7a3c977c0df259411c73a2687244706
Since the Splitfile in the previous section imported tables
from the two generated repositories, sgr
considers these repositories to be
dependencies of example/output
. (Note that you can still check out and use the
example/output
repository without those two repositories present).
We can also run the dependency query backwards, to find the downstream dependents of a given image:
$ sgr dependents example/repo_1:latest
example/repo_1:103cb2da2da000f3dce9512e3cc67434d7a3c977c0df259411c73a2687244706 is depended on by:
example/output:4fd86d48715d2a3e250810fbdaa43d21ed0a894cd00de9bc118d9191e6294741
We can also use this provenance data to reconstruct a Splitfile that will rebuild the newly created image, even without access to the original Splitfile itself:
$ sgr provenance example/output:latest --full
# Splitfile commands used to recreate example/output:4fd86d48715d2a3e250810fbdaa43d21ed0a894cd00de9bc118d9191e6294741
FROM example/repo_1:103cb2da2da000f3dce9512e3cc67434d7a3c977c0df259411c73a2687244706 IMPORT demo AS table_1
FROM example/repo_2:97900a8279627ec22b355d1211dd824f78f5840b6a0e30e1d285f4e5c6120016 IMPORT demo AS table_2
SQL {CREATE TABLE result
AS SELECT table_1.key
, table_1.value AS value_1
, table_2.value AS value_2
FROM table_1
INNER JOIN table_2 ON table_1.key = table_2.key}
Note this is a more powerful feature than what Dockerfiles offer (in Docker, the
final image is unaware of upstream build contexts). With sgr
, the provenance
information encodes everything necessary to rebuild an image. This means that we
can rebuild the example/output:latest
image against a different version of any
of its dependencies by simply substituting a different image hash into this
regenerated Splitfile:
$ sgr rebuild example/output:latest --against example/repo_2:new_data
Rerunning example/output:4fd86d48715d2a3e250810fbdaa43d21ed0a894cd00de9bc118d9191e6294741 against:
example/repo_2:new_data
Step 1/3 : FROM example/repo_1:103cb2da2da000f3dce9512e3cc67434d7a3c...
Resolving repository example/repo_1
---> Using cache
---> 27aeabd37f2c
Step 2/3 : FROM example/repo_2:new_data IMPORT demo AS table_2
Resolving repository example/repo_2
Importing 1 table from example/repo_2:48407a3cef8a into example/output
---> ca85929a90e7
Step 3/3 : SQL {CREATE TABLE result AS SELECT table_1.key ...
Executing SQL...
Committing example/output...
Processing table result
Computing table partitions
Indexing the partition key
Storing and indexing the table
100%|█████████████████| 1/1 [00:00<00:00, 6.33objs/s]
---> 46b6a361bebd
Since this example uses the same image from example/repo_1
, sgr
is able to
cache the first step of the rebuild. But since we altered the example/repo_2
image (by specifying --against example/repo_2:new_data
), sgr
invalidates the
subsequent layers and reruns their build steps. You can think of this
invalidation in the same way as Dockerfile cache invalidation.
Let's examine the new result:
$ sgr sql --schema example/output "SELECT * FROM result ORDER BY key"
2 d4735e3a265e16eee03f59718b9b5d03019c07d8b6c51f90da3a666eec13ab35 d4735e3a265e16eee03f59718b9b5d03019c07d8b6c51f90da3a666eec13ab35_UPDATED
3 4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce 4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce_UPDATED
4 4b227777d4dd1fc61c6f884f48641d02b4d121d3fd328cb08b5531fcacdabf8a 4b227777d4dd1fc61c6f884f48641d02b4d121d3fd328cb08b5531fcacdabf8a
5 ef2d127de37b942baad06145e54b0c619a1f22327b2ebbcfbec78f5564afe39d ef2d127de37b942baad06145e54b0c619a1f22327b2ebbcfbec78f5564afe39d
6 e7f6c011776e8db7cd330b54174fd76f7d0216b612387a5ffcfb81e6f0919683 e7f6c011776e8db7cd330b54174fd76f7d0216b612387a5ffcfb81e6f0919683
7 7902699be42c8a8e46fbbb4501726517e86b22c56a189f7625a6da49081b2451 7902699be42c8a8e46fbbb4501726517e86b22c56a189f7625a6da49081b2451
8 2c624232cdd221771294dfbb310aca000a0df6ac8b66b696d90ef06fdefb64a3 2c624232cdd221771294dfbb310aca000a0df6ac8b66b696d90ef06fdefb64a3
9 19581e27de7ced00ff1ce50b2047e7a567c76b1cbaebabe5ef03f7c3017bb5b7 19581e27de7ced00ff1ce50b2047e7a567c76b1cbaebabe5ef03f7c3017bb5b7
Since the first two rows are now missing from example/repo_2
, the JOIN result
doesn't have them either.
The Splitfile executor still caches images in this case, so if we do a rebuild
against the original version of example/repo_1
, sgr
will check out the first
version of example/output
from the cache:
$ sgr rebuild example/output:latest --against example/repo_2:original_data
Rerunning example/output:46b6a361bebd3c16ea23030f91645f683487690b8c695f4585c1b01238ec1d1d against:
example/repo_2:original_data
Step 1/3 : FROM example/repo_1:103cb2da2da000f3dce9512e3cc67434d7a3c...
Resolving repository example/repo_1
---> Using cache
---> 27aeabd37f2c
Step 2/3 : FROM example/repo_2:original_data IMPORT demo AS table_2
Resolving repository example/repo_2
---> Using cache
---> ee4dbe8a8167
Step 3/3 : SQL {CREATE TABLE result AS SELECT table_1.key ...
---> Using cache
---> 4fd86d48715d