Inspecting images
Inspect image metadata using the sgr show
command:
$ sgr show namespace/repository:image_hash
Inspect tables in a given image using the
sgr table
command:
$ sgr table namespace/repository:image_hash table_name
Inspect object metadata with the
sgr object
command:
$ sgr object object_id
These commands require you to clone the repository (but not its actual data).
It's also possible to inspect the metadata on a remote repository without
cloning any images by passing the -r
or --remote
flag to these commands:
$ sgr show -r data.splitgraph.com splitgraph/geonames:latest
Image splitgraph/geonames:0b77a102cbabe2de8597dcab71f1c333f16f9e1963c9c8982988e10e65989c62
Created at 2019-12-11T12:20:08.447974
Size: 539.95 MiB
Parent: 0000000000000000000000000000000000000000000000000000000000000000
Tables:
all_countries
Example
Let's further inspect the object that the first example image, example/repo_1
,
used.
$ sgr object o26c6d8345cba276f807d7bcf906531568f309c2609a3420d98c01a6c99b166
Object ID: o26c6d8345cba276f807d7bcf906531568f309c2609a3420d98c01a6c99b166
Namespace: example
Format: FRAG
Size: 963.00 B
Created: 2020-04-06 10:17:15.650187
Rows inserted: 10
Insertion hash: 37be08d7c1aca256f5e64860afe7db21e6f7c47b00cdf23f65b84576bc209f41
Rows deleted: 0
Deletion hash: 0000000000000000000000000000000000000000000000000000000000000000
Column index:
key: [0, 9]
value: ['19581e27de7ced00ff1ce50b2047e7a567c76b1cbaebabe5ef03f7c3017bb5b7', 'ef2d127de37b942baad06145e54b0c619a1f22327b2ebbcfbec78f5564afe39d']
There are a few interesting things in this object's metadata. Firstly, it's
stored in the FRAG
format. A Splitgraph table consists of multiple fragments
that can partially overwrite each other.
Every fragment also has a column index: this is a metadata entry showing, for every column, the minimum and maximum values that this fragment affects (deletes or inserts). This is used when querying large remote Splitgraph tables to not download fragments that definitely don't match a given query. For qualifiers on sorted columns (like the primary key), this can mean only downloading one or two fragments to satisfy a query to a table composed of hundreds.
Finally, every fragment also has an insertion and a deletion hash: the insertion hash is the hash of all the rows that this fragment inserts and the deletion hash is the hash of all the rows that this fragment deletes. These hashes are homomorphic (every row is hashed individually and then the hashes are summed up) and are similar to Facebook's LtHash. They have a really important property: the sum of all hashes of individual fragments composing a table is equal to the content hash of a whole table. This is used in deduplicating data and optimizing storage.
sgr
generates the deterministic object ID by combining the content hash
(insertion - deletion hash
) and the hash of the schema of a given object. This
means that Splitgraph fragments are content-addressable.
Now, let's look at the second object that the new version of the demo
table in
example/repo_2:new_data
is linked to.
$ sgr object o4882c12cdc4cb5b4e89481ebdb71585998633f19a72652debbb80980ae0f0b
Object ID: o4882c12cdc4cb5b4e89481ebdb71585998633f19a72652debbb80980ae0f0b
Namespace: example
Format: FRAG
Size: 552.00 B
Created: 2020-04-06 10:18:11.466667
Rows inserted: 4
Insertion hash: aa15011c4af8e7afb877b5de68ab782b69d63720ba5787bec54fbcd6838fb377
Rows deleted: 4
Deletion hash: 7463a64e15187558e6e1096368033dfab422b34503c20e94a83b7a3e4650df68
Column index:
key: [0, 11]
value: ['4a44dc15364204a80fe80e9039455cc1608281820fe2b24f1e5233ade6af1dd5', 'd4735e3a265e16eee03f59718b9b5d03019c07d8b6c51f90da3a666eec13ab35_UPDATED']
Location: created locally
As you can see, this new object is based on the object that was used by the very
first version of the data we generated
(o4882c12cdc4cb5b4e89481ebdb71585998633f19a72652debbb80980ae0f0b
). This is
because the demo data is static and so the first version of the demo
table in
example/repo_1
and example/repo_2
is in fact the same.
Since this object updates and deletes some rows, their values are included in its deletion hash, which is now non-zero.
Finally, the column index now spans all the values of the key
column that were
replaced or added. Hence, when there's a query against this table that is inside
that range, both fragments will be fetched to check if they match that query.
We can also inspect the actual object: every object in the sgr
cache is stored
as a CStore columnar storage file.
$ sgr sql "SELECT * FROM splitgraph_meta.o4882c12cdc4cb5b4e89481ebdb71585998633f19a72652debbb80980ae0f0b"
2 d4735e3a265e16eee03f59718b9b5d03019c07d8b6c51f90da3a666eec13ab35_UPDATED True
3 4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce_UPDATED True
10 4a44dc15364204a80fe80e9039455cc1608281820fe2b24f1e5233ade6af1dd5 True
11 4fc82b26aecb47d2868c4efbe3581732a3e7cbcc6c2efb32062c08170a05eeb8 True
0 None False
1 None False
sgr
stores the object with the same schema as the original table, plus a
Boolean flag showing whether a row has been updated/inserted (upserted) or
deleted. For deleted rows, sgr
only records the primary key, and pads the rest
of the table with NULL
s.
sgr
does not always store objects in the engine. When you push an image to an
external location, like Splitgraph, sgr
usually uploads them to an
S3-compatible storage location. sgr
only downloads the actual objects and
loads them into Postgres when you check out an image containing the objects.