Chunking data
Usually, sgr splits large tables into multiple objects. At commit time, it chunks new tables in an image into fragments of 10000 rows by default. You can change this by passing --chunk-size to sgr commit or by changing the SG_COMMIT_CHUNK_SIZE configuration flag/environment variable.
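For example, either of the following commits the pending changes using 20000-row fragments (my/repository is a placeholder name):

$ sgr commit my/repository --chunk-size 20000
$ SG_COMMIT_CHUNK_SIZE=20000 sgr commit my/repository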
By default, sgr chunks by primary key. You can also sort data inside a chunk by passing --chunk-sort-keys to sgr commit. This lets cstore_fdw, sgr's storage backend, use its own skip lists to avoid reading parts of a chunk that aren't relevant to a query, reducing disk access.
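A minimal sketch, assuming --chunk-sort-keys accepts a JSON mapping of table names to sort columns (the table and column names here are made up; check sgr commit --help for the exact format):

$ # hypothetical table "weather", sorted by city_id and then timestamp inside each chunk
$ sgr commit my/repository --chunk-sort-keys '{"weather": ["city_id", "timestamp"]}'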
When you commit a change to a table, sgr represents that change as multiple objects that contain row-level additions/deletions and can partially "overlap" with old objects. sgr appends these new objects to the end of the table's object list.

You can override this behavior with --snap, so that sgr will repackage all changed tables as brand new objects.
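For example (my/repository is a placeholder):

$ sgr commit my/repository          # default: append delta objects to the table's object list
$ sgr commit my/repository --snap   # repackage all changed tables as brand new objects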
You can retain existing chunk boundaries when committing new data by passing --split-changesets to sgr commit. In this case, sgr will split delta-compressed changes according to the original table chunk boundaries. For example, if there's a change to the first and the 20000th row of a table that was originally committed with --chunk-size=10000, this will create 2 fragments: one based on the first chunk and one on the second chunk of the table.
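A minimal example, reusing the placeholder repository name from above:

$ # split the new delta objects along the table's original chunk boundaries
$ sgr commit my/repository --split-changesets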
Optimal object sizes
There is a performance and space tradeoff around the number of rows in, and the physical size of, every object (chunk), and hence the number of objects that a single table gets split into.
- Every extra object in a table has multiple kinds of overhead:
  - Fixed storage overhead in sgr's metadata (the splitgraph_meta.objects table), including the hashes and the index for that object.
  - Overhead when scanning through the object index at layered query plan time.
  - (Small) overhead of initializing a scan through the object by PostgreSQL.
  - Fixed overhead when checking out the table (more deltas have to be applied to the staging area).
  - Latency overhead for downloading lots of small objects (each of which is a file) over a few large ones.
- Datasets with lots of small objects can take better advantage of the object index: in layered querying, sgr might need to download or scan through a smaller fraction of the table to satisfy the query, since the table is more granular.
- Tables that share a lot of data with other tables have a better chance of reusing the same fragments when stored as smaller objects.
In practice, we have found that targeting object sizes of a few megabytes delivers a good balance. It is not currently possible to get sgr to automatically target a certain object size, but the required number of rows in each object can be approximated from the total image size and the number of rows in all of its tables.
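For example (illustrative, made-up numbers): if an image's tables hold about 10 million rows totalling roughly 1 GB, that is about 100 bytes per row, so a target object size of 5 MB corresponds to a --chunk-size of roughly 50000.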
Rechunking images
An image that includes a lot of changes still holds all of its history, since it contains all the intermediate objects together with deltas. This method allows sgr to perform quicker commits (since a change is appended to the end of a table's list of objects) at the expense of the image bloating up and having lots of dead rows. This is sometimes undesirable. If you have multiple images in a repository that show a dataset's history, you might want to optimize the latest image in that repository to decrease its size and speed up querying. In addition, you might want to be able to delete sensitive data from an image and not have it in the history at all.
To rechunk an image, splitting every table in it into disjoint objects again, check it out and recommit it as a snapshot, adding necessary indexes. For example:
$ sgr checkout my/big_image:latest
$ sgr commit my/big_image --snap --chunk-size 100000
To clean up "loose" or "dangling" objects that are not linked to any tables, you can use the sgr cleanup command.
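For example:

$ sgr cleanup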