Metadata vs data
When you clone an image, or the Splitfile executor encounters one you referenced
in a Splitfile, sgr
only downloads the metadata for the repository (or the
referenced image within the repository). The metadata consists of all sgr
state tables in splitgraph_meta
: images
, objects
, tags
and tables
.
Compared to the actual data, it's very lightweight, most of its footprint being
taken up by the object index.
When you check out an image, sgr
downloads the actual data, which is the set
of objects required by that image. When you query a table from a
layered checkout, sgr
only downloads objects that are
required to satisfy that query, as determined by the object index.
sgr
considers objects downloaded from a remote engine to be "cached," and can
evict them if the free cache space is limited. You can control the cache size by
the SG_OBJECT_CACHE_SIZE
variable in .sgconfig
. The default value is 10240
(10GB).
With layered querying, you can see how much data a query will require before you
execute it, by running EXPLAIN
on the query. For example, in the
2016 US Presidential Election precinct-level returns
dataset:
$ sgr sql -s splitgraph/2016_election "EXPLAIN SELECT candidate_normalized, SUM(votes) FROM precinct_results WHERE county_fips=11001 GROUP BY candidate_normalized"
GroupAggregate (cost=71991481.18..71992900.45 rows=1 width=64)
Group Key: candidate_normalized
-> Sort (cost=71991481.18..71991954.27 rows=189234 width=380)
Sort Key: candidate_normalized
-> Foreign Scan on precinct_results (cost=20.00..71908920.00 rows=189234 width=380)
Filter: (county_fips = 11001)
Multicorn: Objects removed by filter: 18
Multicorn: Scan through 2 object(s) (2.55 MiB)
JIT:
Functions: 7
...
This query will need to scan (and download, if it's not already locally cached) through 2.5 MiB of data.