Layered querying
sgr tries not to redo work that the database can do more efficiently. You can
check out a Splitgraph image into native PostgreSQL tables and create whatever
indexes you need to maximize performance for specific queries.
However, this workflow isn't suitable for exploratory queries on large datasets
where checking out an image takes too long. This is why sgr supports querying
images directly without materializing them into PostgreSQL tables.
This is done through so-called layered querying. Instead of loading an image and its
data into PostgreSQL tables, sgr "checks out" an image into a schema
consisting of foreign tables. They still present themselves as normal tables to
any application interacting with the engine, but behind the scenes, the
engine-side of the Splitgraph library interprets them. In particular, this lets
sgr inspect object metadata and determine on the fly which objects it needs to
download to satisfy the query.
Layered querying takes its name from the way Docker works: an application running inside a container thinks that it's interacting with a root filesystem, but is actually interacting with an OverlayFS mount that directs reads to the relevant layer.
sgr uses a fork of Multicorn, an extension for writing foreign data
wrappers (FDWs) in Python. This fork includes changes so that no data actually
passes through the Python runtime: sgr downloads the required objects and
materializes the required table regions, while the FDW performs scans using native
C code. This means layered querying competes with, and sometimes beats, PostgreSQL
on query performance, especially on analytical queries.
Layered querying supports both reads and writes.
The Splitfile executor uses layered querying to satisfy IMPORT and SQL
commands that read from other Splitgraph images. Currently, it still uses a
standard PostgreSQL checkout when writing to an image.
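As an illustration, here is a minimal Splitfile sketch. The table and column names are assumptions, not actual contents of the splitgraph/2016_election image; the IMPORT and SQL commands read their source data through layered querying, while the output image is written through a standard checkout:

```
# Hypothetical example: import a table from another image.
# The executor reads the source table via layered querying.
FROM splitgraph/2016_election:latest IMPORT precinct_results AS precinct_results

# Transform the imported data with SQL. The read side again goes
# through layered querying; the output table is written via a
# standard PostgreSQL checkout.
SQL {
    CREATE TABLE candidate_totals AS
    SELECT candidate, SUM(votes) AS total_votes
    FROM precinct_results
    GROUP BY candidate
}
```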
Layered checkouts
To use layered querying, check out a pulled dataset with the --layered
flag. For example:

```shell
$ sgr checkout --layered splitgraph/2016_election:latest
```
You do not need to change any application that queries this schema or download
any objects in advance: sgr will fetch them lazily when a client queries the table.
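For instance, after the layered checkout above you can query the schema as usual, and sgr will download only the objects the query needs. This is a sketch: the table name is an assumption about the image's contents, not documented fact.

```shell
# Query the layered schema through the engine; sgr lazily downloads
# only the objects required to answer this query.
# The table name (precinct_results) is hypothetical.
$ sgr sql 'SELECT * FROM "splitgraph/2016_election".precinct_results LIMIT 10'
```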
When to use layered querying
Depending on the use case, the dataset and the query, layered querying can be faster or slower than querying a native PostgreSQL table.
Layered querying (sgr checkout --layered) should work better for your use case
when:
- You are querying a dataset for which you have only cloned some metadata and
your query is only going to need a small part of the dataset. For example, if
you're filtering on a single value of the primary key or have an equality
filter on a column that has bloom filtering enabled, sgr will only download a
few objects to satisfy the query. You can check the amount of data that a query
will download by running it with EXPLAIN first.
- You have limited disk space. cstore_fdw has a massive space advantage over
PostgreSQL tables because it organizes data into columns, which lends itself
well to compression. This also means that full table scans are going to be
faster and have a smaller I/O load in sgr.
- You are running OLAP-style queries (GROUP BY across large amounts of data or
wide tables). cstore_fdw has a large performance advantage over PostgreSQL in
that case.
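As a sketch of the EXPLAIN workflow mentioned above (the table and column names are hypothetical, and the exact shape of the plan output is not guaranteed here), you could preview a query's download cost with something like:

```shell
# Ask the engine to explain the query before running it; with layered
# querying, the plan output includes the objects that would be downloaded.
# Table and column names are hypothetical.
$ sgr sql "EXPLAIN SELECT * FROM \"splitgraph/2016_election\".precinct_results WHERE county = 'Travis'"
```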
You should instead check an image out into a PostgreSQL schema (sgr checkout)
if:
- You want to query images that include a lot of upstream changes. Instead of forwarding queries to objects directly, layered querying materializes overlapping subsets of objects into a temporary table, and the overhead of doing so might be bigger than the savings from scanning through less data. You can also consider rechunking the image to make it easier to query.
- You want to extract maximum read performance from the image by using PostgreSQL indexes. After you check out the image, it becomes a set of normal PostgreSQL tables that you can create any index on.
- You intend to run multiple small queries against the dataset. sgr's query
planning has a considerable latency overhead over PostgreSQL's, so this kind of
workload will be more expensive than with native PostgreSQL tables.
- You are running OLTP-style queries (lots of single-row updates and reads).
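To illustrate the index-based workflow (schema, table, and column names are assumptions for the sake of the example), a non-layered checkout followed by index creation might look like:

```shell
# Check out the image into native PostgreSQL tables
$ sgr checkout splitgraph/2016_election:latest

# The tables are now ordinary PostgreSQL tables, so you can create
# any index on them. Table and column names are hypothetical.
$ sgr sql 'CREATE INDEX ON "splitgraph/2016_election".precinct_results (county)'
```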
You can find some Jupyter notebooks with benchmarks comparing layered querying and querying native PostgreSQL tables on GitHub.