Frequently Asked Questions
Do I have to use Splitgraph to use sgr?
No. While we use some parts of sgr to power Splitgraph, sgr is a self-contained, standalone tool. You can use it in a decentralized way, sharing data between two sgr engines like you would with Git. Here's an example of getting two sgr instances to synchronize with each other. It is also possible to push data to S3-compatible storage (like Minio).
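As a rough sketch of that Git-like flow (the repository name is a placeholder, remotes are assumed to already be configured in .sgconfig, and the exact flags may differ from the linked example):

```bash
# Pull an image from a remote sgr engine and materialize it locally.
sgr clone example/repo
sgr checkout example/repo:latest

# Make changes with any PostgreSQL client, then snapshot and share them.
sgr commit example/repo
sgr push example/repo
```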
Do I have to download sgr to use Splitgraph?
No. While Splitgraph is a sgr peer, letting you push and pull data between it and your local sgr instance, a lot of its functionality doesn't require you to download sgr.
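For example, you can query datasets hosted on Splitgraph with nothing but a PostgreSQL client; the endpoint, credentials and names below are placeholders to illustrate the idea:

```bash
# Query a Splitgraph repository over a plain PostgreSQL connection,
# with no local sgr installation. Credentials and names are placeholders.
psql "postgresql://USERNAME:API_KEY@data.splitgraph.com:5432/ddn" \
  -c 'SELECT * FROM "namespace/repository".some_table LIMIT 10;'
```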
Is sgr a PostgreSQL extension?
Not quite. The sgr engine ships as a Docker image and is a customized version of PostgreSQL that is fully compatible with existing clients. In the future, we might repackage sgr as a PostgreSQL extension.
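In practice, the simplest way to run the engine is to let the sgr CLI manage the container for you; this is a minimal sketch, and the prompts and output may vary between versions:

```bash
# Pull the engine's Docker image and start it as a managed container.
sgr engine add

# List managed engines and their status.
sgr engine list
```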
Can I add sgr to my existing PostgreSQL deployment?
While it is possible to add sgr to existing PostgreSQL deployments, there isn't currently a simple installation method. If you're interested in doing so, you can follow the instructions in the Dockerfile used to build the engine or contact us.

You can also add the sgr engine as a PostgreSQL logical replication client, which will let you ingest data from existing databases without installing sgr on them.
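On the source database, this relies on stock PostgreSQL logical replication, so the setup on that side looks roughly like the following (all names are placeholders, and the sgr-side subscription setup is not shown here):

```bash
# On the existing (source) PostgreSQL database, not the sgr engine:
# publish the tables you want to replicate. All names are placeholders.
psql -h source-db.example.com -U admin -d sourcedb \
  -c "CREATE PUBLICATION sgr_ingestion FOR TABLE public.orders, public.customers;"

# The source also needs wal_level = logical in postgresql.conf (requires a restart).
```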
Does my data have to be in PostgreSQL to use sgr?
With mounting, you can query data in other databases (including MongoDB, MySQL, PostgreSQL or Elasticsearch) directly through Splitgraph with any PostgreSQL client. You do not need to copy your data into PostgreSQL to use sgr.
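As an illustration, mounting a schema from another PostgreSQL database might look like this (connection details and options are placeholders, and the option keys differ per foreign data wrapper):

```bash
# Mount a remote PostgreSQL schema into the engine under the name "staging_data".
# Handlers also exist for MongoDB, MySQL and Elasticsearch.
sgr mount postgres_fdw staging_data -c user:password@otherdb.example.com:5432 \
  -o '{"dbname": "sourcedb", "remote_schema": "public"}'

# The mounted tables can now be queried like any other schema on the engine.
sgr sql 'SELECT COUNT(*) FROM staging_data.some_table'
```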
Can I use sgr with my existing tools?
Yes. Any PostgreSQL client can query Splitgraph repositories (directly or through layered querying), including DataGrip, pgAdmin, pgcli and DBeaver.

In addition, sgr can enhance a lot of existing applications and extensions that work with PostgreSQL. We have examples of using sgr with Jupyter notebooks, PostGIS, PostgREST, dbt or Metabase.
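Connecting a client is just a matter of pointing it at the engine; the connection string below uses typical local defaults, which may differ in your setup:

```bash
# Connect to a local sgr engine with pgcli (or any other PostgreSQL client).
# Username, password, port and database name shown are assumed local defaults.
pgcli postgresql://sgr:supersecure@localhost:5432/splitgraph

# Checked-out images appear as ordinary schemas, e.g.
#   SELECT * FROM "example/repo".some_table LIMIT 10;
```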
What's the performance like? Do you have any benchmarks?
We maintain a couple of Jupyter notebooks with benchmarks on our GitHub.
It's difficult to define what counts as a benchmark for sgr, since for a lot of operations one would really be benchmarking PostgreSQL itself. This is why we haven't run benchmarks like TPC-DS on sgr (for maximum performance, it's easy to check out a Splitgraph image into a PostgreSQL schema; see the sketch after the list below) but have instead measured the overhead of various sgr workloads over plain PostgreSQL.
In short:
- Committing and checking out Splitgraph images takes slightly less time than writing the same data to PostgreSQL tables (sgr moves data directly between PostgreSQL tables without query parsing overhead).
- Writing to PostgreSQL tables that are change-tracked by sgr is almost 2x slower than writing to untracked tables (sgr uses audit triggers to record changes rather than diffing the table at commit time).
- Splitgraph images take up much less space (5x-10x) than equivalent PostgreSQL tables, since they are stored using cstore_fdw.
- Querying Splitgraph images directly without checkout (layered querying) can sometimes be faster and use less I/O than querying PostgreSQL tables.
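For reference, the "maximum performance" path mentioned above, materializing an image into an ordinary PostgreSQL schema, is a single command (the repository name is a placeholder):

```bash
# Materialize an image into a plain PostgreSQL schema; subsequent queries
# run against ordinary PostgreSQL tables with no sgr overhead.
sgr checkout example/repo:latest
```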
Can sgr be used for big datasets?
Yes. sgr has a few optimizations that make it suitable for working with large datasets:

- Datasets are partitioned into fragments stored in a columnar format, which is superior to row-oriented storage for OLAP workloads.
- You can query Splitgraph images without checking them out or even downloading them completely. With layered querying, sgr can lazily download just the small fraction of the table needed for the query, which is still completely seamless to the client application (see the sketch at the end of this answer).
Since sgr is built on top of PostgreSQL, you can use the same methods for horizontally scaling a PostgreSQL deployment to scale a sgr engine.
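As a minimal sketch of layered querying (assuming the --layered flag on checkout and a placeholder repository name):

```bash
# Register the image for layered querying instead of materializing it;
# only the fragments a query actually touches get downloaded.
sgr checkout --layered example/repo:latest
sgr sql 'SELECT count(*) FROM "example/repo".some_table'
```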