splitgraph.yml reference
Mixins
You can optionally split up your splitgraph.yml file into multiple files, similar to Docker Compose's override functionality. This allows you, for example, to keep credentials separate from the repository definitions and not check them into source control, or to inject them at runtime using your CI platform's secrets functionality.
To reference multiple files, pass several -f flags to sgr cloud commands that expect a splitgraph.yml file:
sgr cloud load -f splitgraph.yml -f splitgraph.credentials.yml
You can also output the full merged configuration by running sgr cloud validate.
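For example, to print the merged result of the two files from the load example above (the file names are the same illustrative ones):
sgr cloud validate -f splitgraph.yml -f splitgraph.credentials.yml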
Note that currently, each separate file has to be a self-contained valid project. This means that in some cases, you will need to repeat the same configuration for a repository in multiple files (for example, when overriding repository parameters).
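As an illustration, a hypothetical split might keep the repository definition in splitgraph.yml and the credentials in splitgraph.credentials.yml (all names and values below are made up):
# splitgraph.yml: repository definitions, safe to check into source control
repositories:
  - namespace: my-namespace
    repository: my-repository
    external:
      plugin: csv
      credential: csv # references the credential defined in splitgraph.credentials.yml
      params:
        connection:
          connection_type: s3
          s3_endpoint: s3.example.com
          s3_bucket: my-bucket
      tables: {}

# splitgraph.credentials.yml: credentials, kept out of source control
credentials:
  csv:
    plugin: csv
    data:
      s3_access_key: ""
      s3_secret_key: ""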
splitgraph.yml format reference
credentials
This section defines credentials that are referenced by specific data source plugins in the repositories section.
Example:
credentials:
  csv: # This is the name of this credential that "external" sections can reference.
    plugin: csv
    # Credential-specific data matching the plugin's credential schema
    data:
      s3_access_key: ""
      s3_secret_key: ""
.<credential_name>.plugin
ID of the plugin this credential is for. You can't reuse a credential from one plugin in another, but you can reuse credentials between different repositories that use the same plugin.
.<credential_name>.data
Credential-specific data. This must match the plugin's credentials JSONSchema.
You can use sgr cloud stub to generate a value for this section that the plugin will accept.
repositories
This section defines a list of repositories to add/update in Splitgraph, as well as their metadata (README, topics, dataset license etc.) and data source settings (plugin, connection parameters, ingestion schedule).
repositories[*].namespace
Namespace name to set up this repository in.
repositories[*].repository
Name of the repository.
repositories[*].external
Defines configuration for an "external", that is, the external data source settings for a given repository. This section is optional.
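Putting the fields documented below together, an external section for the csv plugin might look like this sketch (all values are illustrative):
external:
  plugin: csv
  credential: csv # name of an entry in the credentials section
  is_live: true
  params:
    connection:
      connection_type: http
      url: "https://example.com/data.csv"
  tables: {} # let the plugin introspect the available tables
  schedule:
    schedule: "0 */6 * * *"
    enabled: true
  tunnel: false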
.credential_id
UUID of the credential for this plugin to reference. Must be already set up on Splitgraph in a previous run. This field is output by sgr cloud dump and is usually not useful if you're writing a splitgraph.yml file from scratch.
.credential
Name of the credential for this plugin to reference, if it requires credentials. You must either:
- define this credential in the credentials section (required for sgr cloud sync), or
- have this named credential already set up on Splitgraph in a previous run, using sgr cloud load or through the GUI.
.is_live
Whether to enable live querying for plugins like postgres, snowflake, elasticsearch and csv that are based on foreign data wrappers and support it. If this is enabled, Splitgraph will create a "live" tag in this repository that you will be able to reference to query data at source without loading it.
.plugin
ID of the plugin used by this repository, for example, dbt or snowflake. To list all available plugins, run sgr cloud plugins.
.params
Plugin-specific parameters that apply to the whole repository. Must match the plugin's JSONSchema. Like with the credentials section, sgr cloud stub generates a sample value for this field.
Example:
params:
  connection: # Choose one of:
  - connection_type: http # REQUIRED. Constant
    url: "" # REQUIRED. HTTP URL to the CSV file
  - connection_type: s3 # REQUIRED. Constant
    s3_endpoint: "" # REQUIRED. S3 endpoint (including port if required)
    s3_bucket: "" # REQUIRED. Bucket the object is in
    s3_region: "" # Region of the S3 bucket
    s3_secure: false # Whether to use HTTPS for S3 access
    s3_object: "" # Limit the import to a single object
    s3_object_prefix: "" # Prefix for object in S3 bucket
  autodetect_header: true # Detect whether the CSV file has a header automatically
  autodetect_dialect: true # Detect the CSV file's dialect (separator, quoting characters etc) automatically
  autodetect_encoding: true # Detect the CSV file's encoding automatically
  autodetect_sample_size: 65536 # Sample size, in bytes, for encoding/dialect/header detection
  schema_inference_rows: 100000 # Number of rows to use for schema inference
  encoding: utf-8 # Encoding of the CSV file
  ignore_decode_errors: false # Ignore errors when decoding the file
  header: true # First line of the CSV file is its header
  delimiter: "," # Character used to separate fields in the file
  quotechar: '"' # Character used to quote fields
If sgr cloud stub outputs a list of options with a "Choose one of" comment, you should fill out one of the items in the list. For example:
params:
  connection:
    connection_type: s3 # REQUIRED. Constant
    s3_endpoint: "" # REQUIRED. S3 endpoint (including port if required)
    s3_bucket: "" # REQUIRED. Bucket the object is in
    s3_region: "" # Region of the S3 bucket
    s3_secure: false # Whether to use HTTPS for S3 access
    s3_object: "" # Limit the import to a single object
    s3_object_prefix: "" # Prefix for object in S3 bucket
  autodetect_header: true
  # ...
.tables
Tables to be created in the repository by ingestion jobs and in the "live" tag if is_live is enabled.
Instead of listing the tables explicitly, you can set this to {} (an empty dictionary). This makes the plugin introspect the available tables when you run sgr cloud load or sgr cloud sync. In addition, you can run sgr cloud dump to output the current settings, including the inferred tables and their schemas.
.tables.<table_name>
Settings for a given table.
.tables.<table_name>.options
Plugin-specific parameters that apply to the table, matching the plugin's table JSONSchema. Depending on the plugin, they might be separate parameters that only apply to a table or an override of global repository parameters.
Example for the csv plugin:
options:
  url: "" # HTTP URL to the CSV file
  s3_object: "" # S3 object of the CSV file
.tables.<table_name>.schema
Schema of the table (description of columns and their types). Note that currently, a lot of plugins don't support overriding the column names and schemas.
.tables.<table_name>.schema[*].name
Column name, 63 characters or fewer. You can use any characters here, but since Splitgraph uses PostgreSQL, lowercase ASCII names with underscores instead of spaces work best for querying, as you don't have to quote them.
.tables.<table_name>.schema[*].pg_type
Type of the column (see the [PostgreSQL documentation](https://www.postgresql.org/docs/current/datatype.html) for reference). This only works for plugins that support live querying, as they are backed by PostgreSQL foreign data wrappers, letting PostgreSQL cast them at runtime.
.tables.<table_name>.schema[*].comment
Comment on the column.
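As a sketch, a tables entry for the csv plugin could look like the following (the table name, options and columns are illustrative; sgr cloud dump outputs the actual inferred values for your repository):
tables:
  my_table:
    options:
      url: "https://example.com/data.csv" # HTTP URL to the CSV file
    schema:
      - name: id
        pg_type: integer
      - name: event_name
        pg_type: character varying
        comment: Human-readable name of the event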
.schedule
Run ingestion for this data source on a schedule. This creates a new "image" in the repository on every run. This is only required if you're using Splitgraph to schedule and orchestrate your ingestion jobs. As an alternative, you can run sgr cloud sync from GitHub Actions or GitLab CI to trigger Splitgraph jobs on a schedule and track their state.
Example:
schedule:
  schedule: "0 */6 * * *"
  enabled: true
.schedule.schedule
Schedule to run ingestion on, in the Cron format. Only one ingestion job for a given repository can be running at a time: if a job is still in progress when the next run is due, the scheduler will wait until the first job finishes before starting the next one.
.schedule.enabled
Flag to enable/disable the ingestion job.
.tunnel
Flag to indicate that the external data source must be accessed through a network tunnel (false by default). See the tunneling documentation for details.
repositories[*].metadata
This section defines various catalog attributes of the repository that aren't relevant to ingestion but are useful for discoverability and organizing your dataset. Splitgraph will display these on the repository's overview page.
Example:
metadata:
  topics:
    - analytics
    - raw
    - postgres
    - normalization:none
  description: Raw analytics data
  sources:
    - anchor: Internal company wiki
      href: https://www.example.com/wiki/data-stack/postgres
  extra_metadata:
    data_source:
      source: Postgres
      normalization: none
  readme:
    text: |
      ## Raw data for analytics
      Sample README for a dataset
.readme
Main body of documentation for the dataset. You can use Markdown formatting. This can be a file path or a dictionary with a README string (see below).
Example:
readme:
  text: |
    ## Raw data for analytics
    Sample README for a dataset
.readme.file
Path to a file. This is the format produced by sgr cloud dump. sgr cloud commands prepend ./readmes to this path when dumping or loading files. To point this path to the README in the repository's root:
- make a readmes directory with an empty .gitkeep file in it
- set .readme.file to ../README.md
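The resulting section would then look like this:
readme:
  file: ../README.md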
.readme.text
Multiline string with the inline README.
.description
Short description of the repository.
.topics
List of arbitrary topics for this repository. Adding topics here lets you filter on them in the catalog and on the search page.
.sources
List of sources for this dataset. The records here will show up in a special section at the top of the overview page.
Example:
sources:
  - anchor: Name of the source
    href: https://www.example.com
    isCreator: false
    isSameAs: false
This section is also used to populate the schema.org metadata on the repository overview page. In particular, the isCreator and isSameAs flags populate the schema.org creator and sameAs properties, respectively.
.license
Freeform text for the license/restrictions on this dataset, rendered at the top of the overview page.
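For example (the value is freeform; this one is illustrative):
license: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)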
.extra_metadata
Arbitrary key-value metadata for this repository. This must have two levels of nesting. Example:
extra_metadata:
  data_source:
    source: Postgres
    normalization: none
  internal:
    creator: Some Person
    department: Some Department