Introduction
Splitfiles are similar to Dockerfiles: each command produces a new commit with a deterministic hash that depends on the current hash and the particulars of a command that's being executed.
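As a rough illustration of that chaining, the sketch below derives each hash from the parent hash plus the command text. The helper name next_image_hash, the use of SHA-256 and the all-zero starting hash are assumptions made for this example; they are not sgr's actual hashing scheme.

```python
import hashlib


def next_image_hash(current_hash: str, command: str) -> str:
    """Hypothetical helper: derive a new hash from the parent hash and the command."""
    payload = f"{current_hash}\n{command}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Assumed starting point for the example: an all-zero parent hash.
base = "0" * 64
first = next_image_hash(base, "SQL CREATE TABLE t (id integer)")
second = next_image_hash(first, "SQL INSERT INTO t VALUES (1)")

# Re-running the same command on the same parent gives the same hash,
# which is what makes a cached result reusable.
assert next_image_hash(base, "SQL CREATE TABLE t (id integer)") == first
```

Because each hash feeds into the next, changing an early command changes every hash after it, much like invalidating later layers in a Dockerfile build.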
Preprocessing
sgr does some quality-of-life preprocessing to the file before interpreting it, so that:
- Newlines can be escaped to make a command multiline ("\\n" gets replaced with "").
- Parameters are supported: ${PARAM} is replaced with the value of the parameter that's passed either to execute_commands as a dict or to sgr build on the command line as a series of arguments (-a key1 val1 -a key2 val2...).
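The two preprocessing steps above can be sketched in a few lines. This is a minimal illustration, not sgr's implementation: the function name preprocess, the parameter-name pattern, and the choice to raise an error on a missing parameter are all assumptions, as is reading the escaped newline as a backslash immediately followed by a line break.

```python
import re

# Assumed parameter syntax: ${NAME}, where NAME is letters, digits or underscores.
PARAM_RE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def preprocess(text: str, params: dict) -> str:
    """Hypothetical sketch: join escaped newlines, then substitute ${PARAM}."""
    # A backslash directly before a newline continues the command on the next line.
    text = text.replace("\\\n", "")

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in params:
            # Assumption for this sketch: a missing parameter is an error.
            raise KeyError(f"missing Splitfile parameter: {name}")
        return str(params[name])

    return PARAM_RE.sub(substitute, text)


source = "SQL CREATE TABLE ${TABLE} \\\n  AS SELECT 1\n"
processed = preprocess(source, {"TABLE": "result"})
```

Here the two physical lines in source collapse into one command and ${TABLE} becomes result.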
Supported commands
The Splitfile executor currently supports the following commands:
- IMPORT: import data from other Splitgraph images or Postgres schemata (including FDW mounts).
- SQL: run SQL statements referencing data in the current image or other Splitgraph images.
- FROM: derive images from other images or perform multistage builds.
It is also possible to add custom commands to the Splitfile executor. Custom commands follow the same caching rules as the built-in ones but are currently not supported by provenance tracking.
Note on Splitfile safety
The Splitfile executor validates SQL in IMPORT and SQL commands before
passing it to PostgreSQL for execution. This filters out most PostgreSQL syntax
constructs that Splitfiles cannot use, as well as references to system tables.
However, this shouldn't be relied on for security. In particular, Splitfile
validation is currently not available on Windows systems.
Always check Splitfiles before running them and check provenance of datasets you
pulled from the Internet (with sgr provenance --full) before running sgr rebuild,
as rebuilding runs SQL from the image's metadata, which could be arbitrary and malicious.
Repository lookups
Currently, a repository name is resolved as follows:
- See if it exists locally. If it does, try to pull it (to update) and use it for FROM/IMPORT commands.
- If not, see if it's specified in the SG_REPO_LOOKUP_OVERRIDE parameter, which has the format repo_1:user:pwd@host:port/db,repo_2:user:pwd@host:port/db... If it is, return the matching connection string directly without testing to see that the repository exists there.
- If not, scan the SG_REPO_LOOKUP parameter, which has the format user:pwd@host:port/db,user:pwd@host:port/db..., stopping at the first remote that has it.
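The lookup order above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the "LOCAL" return marker, and the exists_locally and remote_has callbacks are hypothetical stand-ins for checks that sgr performs internally, and the pull-to-update step for local repositories is left out.

```python
from typing import Callable, Dict, List, Optional


def parse_override(value: str) -> Dict[str, str]:
    """Parse 'repo_1:user:pwd@host:port/db,repo_2:...' into {repo: connection string}."""
    overrides = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # The repository name is everything before the first colon.
        repo, _, conn = entry.partition(":")
        overrides[repo] = conn
    return overrides


def parse_lookup(value: str) -> List[str]:
    """Parse 'user:pwd@host:port/db,user:pwd@host:port/db' into a list."""
    return [entry.strip() for entry in value.split(",") if entry.strip()]


def resolve(
    repo: str,
    exists_locally: Callable[[str], bool],
    remote_has: Callable[[str, str], bool],
    override: str = "",
    lookup: str = "",
) -> Optional[str]:
    """Hypothetical resolver mirroring the three steps described above."""
    if exists_locally(repo):
        return "LOCAL"  # marker used only in this sketch
    overrides = parse_override(override)
    if repo in overrides:
        # Returned as-is, without checking that the repository exists there.
        return overrides[repo]
    for conn in parse_lookup(lookup):
        if remote_has(conn, repo):
            return conn  # stop at the first remote that has it
    return None
```

For example, resolve("noaa/climate", lambda r: False, lambda c, r: False, override="noaa/climate:user:pwd@host:5432/db") returns "user:pwd@host:5432/db" without contacting any remote.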