Splitfiles
Splitfiles are inspired by Dockerfiles and are the simplest way to build Splitgraph images.
Here's an example Splitfile:
```
FROM splitgraph/geonames:latest IMPORT {
    SELECT
        -- Note we need to escape the closing curly brace here for the
        -- Splitfile executor.
        substring(alternatenames FROM ',([A-Z]{3\}),') AS iata_code,
        name,
        latitude,
        longitude,
        geonameid
    FROM all_countries
    WHERE feature_code = 'AIRP'
        AND country_code = 'US'
        AND substring(alternatenames FROM ',([A-Z]{3\}),') IS NOT NULL
} AS airports

SQL {
    CREATE TABLE us_flights_2009 AS
    SELECT
        origin_airport AS origin_iata,
        ao.name AS origin_name,
        ao.latitude AS origin_lat,
        ao.longitude AS origin_lon,
        destination_airport AS destination_iata,
        ad.name AS destination_name,
        ad.latitude AS destination_lat,
        ad.longitude AS destination_lon,
        SUM(passengers) AS total_passengers,
        SUM(seats) AS total_seats,
        SUM(flights) AS flights
    FROM "splitgraph/domestic_us_flights:latest".flights f
    JOIN airports ao ON f.origin_airport = ao.iata_code
    JOIN airports ad ON f.destination_airport = ad.iata_code
    WHERE EXTRACT('year' FROM fly_month) = 2009
    GROUP BY origin_airport, destination_airport,
        ao.name, ao.latitude, ao.longitude,
        ad.name, ad.latitude, ad.longitude
}
```
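You run a Splitfile by passing it to the `sgr` command-line client. As a minimal sketch (the file name `airports.splitfile` and the output repository name `airport_flights` are assumptions, not part of the example):

```bash
# Build the Splitfile above into a new image in the airport_flights repository.
# The file and repository names here are illustrative.
sgr build airports.splitfile -o airport_flights
```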
Splitfiles offer several advantages over ad hoc queries or other tools:
- Dockerfile-like caching: every Splitfile command defines a layer. If the data sources and the command itself haven't changed, `sgr` does not need to rebuild the image.
- Referencing other datasets: Splitfiles can import data from other Splitgraph images (using the `FROM ... IMPORT ...` command) or perform joins on other datasets (by referencing them in the `SQL` command as schemata).
- Provenance tracking: when you build an image with a Splitfile, its metadata includes the sources and commands used to build it. You can inspect an image to know exactly where its data came from and quickly rebuild it when the upstream data changes (see the sketch after this list).
- Full SQL support: `sgr` only does some minor rewriting and validation of the SQL in Splitfiles and defers to PostgreSQL for running the actual queries, which means that all SQL constructs are supported.
- Efficient data imports: the Splitfile executor uses layered querying to import data, which lets it satisfy queries against huge remote datasets by downloading only a few fragments.
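For example, provenance tracking and rebuilding surface through the `sgr` CLI roughly as follows (a sketch; the `airport_flights` repository name carries over from the assumed build command above):

```bash
# Show the source images and commands this image was built from.
sgr provenance airport_flights:latest

# Rebuild the image against the latest versions of its upstream datasets;
# layers whose sources and commands are unchanged are reused from the cache.
sgr rebuild airport_flights:latest --update
```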
See the Splitfile introduction or the Splitfile reference for more information on using Splitfiles to build datasets.