Range and bloom indexes
Every Splitgraph object has an index in its metadata. This is distinct from
Postgres indexes, which you can create on checked out Splitgraph tables. sgr
indexes are higher-level and are used by layered querying to
determine which objects sgr
needs to download to satisfy a query.
The main idea behind sgr
indexes is that they give false positives but don't
give false negatives: they either say that an object might be pertinent to a
query (because it adds, deletes or modifies a row that is needed by that query)
or that an object definitely doesn't match a query, in which case it's not
downloaded or used at all.
Currently, sgr
supports range and bloom indexes.
Range indexes
A range index records the minimum and maximum values of every column in the given object. This is most immediately useful for speeding up queries that do comparisons on the primary key or columns that are dependent on it and whose ranges don't overlap between objects and is not going to help queries on columns that have overlapping ranges between objects.
Bloom indexes
sgr
allows the user to add an extra
bloom filter index on columns that
have unique values but still have their ranges overlap.
Bloom filters can use a small amount of space to make a judgement on whether a
set definitely doesn't or might have a certain value. The
re-indexing example shows adding a bloom filter index on
the county name to a dataset of precinct-level results of the 2016 US
Presidential Election, allowing sgr
to filter on county names with just a few
KB of filter data and a 1% false positive rate.
Adding indexes
Despite being stored in object metadata, indexes are added to tables, rather
than objects. This is because the index for an object also includes old values
that this object might have overwritten, so sgr
needs some context when
creating an object index.
Indexes can be added at commit time by passing --index-options
to
sgr commit
:
$ sgr commit namespace/repository --index-options \\
'{\\
"table_name": {\\
"bloom": {\\
"column_1": {\\
"probability": 0.01\\
}\\
},\\
"range": ["column_2", "column_3"] \\
}\\
}'
Re-indexing
You can re-index existing objects, adding new indexes to their metadata. Just
like with index creation, sgr
does this on tables, rather than on objects.
sgr
currently only supports re-indexing for objects that haven't been delta
compressed (no overlapping objects). Images with overlapping objects should be
rechunked first.
To re-index objects that a table consists of, use
sgr reindex
. For example:
$ sgr reindex -i '{"bloom": {"county_name": {"probability": 0.01}}}' splitgraph/2016_election:latest precinct_results
This will add the new index to the object metadata, keeping existing indexes.
Note that if you wish to push new object metadata after re-indexing, you'll need
to add --overwrite-object-meta
to your
sgr push
command. Pushed objects are not
overwritten by default.
Re-indexing example
Re-indexing and bloom filter example
Sample indexes
This is an object from the splitgraph/geonames
dataset
with a range index and some bloom indexes.
$ sgr object -r data.splitgraph.com ob729bf7d604891a1948140d082429638ae9e8a4041a6fc1c91bcae002c2fce
Object ID: ob729bf7d604891a1948140d082429638ae9e8a4041a6fc1c91bcae002c2fce
Namespace: splitgraph
Format: FRAG
Size: 5.09 MiB
Created: 2019-12-11 12:21:18.084483
Rows inserted: 100000
Insertion hash: 4888af1fca2d422116873c2004c69d5ac39c3a457564ae9c439d8dbb5f33c16a
Rows deleted: 0
Deletion hash: 0000000000000000000000000000000000000000000000000000000000000000
Column index:
cc2: ['AL', 'XK,MK']
dem: [-9999, 4667]
name: ['15th Er Eğitim Tugayı', '’Ržaničino']
latitude: ['39.08', '89']
timezone: ['Antarctica/Syowa', 'Europe/Zaporozhye']
asciiname: ["'Rzanicino", 'tahyna']
elevation: [-15, 4636]
geonameid: [712193, 813493]
longitude: ['10.44833', '55.11867']
population: [0, 38500000]
admin1_code: ['00', 'VO']
admin2_code: ['0', 'undefined = Kalininskiy Rayon']
admin3_code: ['0120', 'VTR31-17']
admin4_code: ['0149', '783524']
country_code: ['AL', 'XK']
feature_code: ['ADM1', 'WLL']
feature_class: ['A', 'V']
alternatenames: ["'Alipania,Al'banija,Albaani,Albaania,Albaaniya,Albaanje,Albainia,Albani,Albania,Albania - Shqiperia,Albania - Shqipëria,Albania nutome,Albanie,Albanien,Albanii,Albanija,Albanio,Albaniya,Albanië,Albanska,Albansko,Albanujo,Albanya,Albanìa,Albanía,Albanïi,Albenia,Albuanii,Albàinia,Albània,Albánia,Alb
ánie,Albánsko,Albânia,Albānija,Alibani,Alibaniya,Alubani,Alubaniya,Alvania,An Albain,An Albáin,An-ba-ni,An-ba-ni (Albania),Arbainiya,Arnautluk,Arnavutluk,Arnavutluk Cumhuriyeti,Lalbanaen,Lalbanän,Orileede Alubaniani,Orílẹ́ède Àlùbàníánì,People's Socialist Republic of Albania,Peoples Republic of Albania,People’s Socialist Republic of Albania,Republic of Albania,Republika Popullore Socialiste e Shqiperise,Republika Popullore Socialiste e Shqipërisë,Republika Popullore e Shqiperise,Republika Popullore e Shqipërisë,Republika e Shqiperise,Republika e Shqipërisë,Sheypeni,Shkipeni,Shkiperiya,Shqipenia,Shqiperi,Shqiperia,Shqipni,Shqipnia,Shqipnie,Shqipnija,Shqipnië,Shqipri,Shqipria,Shqiprija,Shqipëri,Shqipëria,Shqypeni,Shqypni,Shqypëni,a er ba ni ya,alabani'a,alabeniya,alabyaniya,albaneti,albania,albaniya,albany,albanya,albanyh,albeniya,alpeniya,arubania,arubania gong he guo,aൽbeniya,i-Albania,ʻAlipania,Αλβανία,Албания,Албанија,Албанія,Альбанія,Ալբա
նիա,אלבניה,آلبانی,آلبانیا,ألبانيا,ئالبانىيە,ئەڵبانیا,البانیا,البانیه,البانیہ,ܐܠܒܢܝܐ,अल्बानिया,अल्बेनिया,আলবেনিয়া,আলব্যানিয়া,আলব্যানিয়া,અલ્બેનિયા,ଆଲବାନିଆ,அல்பேனியா,అల్బేనియా,ಅಲ್ಬೇನಿಯಾ,അൽബേനിയ,ඇල්බේනියාව,ประเทศแอลเบเนีย,แอลเบเนีย,ແອລເບເນຍ,ཨལ་བཱ་ནི་ཡ།,ალბანეთი,አልባኒያ,ᎠᎵᏇᏂᏯ,អាល់បានី,アルバニア,アルバニア共和国,阿尔巴尼亚,알바니아", 'جنوویک کوسک
یلنے']
modification_date: ['1993-11-01', '2019-12-11']
Bloom index:
timezone: k=10, size 92.00 B, approx. 38 item(s), false positive probability 0.1%
asciiname: k=9, size 166.56 KiB, approx. 80129 item(s), false positive probability 0.2%
country_code: k=9, size 64.00 B, approx. 30 item(s), false positive probability 0.2%
feature_code: k=9, size 512.00 B, approx. 237 item(s), false positive probability 0.2%
feature_class: k=10, size 24.00 B, approx. 7 item(s), false positive probability 0.0%
By default, sgr
adds range indexes to all columns, but they're not very useful
here, since most of them span A-Z. However, this dataset also has a bloom filter
index on asciiname
and country_code
, amongst other columns, which lets us
filter on those columns without downloading the whole dataset.
$ sgr clone splitgraph/geonames
Gathering remote metadata...
Fetched metadata for 2 images, 1 table, 120 objects and 0 tags.
$ sgr checkout --layered splitgraph/geonames:latest
Checked out splitgraph/geonames:0b77a102cbab.
$ pgcli postgres://sgr:password@localhost:5432/splitgraph
Server: PostgreSQL 12.2 (Debian 12.2-2.pgdg100+1)
Version: 2.2.0
Chat: https://gitter.im/dbcli/pgcli
Home: https://pgcli.com
sgr@localhost:splitgraph> EXPLAIN SELECT * FROM "splitgraph/geonames".all_countries WHERE asciiname = 'Cambridge' AND country_code = 'GB';
+---------------------------------------------------------------------------------------------+
| QUERY PLAN |
|---------------------------------------------------------------------------------------------|
| Foreign Scan on all_countries (cost=20.00..209000000.00 rows=1100000 width=190) |
| Filter: (((asciiname)::text = 'Cambridge'::text) AND ((country_code)::text = 'GB'::text)) |
| Multicorn: Objects removed by filter: 109 |
| Multicorn: Scan through 11 object(s) (50.98 MiB) |
| JIT: |
| Functions: 1 |
| Options: Inlining true, Optimization true, Expressions true, Deforming true |
+---------------------------------------------------------------------------------------------+