It seems like ChatGPT and similar large language models are appearing everywhere these days, from web search to coding, even in the public sector. One reason language models are so appealing is their ability to succinctly summarize and answer questions about large amounts of information such as Wikipedia articles.
That got us thinking: what if we could use a language model to answer questions about data in the Splitgraph DDN? With over 40,000 public repositories, the DDN has a wealth of information available to anyone who can write SQL queries. It turns out OpenAI's GPT is surprisingly good at writing SQL. Does this mean Splitgraph can be used conversationally instead of writing queries? The answer is yes, with a few important caveats. We successfully built a proof-of-concept (available on GitHub), which can answer questions such as "How many public libraries operate in Chicago?" by prompting GPT to generate SQL queries which are then executed on Splitgraph. Read on for the details!
In isolation, language models are only capable of generating or completing text. In order to interact with an end user the way ChatGPT does or to use external data sources like Bing's web search index, OpenAI's GPT must be integrated with other systems.
The open source LangChain project aims to simplify the development of language-model powered applications. Using LangChain, developers create pipelines (called "chains") which send text prompts to — and receive generated text from — language models.
We're not the first to try getting an AI to write SQL queries for us: LangChain includes an "SQL Chain" designed specifically for this purpose. The SQL Chain demo shows that GPT is pretty good at figuring out which tables to use and at composing a query that answers the question in the prompt. The question is: can it scale to meet Splitgraph's needs?
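To give a feel for it, here's roughly what the SQL Chain demo boils down to. The Chinook sample database and the class names follow LangChain's 0.0.x-era demo and may have changed in newer releases:

```python
# A minimal sketch of LangChain's SQL chain, following its official demo.
# The Chinook SQLite sample database and the 0.0.x-era class names are
# taken from that demo; newer LangChain releases may differ.
from langchain.llms import OpenAI
from langchain.sql_database import SQLDatabase
from langchain.chains import SQLDatabaseChain

db = SQLDatabase.from_uri("sqlite:///Chinook.db")
llm = OpenAI(temperature=0)

db_chain = SQLDatabaseChain(llm=llm, database=db, verbose=True)
db_chain.run("How many employees are there?")
```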
GPT stands for Generative Pre-trained Transformer. Without diving into too much detail, transformers are a type of artificial neural network that processes its entire input all at once. Since their introduction in a 2017 paper by Google, transformers have taken the world by storm.
But processing all input "at once" would require arbitrarily large neural nets for arbitrarily large input. For this reason, there is an upper bound on how large the input can be, called the maximum context length.
To process longer documents, we must first segment them into text fragments that stay within this size limit. The language model can only operate on a single fragment at a time. The LangChain SQL demo uses a database containing about a dozen tables, and the entire database schema fits into the LLM's prompt without reaching the maximum context length, so a single text fragment is enough!
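When a document does need splitting, LangChain's text splitters handle the segmentation. Here's a minimal sketch; the file name and chunk sizes are arbitrary choices on our part:

```python
# Segmenting a long document into fragments small enough for the model's
# context window. The file name and chunk sizes are illustrative only.
from langchain.text_splitter import RecursiveCharacterTextSplitter

long_document = open("repository_description.md").read()  # placeholder file

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
fragments = splitter.split_text(long_document)

# Each fragment can now be embedded or prompted on independently.
print(f"{len(fragments)} fragments, longest is {max(map(len, fragments))} characters")
```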
With 40,000 repositories, each potentially containing several tables, there is no way to fit the DDN's contents into OpenAI's 4,097-token context window (tokens roughly correspond to words).
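As an aside, you can measure token usage yourself with OpenAI's tiktoken library; the model name below is our assumption for illustration:

```python
# Counting tokens with tiktoken to see how much of the context window a
# piece of text uses. The model name is an assumption for illustration.
import tiktoken

encoding = tiktoken.encoding_for_model("text-davinci-003")
table_schema = 'CREATE TABLE "libraries_2020_circulation_by_location" (location text, ...);'
print(len(encoding.encode(table_schema)), "tokens")
```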
Fortunately, having a huge collection of documents doesn't rule out using language models; after all, Google and Microsoft have integrated language models with their entire web search indexes.
Retrieval-Augmented Generation (RAG) is a technique for using LLMs with large collections of documents. It uses a two-step process to answer a user's question:

1. Search the document collection for the documents most relevant to the question.
2. Pass those documents to the LLM along with the question and ask it to generate an answer based on them.
The trick is that the document search in step 1 doesn't require all the documents (or their summaries) to fit into the LLM's context. Instead, search is performed by computing the vector similarity of the embeddings for the prompt and each document.
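Put together, the whole retrieval-augmented flow looks roughly like the sketch below; both helper functions are hypothetical placeholders standing in for the retrieval and generation steps:

```python
# A rough outline of Retrieval-Augmented Generation. Both helpers are
# hypothetical placeholders, not real LangChain or OpenAI functions.
def answer_question(question: str) -> str:
    # Step 1: retrieve the documents whose embeddings are closest to the
    # embedding of the question (a vector similarity search).
    relevant_docs = find_similar_documents(question, limit=3)

    # Step 2: put the retrieved documents into the prompt and ask the LLM
    # to answer the question based only on that context.
    context = "\n\n".join(relevant_docs)
    prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
    return call_llm(prompt)
```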
An embedding is a vector of floating-point numbers generated by the transformer for a given input. The input can be a single token or a longer text, as long as it doesn't exceed the maximum context length.
OpenAI's models use 1536-dimensional embeddings. In a sense, an embedding model is like a hash function which computes a list of floats for any text. Inputs whose embedding vectors are close together generally refer to similar concepts.
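For illustration, computing and comparing embeddings with the pre-1.0 OpenAI Python client looks something like this; the cosine similarity helper is our own:

```python
# Computing embeddings with the pre-1.0 OpenAI Python client and comparing
# them with cosine similarity. text-embedding-ada-002 returns 1536 floats.
import numpy as np
import openai

def embed(text: str) -> np.ndarray:
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(response["data"][0]["embedding"])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embed("public libraries"), embed("library branches")))
```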
There are several vector databases which can be used to efficiently find the closest document embeddings to a given prompt's embedding. Fortunately, we don't have to replace SQL, as the pgvector extension turns PostgreSQL into a vector database.
Needless to say, to compute the vector distance of document embeddings, we need to know the embedding of each document, or rather of each document fragment that fits within the maximum context size. So before the set of all repositories can be queried, they must be indexed.
The indexing code is fairly simple. Using the GraphQL API used by the Splitgraph frontend, it's easy to get the data for each repository in a namespace. Since each repository's README is written in Markdown, the entire description of the repository, including the table schemas, is generated in Markdown as well. LangChain has document loaders for lots of different text formats, including Markdown.
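A condensed, simplified sketch of what this indexing step can look like is below; the GraphQL helper, field names and chunk sizes are placeholders rather than the exact code in the repo:

```python
# Simplified sketch of the indexing step. The GraphQL helper, field names
# and chunk sizes are illustrative placeholders, not the exact demo code.
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import MarkdownTextSplitter

def fetch_repositories(namespace):
    """Placeholder for the GraphQL call used by the Splitgraph frontend."""
    return []  # in practice: a list of dicts describing each repository

splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=100)
embeddings = OpenAIEmbeddings()

for repo in fetch_repositories("cityofchicago"):
    description = repo["readme"] + "\n\n" + repo["table_schemas"]  # Markdown
    for fragment in splitter.split_text(description):
        vector = embeddings.embed_query(fragment)  # 1536 floats
        # store (repository, fragment, vector) in PostgreSQL via pgvector
```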
OpenAI recommends using cosine similarity when comparing embeddings, which pgvector supports with the <=> operator on the vector type.
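Here's a hedged example of what such a similarity search can look like; the table and column names are made up for illustration and are not the demo's actual schema:

```python
# Finding the document fragments closest to the prompt's embedding using
# pgvector's cosine distance operator (<=>). The table and column names
# are illustrative, not the demo's actual schema.
import psycopg2

prompt_embedding = [0.0] * 1536  # in practice: the embedding of the user's question
embedding_literal = "[" + ",".join(str(x) for x in prompt_embedding) + "]"

conn = psycopg2.connect("postgresql://localhost/embeddings")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT repository, fragment
        FROM document_embeddings
        ORDER BY embedding <=> %s::vector
        LIMIT 5;
        """,
        (embedding_literal,),
    )
    closest_fragments = cur.fetchall()
```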
Once GPT generates an SQL query, it must be executed on the DDN.
Splitgraph implements the PostgreSQL network protocol, but some of PostgreSQL's introspection tables aren't available.
Normally, LangChain would read the database schema from these tables; since they're unavailable, the demo code uses the table schemas returned by the GraphQL API instead.
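Executing the generated query is then a matter of connecting with any standard PostgreSQL client. The host and database name below follow Splitgraph's documented connection string; the API key and secret are placeholders:

```python
# Running the LLM-generated SQL against the Splitgraph DDN over the
# standard PostgreSQL protocol. Replace the API key/secret placeholders
# with real Splitgraph credentials.
import psycopg2

generated_sql = "SELECT 1;"  # in practice: the query produced by GPT

conn = psycopg2.connect(
    host="data.splitgraph.com",
    port=5432,
    dbname="ddn",
    user="<api-key>",
    password="<api-secret>",
)
with conn.cursor() as cur:
    cur.execute(generated_sql)
    print(cur.fetchall())
```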
During testing, we only tried querying repositories from one namespace, cityofchicago. Occasionally, the SQL query returned by GPT is invalid, and for tables with lots of columns the prompt can grow beyond the maximum context size. Still, overall, the generated queries are surprisingly good. Some examples:
> python3 query.py cityofchicago 'Which residential neighborhood of Chicago is the most expensive?'
SQL query generated by LLM:
SELECT
location_address,
zip_code,
units,
property_name,
property_type,
community_area,
address,
community_area_number
FROM
"cityofchicago/affordable-rental-housing-developments-s6ha-ppgi"."affordable_rental_housing_developments"
ORDER BY
units ASC
LIMIT
5;
> python3 query.py cityofchicago 'How many public libraries operate in Chicago?'
SQL query generated by LLM:
SELECT
COUNT(*)
FROM
"cityofchicago/libraries-2020-circulation-by-location-v5qm-a67f"."libraries_2020_circulation_by_location"
One interesting thing about LLMs is that —depending on the temperature setting— their output can differ significantly even if the input doesn't change.
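For SQL generation you typically want the temperature low (for example 0), so the output is as deterministic as possible; with the LangChain wrapper that's a single parameter:

```python
# Temperature controls how much randomness the model uses when sampling
# its output: 0 is as deterministic as possible, while higher values
# produce more varied (and, for SQL generation, more error-prone) output.
from langchain.llms import OpenAI

deterministic_llm = OpenAI(temperature=0)
creative_llm = OpenAI(temperature=0.9)
```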
Large language models are capable of performing many tasks which were previously hard to automate, including the translation of natural language questions into valid SQL queries. The LLM-generated SQL isn't always perfect, but it works surprisingly well surprisingly often.
Check out the code on GitHub. If you're excited about using Splitgraph with LLMs, let us know!