This looks extremely cool. This is basically incremental view maintenance in databases, a problem that almost everybody (I think) has when using SQL databases and wanting to do some derived views for more performant access patterns. Importantly, they seem to support a wide breath of SQL operators, support spilling computation state to disk, and it's open-source! Interestingly, it compiles queries to Rust, so an approach similar to Redshift (which compiles queries to C++ programs).
There's already a bunch of tools in this area:
1. Materialize[0], which afaik is more big-data oriented, and doesn't pipe the results back to your database, instead storing results in S3 and serving them.
2. Epsio[1], which I've never used, seems to be very similar to this product, but is closed-source only.
3. When building OctoSQL[2], this capability was also important to me and it was designed from ground up to support it. Though in practice in a tool like OctoSQL it's pretty useless (was a fun problem to solve though).
There's some things I'm curious about:
- Does it handle queries that involve complex combinations of ordering with limits in subqueries? If due to a change in an underlying table a top-n row is added, resulting in moving other rows around (and removing the current n'th) will the subsequent query parts behave as though the order was maintained when computing it, or will it fall apart (imagine a select with limit from a select with bigger limit)?
- Is it internally consistent[3]? They say it's "strongly consistent" and "It also guarantees that the state of the views always corresponds to what you'd get if you ran the queries in a batch system for the same input." so I think the answer is yes, but this one's really important.
Either way, will have to play with this, and dig into the paper (the link in the repo doesn't work, here's an arXiv link[4]). Wishing the creators good luck, this looks great!
"Incremental view maintenance has been for a long time a central
problem in database theory. Many solutions have been proposed for restricted classes of database languages, such as the relational algebra, or Datalog. These techniques do not naturally generalize to richer languages. In this paper we give a general solution to this problem in 3 steps: (1) we describe a simple but expressive language called DBSP for describing computations over data
streams; (2) we give a general algorithm for solving the incremental view maintenance problem for arbitrary DBSP programs, and
(3) we show how to model many rich database query languages
(including the full relational queries, grouping and aggregation,
monotonic and non-monotonic recursion, and streaming aggregation) using DBSP. As a consequence, we obtain efficient incremental view maintenance techniques for all these rich languages."
- We evaluate top-k queries incrementally and the nesting shouldn't be a problem for the engine (or it'd be a bug). If you have an example of a query, we can try it out at our end.
Our guarantee is that we always produce the same answer as if you'd ran the queries in a batch system. All views update together. You can see the computation model here: https://www.feldera.com/blog/synchronous-streaming/
Make a table with one column, x, and insert into it rows with values 1-5, and then 8-20.
Then query it using more or less `SELECT x FROM (SELECT x FROM xs LIMIT 15 ORDER BY x) LIMIT 10`, and then insert 6 into the table. Output should be 1-6, 8-11. Of course as long as the limits aren't merged together during optimisation, that would make the test-case moot.
CREATE TABLE foo (x INTEGER NOT NULL PRIMARY KEY) WITH ('materialized' = 'true') ;
CREATE MATERIALIZED VIEW bar AS SELECT x FROM (SELECT x FROM foo ORDER BY x LIMIT 15) LIMIT 10;
I think Rama [1] (by Nathan Marz behind Apache Storm) is interesting as a "NoSQL" solution for a similar problem space, as I understand it. Impressive if this can support similar scale using only SQL.
There's already a bunch of tools in this area:
1. Materialize[0], which afaik is more big-data oriented, and doesn't pipe the results back to your database, instead storing results in S3 and serving them.
2. Epsio[1], which I've never used, seems to be very similar to this product, but is closed-source only.
3. When building OctoSQL[2], this capability was also important to me and it was designed from ground up to support it. Though in practice in a tool like OctoSQL it's pretty useless (was a fun problem to solve though).
There's some things I'm curious about:
- Does it handle queries that involve complex combinations of ordering with limits in subqueries? If due to a change in an underlying table a top-n row is added, resulting in moving other rows around (and removing the current n'th) will the subsequent query parts behave as though the order was maintained when computing it, or will it fall apart (imagine a select with limit from a select with bigger limit)?
- Is it internally consistent[3]? They say it's "strongly consistent" and "It also guarantees that the state of the views always corresponds to what you'd get if you ran the queries in a batch system for the same input." so I think the answer is yes, but this one's really important.
Either way, will have to play with this, and dig into the paper (the link in the repo doesn't work, here's an arXiv link[4]). Wishing the creators good luck, this looks great!
[0]: https://materialize.com
[1]: https://www.epsio.io
[2]: https://github.com/cube2222/octosql
[3]: https://www.scattered-thoughts.net/writing/internal-consiste...
[4]: https://arxiv.org/pdf/2203.16684