Interview with Nikolas Göbel about Declarative Differential Dataflows (2018)
In this transcript recorded October 2018, Nikolas Göbel and Dustin Getz talk about Incremental Datalog with Differential Dataflows. Nikolas Göbel is the lead developer of declarative-dataflow, a library and server providing a Datalog-inspired interface on top of Differential Dataflow. Dustin Getz is the founder of Hyperfiddle, a new way for Clojure programmers to make web dashboards for Datomic.
The first minute of video and transcript is missing, so if you're lost, glance at Nikolas's high level overview: Incremental Datalog with Differential Dataflows. It's like Datomic Datalog but incremental and suitable for real-time data streams.
Nikolas Göbel: Looking at how to get this to scale to something and they also fiddled around with these kinds of handmade reverse indices and some kind of production-rule-style optimizations and doing it in a smart order and something like that. But yeah, I think differential right from the get go, there's still a lot of room for optimization in a web use case with many concurrent users, but from the point of view of updating these things like a query and with low latency and high throughput, it's very general purpose and it supports all of the cool Datalog stuff, so mutual recursion, recursive rules, all these kinds of things.
Datomic query parity?
Dustin Getz: Is there anything it can't do?
Nikolas Göbel: Let me think. What it can't do, for example, is user-defined functions, so you don't get them ... because it's not a Clojure runtime or anything, so you can't just transact any kind of custom functions, or use any kind of custom functions within a query.
Dustin Getz: So how would you do ... We have a demo on our homepage http://www.hyperfiddle.net of using clojure.string/includes? from datalog?
Dustin Getz: So other than the limitation of no peer functions, just ... the basics, like the schema model would be identical.
Nikolas Göbel: Yep. So far, we haven't come across anything that's fundamentally different. It's still a work in progress, but the plan is ... if you run this on top of Datomic, you don't even need to do anything because we just read the schema from the transaction log and set everything up. There might be a thing about mapping data types between Java and Rust, keywords for example, but in principle it should ... Yeah, I think in principle we support the same data types.
Dustin Getz: What about speculation as in datomic.api/with?
Nikolas Göbel: Yeah. That's cool. Uh, so for the historical side of things, it would be supported, so it's fully immutable, you can just say "Give me the state of the stream at some transaction in the past", but it's not too interesting, because then it won't change ever after that. But, what we do do is give you this bitemporal modeling, where you can say "Give me the stream as of a time in system time and show me how event time evolves" and the other way round, possibly with more than two dimensions. What you can also do is these speculative queries where you say "I have the stream of results and at some transaction in the past add some speculative facts into the database" and then you will get the diffs from the main line to this new, speculative line. That is also supported.
Dustin Getz: So if I understand, that means that as your staging area accumulates changes incrementally, the queries can still react to the changes.
Nikolas Göbel: Yes. If you go back in time and speculate with some transaction, you don't have to recompute everything fully up to where you are now, but instead you get the changes from what your computation did before. Say you compute a query, and then one of the results of the query, x, you basically invalidate x with a speculative transaction. Then within the new line of computation, you won't get the whole resultset minus this invalidated tuple x, rather you will just get (x -1) or "take away this tuple x". And you can keep whatever you had before. It should be, as I said, it's still a work in progress, but it should be pretty efficient. Like having a version of a UI that's rendered with React or something, and I render it twice with a speculative stream and one with the real stream, that should all be performing as well as it could reasonably.
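The "(x -1)" idea in this answer can be illustrated with a small sketch. This is not declarative-dataflow's API: the helper names `consolidate` and `speculate` are hypothetical, and a collection is modeled simply as a list of (tuple, diff) pairs. It only shows that the speculative line is described by one extra retraction rather than by a fresh copy of the result set.

```python
from collections import Counter

def consolidate(diffs):
    """Sum the diffs per tuple and drop tuples whose net count is zero."""
    counts = Counter()
    for tup, diff in diffs:
        counts[tup] += diff
    return {tup: n for tup, n in counts.items() if n != 0}

def speculate(base, extra):
    """Hypothetical helper: the speculative line is the base diffs plus the extra ones."""
    return consolidate(base + extra)

# Main-line query results, expressed as additions (+1) of tuples.
main_line = [(("alice", 30), +1), (("bob", 25), +1), (("carol", 41), +1)]

# A speculative transaction that invalidates one result tuple, x.
x = ("bob", 25)
speculative_diffs = [(x, -1)]  # "take away this tuple x"

main_result = consolidate(main_line)
spec_result = speculate(main_line, speculative_diffs)
```

A consumer that already holds `main_result` only needs to apply `speculative_diffs` to arrive at `spec_result`; everything else it had before stays valid.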
Dustin Getz: Okay. So when you talk about it sitting on Datomic, so you've got Datomic as a source of truth, so you're not even worried about the transaction side.
Nikolas Göbel: Not at all. Yeah.
Dustin Getz: The backend is a Rust server that can run distributed?
Nikolas Göbel: Yes.
Dustin Getz: And, how about a browser?
Nikolas Göbel: Yeah, the WebAssembly story again. So it started as a project that was intended to run in the browser. Not differential itself, but just the Datalog stuff on top of it, and it works. It runs in the browser, but you're running a server-grade thing with lots of systems level features in a single-threaded browser environment. That doesn't make sense, but in principle you could. The intersection would be ... the client in the browser is running something like DataScript, and you register the needs of all of your clients as Datalog with the server, and then you feed the result stream that you're getting into DataScript or into multiple DataScript instances and then you just work with them locally.
Dustin Getz: If you register the queries with the server, then the query can't change, and what's the point of having DataScript at all?
Nikolas Göbel: That's true. If you do some kind of synchronization between server state and local only state for a web application, and you want to have UI state handled in DataScript as well, then you would use the server part to feed data in and then deal with it locally.
Dustin Getz: Okay. But for getting local state to the UI for any type of state that's source of truth in the database, DataScript doesn't add any value.
Nikolas Göbel: No. You get a web socket connection with tuples arriving and you do whatever with them. So for example, for visualization stuff, we usually feed directly into D3.
Dustin Getz: So for the peer functions you can't do ... what if you know the set of functions in advance?
Nikolas Göbel: Then it's no problem. Basically if you know what you want to do and you just implement them as built-in functions in Rust, it's quite flexible. You can even change the underlying data model, meaning how to arrange the tuples, or change the frontend language, if you want to use something other than Datalog. But what you can definitely do, is add more specialized operators, add specialized built-in functions and implement them all in Rust. That would be the easiest.
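As a rough illustration of "knowing the set of functions in advance", here is a minimal sketch of a fixed registry of built-in predicates that a query may refer to by name. It is written in Python for brevity rather than Rust, and the registry `BUILTINS`, the names in it, and the helpers `resolve` and `filter_tuples` are all assumptions made for this example, not part of declarative-dataflow.

```python
# Hypothetical fixed registry: queries may only name functions listed here.
BUILTINS = {
    "str/includes?": lambda s, sub: sub in s,
    "<": lambda a, b: a < b,
}

def resolve(name):
    """Look up a built-in by name; arbitrary user code is never evaluated."""
    try:
        return BUILTINS[name]
    except KeyError:
        raise ValueError(f"unknown built-in: {name}") from None

def filter_tuples(tuples, name, column, arg):
    """Keep the tuples whose value in `column` satisfies the named predicate."""
    pred = resolve(name)
    return [t for t in tuples if pred(t[column], arg)]

people = [("alice", 30), ("bob", 25)]
matches = filter_tuples(people, "str/includes?", 0, "li")
```

The design choice is that a query carries only a name, so the server decides up front which operations exist and can reject anything else when the query is registered.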
If you need something more complex where you want to integrate with completely different runtimes, what you can do is run a computation in whatever language you need, Python for example, and use the same kind of API. I declare my interests in Datalog, I receive data, I run some computation, and then I publish whatever I'm computing. I publish it as a new attribute to this differential cluster, and then everyone else can just continue using this attribute as if it would have been stored in the database, but it never actually touches Datomic.
Dustin Getz: But you have to model it in schema.
Nikolas Göbel: You need some way of avoiding naming collisions and giving good names to these things, but you don't actually have to model it in your schema.
Dustin Getz: So it's not in Datomic schema. But there would be a separate layer of schema that has these derived attributes.
Nikolas Göbel: We're not completely clear on the best way to do this right now. We are imagining something like ... it would be part of the Datomic schema in such a way that there is just this attribute registered, in Datomic, with some metadata telling you "Hey, this is not an actual attribute, but this is a derived thing". Then just use this to keep the names synchronized and avoid name clashes but everyone else has to talk to differential to use those derived names. But, this is intended for things which are transient, which never need to be persisted or anything. For those you can also feed data back into Datomic or into Kafka or whatever.
Dustin Getz: So this is funded research through the university system?
Nikolas Göbel: No, this is not really funded. The underlying technology, Differential Dataflow, goes back to Microsoft Research and the person who did that is doing it now (Frank McSherry). He's my advisor at ETH. It's his system now, it's a complete rewrite of the original system, MIT licensed. He has some PhDs here working on parts of it. For me it's my master's thesis and the rest is outside of the university context. So we're working to commercialize it, to bring it to companies for replacing Spark and stuff like that.
Dustin Getz: Okay. So tell me more about your commercialization plans. This is speculation at this point, right?
Nikolas Göbel: What I do on the side is plain old software consulting. This is intended to be a business, but we're not quite sure yet in what capacity. The core of the system, what is on GitHub and now my Datalog stuff, that is all going to be licensed the way it is right now and stay like that. We're mostly looking at doing consulting with this technology. We're talking to companies about replacing data pipelines that they have, and getting them customized query engines, domain specific stuff, things like that. We would be interested in looking at collaboration with the Datomic people, but there's no plans there yet.
Nikolas Göbel: What is not open source yet because it doesn't exist yet and we were thinking about keeping it proprietary, is related to the optimizations required to use this with many, many thousand concurrent users, which would then share parts of the same data flows and rearrange things a bit smarter. Basically like a query planner in a relational database. This is why I'm talking to people now, just figuring out kind of what people are interested in. I get the RethinkDB thing quite a lot. But we're not sure yet what this will be commercially, but whatever is available now, will just be available.
Dustin Getz: So this would be, Hyperfiddle needs something like this in the future, but it also needs the Datomic Ions product. Because when you run Hyperfiddle in a serious configuration, that's the secret sauce to making it work at scale. So, how ... it strikes me that it still could work with what you've built in Rust. It wouldn't run in Ions, but on AWS in an elastic configuration, and it can drop-in replace any Datomic query that does not use peer functions.
Nikolas Göbel: Yes. Ions is actually interesting, I think, because Ions is somewhat of a ... it shows that their philosophy to keep the thing working at scale is, have everything move the code into the same processes where you have the peer caches. Which is a super cool approach, but somehow opposed to what we're doing. We're saying don't mind the runtimes, don't mind the caches, as long as you can speak Datalog, we can efficiently get you whatever you need. Imagine the Datomic peers coordinating with each other. Basically, they have the same problem, right? They need to know what to keep locally, based on the queries they're seeing, and so this is a way of abstracting this without having to be part of the caching network. In a way, we don't have to run within Ions, but we can definitely, as you said, run in AWS and interact with code that is running in Ions.
Dustin Getz: You just said something interesting. Is it true that in a future world where you worked closely with Rich Hickey, you could provide an implementation of the Ions caching layer, which right now is probably a very simple cache. Maybe it's not simple anymore. I don't know. A LRU cache or something, as you might expect. It pulls the data it needs and whatever. You can reimplement that better.
Nikolas Göbel: I don't want to say better when Rich Hickey is involved, but in principle it could be, yes. The peers would keep track of all the queries and either do something clever like trying to find a superset of them or just register them all with each other as differential dataflows.
More on Datomic parity: peer functions, subqueries
Dustin Getz: So, you can't do peer functions, but you can do them if they're implemented in Rust. Now, can you do an arbitrary peer function if it was implemented in Rust?
Nikolas Göbel: Yes. The underlying, the differential layer is intended to ... everything you can do with your computer, you should be able to implement as this kind of dataflow. The guarantees that we give about incremental updating – that all the differential operators will only ever do work on the order of whatever you're changing – this will be broken, if you have operators that use arbitrary code, which itself might not be incremental. For example, if you do an aggregate and your aggregate function is a Rust function with lots of loops over all of the data or even doing requests to some other system, then of course the guarantee will be broken. What you could do then, is use other techniques, simpler techniques for single-threaded incremental execution, like dynamic programming basically, and implement your operators like that. You have to be clever and you have to know the system to maintain the guarantees, but for simple things like custom aggregation functions, predicates, whatever, that shouldn't be a problem.
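To make the difference between an incremental and a non-incremental aggregate concrete, here is a minimal sketch, written in Python rather than Rust. The class `IncrementalSum` and the function `full_recompute_sum` are hypothetical names for this example. The first does constant work per change; the second loops over all of the data each time, which is the kind of operator that would break the guarantee described above.

```python
class IncrementalSum:
    """Maintains a sum from (value, diff) updates; each update is O(1)."""

    def __init__(self):
        self.total = 0
        self.count = 0

    def update(self, value, diff):
        # diff is +1 for an insertion and -1 for a retraction.
        self.total += value * diff
        self.count += diff
        return self.total

def full_recompute_sum(values):
    """Non-incremental version: work is proportional to the whole input."""
    return sum(values)

agg = IncrementalSum()
for v in [10, 20, 30]:
    agg.update(v, +1)
agg.update(20, -1)  # retract 20
agg.update(5, +1)   # insert 5
```

Both approaches agree on the answer for the final input `[10, 30, 5]`; they differ only in how much work each change costs.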
Dustin Getz: How about sub-queries? Calling Datomic API as a peer function inside of Datalog.
Nikolas Göbel: What do you want to call? Like a specific thing?
Dustin Getz: Let me ... I'm going to make this a concrete example.
Dustin Getz: Okay. So, this is a little Hyperfiddle prototype and see this query here, this is extremely hacked and not idiomatic, but the point is it's possible. It's something that we do sometimes.
Nikolas Göbel: Oh, so you call out to, you just executed another query within a query.
Dustin Getz: Yeah, and this works.
Nikolas Göbel: That's cool that it works. Interesting. Cool. For subqueries, the way we intend them to be used, this would definitely not be possible out of the box because, as far as I know, there isn't a Rust client for Datomic. This looks like ... I'm not sure if it's doing anything special, but what this looks like, is if you register the query within it as a separate rule, why would that not work?
Dustin Getz: I think that would work, yes. I don't really understand fully rules.
Nikolas Göbel: Okay. So to my mind, everything, like the query on top is just a weird way of writing a rule and everything else is just rules. So the unit of composition for queries would be rules. In this case, if you register a query on the server, you can give it multiple queries and tell the server not to publish all of them globally, and just reuse them locally. This is also how you get mutually recursive definitions and this looks like it would be just a nested rule there, where the :bitb.bankrollupdate stuff would be a new rule and then you use that from within the top level query.
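For readers who, like Dustin, are not yet comfortable with rules, the following sketch shows what a recursive rule means. It evaluates the classic pair of rules `reach(x, y) :- edge(x, y)` and `reach(x, z) :- edge(x, y), reach(y, z)` by naive iteration to a fixpoint. This is an assumption-laden illustration in Python: it recomputes from scratch and is not how Differential Dataflow maintains recursion incrementally, and the function name `transitive_closure` is made up for the example.

```python
def transitive_closure(edges):
    """Naively apply the two `reach` rules until no new tuples are derived."""
    reach = set(edges)  # rule 1: reach(x, y) :- edge(x, y)
    while True:
        # rule 2: reach(x, z) :- edge(x, y), reach(y, z)
        derived = {(x, z) for (x, y) in edges for (y2, z) in reach if y == y2}
        new = derived - reach
        if not new:
            return reach
        reach |= new

edges = {(1, 2), (2, 3), (3, 4)}
reach = transitive_closure(edges)

# A "top-level query" is then just another use of the rule's result:
reachable_from_1 = sorted(z for (x, z) in reach if x == 1)
```

The top-level query at the end reuses `reach` the same way the nested query in the discussion could be registered as a rule and reused from the outer query.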
Dustin Getz: Yeah. That's interesting.
Nikolas Göbel: I was going to ask because I looked at the live demo on the website, and I think it didn't have support for rules, is that right?
Dustin Getz: We haven't plumbed it through yet. We'll do it eventually.
Nikolas Göbel: That's probably quite risky on your end, if people shoot themselves in the face with the recursive rules. Same for us, but that's quite a bit of power there.
Dustin Getz: Yeah. We basically would need to do something like Google App Engine or AWS does or Ions and run things in lambdas, such that it's bounded and metered and they pay for their own capacity. It can be terminated if it doesn't respond in the time frame. We're on the path to having that.
Nikolas Göbel: Cool. I think this would work. Calling out to Datomic in general from Rust would probably be hard.
Dustin Getz: Right. You would need to mirror the API exactly, essentially in Rust.
Nikolas Göbel: Yeah.
Feasibility of a Clojure implementation
Dustin Getz: What if we implement your abstractions in Clojure?
Nikolas Göbel: I think implementing the underlying system, all of Differential in Clojure, in principle of course then you'd have the same thing, but it's going to be hard. I thought about that myself. At the start I wanted to just reimplement the whole thing in Clojure, but it's quite an involved system. Especially the details of resolving things like recursion incrementally without violating anything. There's quite a bit of thought that went into the Rust implementation, so it would be non-trivial.
Dustin Getz: How many man-years of thought do you think is in that Rust implementation?
Nikolas Göbel: It builds on a long history of ... it's basically one of the only systems doing this kind of differential computation model. There's one proprietary database, it's called LogicBlox, which apparently is doing some of the same things. It's a pretty unique thing. A lot of man-years. I can't really do the comparison because it's, man-years by people (Frank and the rest of the Naiad team) who have thought about this for a long time before. I'm just beginning to grasp the things going on in this system.
Dustin Getz: Okay. So it's not something that someone can learn. It's something that you have to find the team who did it?
Nikolas Göbel: No, you can definitely learn. There's documentation on how it works. There's papers on this. You can definitely learn. I've been doing that for the last year, but it's, I think, a nontrivial distraction to try and build this by yourself. I would say the easier thing would be looking into getting WebAssembly to be a unified runtime. That would also be pretty experimental right now. There are experiments in running WebAssembly binaries directly on Linux and there is definitely proof-of-concept that this system runs on WebAssembly and that ClojureScript runs on WebAssembly. That I think, would be the much easier and nicer way, and would open it up to everything that can compile to WebAssembly. Will also take a while but seems more realistic at this point.
Niko's plans for the future
Dustin Getz: Tell me more about your role in this. You're a grad student and you also are a consultant in your spare time.
Nikolas Göbel: Yes.
Dustin Getz: What are you doing next?
Nikolas Göbel: I'm a grad student now. This has been a research project so far, and this will be my thesis project for the next six months. Apart from that, and during the thesis, we'll be mostly looking at the concurrent users use case, so optimizing this thing automatically instead of hand-building data flows and then we'll see. We're talking to quite a few people now, and I'd love to have this be a thing, be a business and focus on this space, data infrastructure, query languages, stream processing. We'll give this a couple of months to get feedback and see where it goes, and then decide next year whether we're actually going to do something with this full time.
Dustin Getz: What's the biggest risk that you see?
Nikolas Göbel: For you as a user?
Dustin Getz: For the technology as a whole.
Nikolas Göbel: The usual things, like ecosystem and documentation and mind share, because if you compare this to ... It completely, as a stream processing system without all of the Datalog fanciness on top, it completely blows out Beam and Spark and all of these things. Still you would not be able to comfortably give this to someone and say "Go and build your system with this". This is probably by far the biggest risk, that right now you need someone to build it for you or maintain it for you.
Dustin Getz: You're quite bullish that the technology does what you think it does in that it has great value in the applications that you think it does.
Nikolas Göbel: Yeah, definitely. We're using it internally now for ... We're not at a scale where it's totally interesting, but we're building stuff with it, so we're hoping to discover issues through that. Differential has definitely been benchmarked a lot of times in a research environment, comparisons against other systems, where it definitely does what it's supposed to do. I'm quite optimistic that it at least can provide a valuable new angle to this whole incremental query maintenance thing.
Nikolas Göbel: Risk wise, I can't really say anything more specific than that it should be put into a real-world scenario, before we can say more. We're hoping to do a proof-of-concept end of the month, and see whether it breaks down horribly. It's a rather simple use case, there's no datalog involved. It's just running this thing in production as a stream processing system. So if that goes fine and goes the way we're planning, then we'll definitely be more confident.
Stream processing implementation
Dustin Getz: So when you mean run it as a stream processing system, I've been picturing the stream basically being the Datomic transaction log. It can be anything else that's shaped like that.
Nikolas Göbel: So for the underlying thing, it can be anything else and you just feed it into this fairly generic data model. It's based around immutable collections, just like Clojure, but instead of being oriented around snapshots and trees and stuff like that, they're just accumulations of differences at various points in time. This is a rather flexible model, because the data part of the collection can be almost anything. We use it a lot for graphs and stuff like that, which can be done much more efficiently than Datomic, which is much more general of course. You can feed it almost anything and we're definitely not only looking at connecting to Datomic, so we're connecting to Kafka and reading from files and connecting to Postgres and stuff like that.
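The phrase "accumulations of differences at various points in time" can be sketched as follows. This is only an illustration under assumptions: the triple layout `(data, time, diff)` mirrors how the model is usually described, but the function `as_of`, the integer timestamps, and the sample edges are invented for this example and are not the library's API.

```python
from collections import Counter

def as_of(updates, t):
    """Accumulate all diffs with time <= t and return the tuples then present."""
    counts = Counter()
    for data, time, diff in updates:
        if time <= t:
            counts[data] += diff
    return {d for d, n in counts.items() if n > 0}

# A small graph expressed purely as differences over time.
updates = [
    (("a", "b"), 1, +1),  # edge a->b added at time 1
    (("b", "c"), 1, +1),  # edge b->c added at time 1
    (("a", "b"), 2, -1),  # edge a->b retracted at time 2
    (("a", "c"), 3, +1),  # edge a->c added at time 3
]

snapshot_t1 = as_of(updates, 1)
snapshot_t2 = as_of(updates, 2)
snapshot_t3 = as_of(updates, 3)
```

Nothing is ever overwritten: every past state stays recoverable by accumulating up to an earlier time, which is what makes the collection immutable in the sense discussed here.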
Dustin Getz: Okay. So it doesn't have the limitation of limited write throughput that Datomic theoretically has.
Nikolas Göbel: No, it hasn't. Having the limited write throughput is, I think, nice as a source-of-truth because you will get perfectly ordered consistent things, but you can definitely use Differential with something like Kafka and then use maybe some approximate timestamp to stay as consistent as possible.
Dustin Getz: You could use it as an input to Datomic too. You could have some real-time stream that filters down to a database value and then inputs ... I don't know why you'd want to do that though actually, because it already has the ability to have queries across it.
Nikolas Göbel: Yeah, I think there's still a lot of value for having either Datomic or DataScript as databases to talk to if you can afford it, from a scalability point of view. Testing and just dealing with database values locally in general. I don't think that would be as nice within the stream paradigm because ... Yeah, I think that would be weird. So ideally you could use Datomic and just tack this on if you run it in production and need the real-time sync and the reactiveness.
Whatever the underlying system offers in terms of write throughput ... unless it's much higher than what we can do, but databases usually have to do much more stuff and so probably will not be ... So whatever kind of throughput and consistency guarantees the underlying system offers, if there's some kind of consistent timestamp, then we can use it and all of the results will maintain these same guarantees.
If you do not require these kinds of consistency guarantees, there are ways of making the system much simpler and much more scalable, if it's just best effort. One of the bigger selling points of this is that it actually does things correctly. So the same way you would, if you run the query directly in Datomic, you expect certain kinds of transactional guarantees, and we would maintain those. If you don't need those maintained, then there are probably ways to tune the system, or to use an entirely different approach to get an 80% solution with no guarantees.
Dustin Getz: Well this is amazing. Is there anything else that I haven't asked that we should talk about?
Nikolas Göbel: From my side? For Hyperfiddle, just from the demo that you see on the screen, it's pretty clear or it seems pretty clear what the use case could be. From my point, it's basically I want to know whether the real-timeness of it, the RethinkDB, the Meteor-ish aspects are actually value adds or whether that is mostly for fancy websites. But no, this was the gist of it. I will be talking about this at the Conj, and I have a blog post about this, a very high-level thing. We will be publishing more in-depth blog posts, demo use cases and stuff like that to play around with. And right now, the frontend and the backend are completely unstable in that we work on one and break the other. It's not something that's super nice and usable right now. We'll try and get there over the rest of the year.
Dustin Getz: Okay, well I'll digest this. I'm looking forward to your talk in Raleigh.
Nikolas Göbel: You will be there as well, right?
Dustin Getz: Yeah, of course. Yeah.
Nikolas Göbel: Cool. I think this would be quite a cool combination, because I think you also talked to the Datomic people, right?
Dustin Getz: Yeah.
Nikolas Göbel: I think talking with you and talking with the Datomic team would be pretty interesting in seeing how we can get this story for most of the Datomic community. I'm not sure what the plan is with Hyperfiddle, but I can imagine things like visualization and these kinds of business app use cases. If you work towards them, I can imagine this being quite a natural fit to create dashboards, create all these kinds of systems.
Dustin Getz: The interesting thing about Hyperfiddle, is it tries to be like a general purpose, one size fits all back end for ... the 90% of software, which is kind of what Meteor tried to do, except it turns out that the corners that Meteor cuts matter a lot with data corruption. So Hyperfiddle goes really far, but the real-time aspect matters to a lot of people making applications. You look at what people spend money on in the future as they develop their apps. From the beginning, they were all spending a lot of engineering resources and trying to do real-time updates [crosstalk 00:36:33], especially simple things like, when I submit a transaction and I need to make sure that I never get stale reads or that type of thing. People spend a lot of energy in fixing little things like this. A system like this, if it was general purpose and no one noticed, that I think is really important.
Nikolas Göbel: Yeah, I think so too. Also, which is why I focused on Datomic so much with the work is, I think it's uniquely suited to be that kind of thing where you have an insanely good database for development purposes and for actually making sense of information. And then something that closes the gap for regular people. There was something just a while ago on Hacker News, basically drag and drop these kinds of enterprise applications together. (Retool.) Yeah. Right. I think that was what it was called, but I think Datomic has a lot more to offer for these kinds of things, if you look at auditing and historization. I think people are always blown away if you build software with Datomic and you show them something like "here you can flip through your changes". "You can flip through the history of this graph". There should be something of value there.
Nikolas Göbel: So yeah. That's super cool. I'm looking forward to hearing more and seeing more there and see you in a month.