Citus Data (YC S11) Wants To Make Scalable Data Analytics Accessible To Anyone

smilliken · on June 27, 2012

We've been using Citus at MixRank for storing our timeseries data, and it's worked out magnificently well for our use-case.

A few points:

(i) We can do ad-hoc realtime analytics on hundreds of millions of data points.

(ii) We can also do realtime analytics on billions of datapoints as long as we pre-compute along one dimension.

(iii) We could do a lot better at (i) and (ii) if we invested more heavily in hardware (and Citus would make this pretty painless, actually).

(iv) I'd normally not consider a closed-source solution personally, but since Citus is based so heavily on PostgreSQL (protocol-level compatibility, configuration, codebase), this has been a non-issue for us. We can still lean on the amazing PostgreSQL community, documentation, and for the parts we don't have the source code to, the Citus team has been very helpful in explaining how things work.

(v) Fault tolerance is immaculate. At the node level, PostgreSQL is notoriously one of the most reliable and robust databases available. At the cluster level, Citus will magically fall back to a replica mid-query when a server dies.

(vi) Although realtime inserts are not supported out of the box, the system is flexible enough that we were able to get this working on our own without help from Citus.

(vii) Schema migrations are also not supported out of the box, but we built a schema migration framework that takes care of this for us.

(viii) We're not worried about vendor lock-in, since the data is just stored on our servers, in the PostgreSQL serialization format. If we wanted to, we could just give up the features that Citus gives us and build our own data-access layer on top of our cluster.

Anyway, it won't be everything to everyone, but it works very well for our OLAP use-case of timeseries ad impression data. I'd definitely recommend looking into it if you're otherwise considering Hadoop, Vertica, Aster, Greenplum, or a sharded MySQL/PostgreSQL setup.

Full disclosure: I am extremely biased since I've gotten to know the team very well after using Citus. I'm definitely one of their biggest fans, if for no other reason the amount of time they've saved us at MixRank.

nwyc2012 · on June 28, 2012

Hi Smilliken Would it be possible to elaborate a bit on how you do real time inserts/updates? I'm interested in trying Citus but would most probably need the realtime feature for production use

smilliken · on June 28, 2012

I'd be happy to help, but this is probably out of scope of an HN comment. Feel free to reach out to the email in my profile.

johnpmayer · on June 27, 2012

This sounds a lot like AsterData Database, which I know has been around for at least a few years. I'm interested to know if you are able to write queries that define explicit parallelism like in the SQL-MapReduce language extension, and also the ability of the query preprocessor for the distributed workload.

ozgune · on June 27, 2012

Hey, we currently don't have an SQL/MapReduce language extension, but we do have the Map & Reduce execution primitives implemented under the covers (for parallel query processing).

For the distributed query processor, we can efficiently parallelize SQL queries that involve look-ups, complex selections, groupings and orderings, analytic functions, and joins between one large and multiple small tables. We also have a lot more coming; are there any queries that you are particularly interested in?

pella · on June 27, 2012

any information about PostGis compatibility?

ozgune · on June 27, 2012

In all honesty, I'd have to check. The worker nodes in our architecture will be able to use PostGIS indexes just like regular PostgreSQL instances, and the master node should correctly handle (partition) most PostGIS functions. Still, we'd probably need to implement parallelization for the && operator, which shouldn't be that hard.

That said, I don't want to misguide you here before going over PostGIS' documentation more thoroughly. If you could ping us at engage@citusdata.com, we'll send you a reply once we know for sure.

benbjohnson · on June 27, 2012

Distributed SQL queries are cool and accessible to people but I feel like projects that apply relational languages to event data don't make much sense. If I have click stream data then I'm more interested in knowing what users are doing after they performed action "A", "B" & "C" than rolling up how many users performed a single action "A". SQL falls flat on its face for this type of analysis.

Also, the name also threw me off. I thought it was "Citrus" and not "Citus".

[Full disclosure: I am writing an open source, distributed, behavioral database - https://github.com/skylandlabs/sky]

ozgune · on June 27, 2012

There's value in both types of analyses. For knowing what users do after they perform action "A", "B" & "C", many people currently rely on implementing Map/Reduce programs.

That can be a bit heavyweight if you want to simply compare people who did action "A" or "B", filter based on complex criteria, or apply simple analytic functions. Also, apart from standard relational algebra operators, SQL provides a lot of convenience functions for math operations, string manipulations, date and time formatting, pattern matching, and so forth. These may come in handy to users who want to quickly gather insights out of their data.

benbjohnson · on June 27, 2012

You're right, there is value from SQL over event data, however, I feel like it's a missed opportunity to simply apply the same paradigms to a different type of data. I'm not suggesting that SQL be thrown out but a new language needs to be available specifically for event data.

bgilroy26 · on June 27, 2012

I'm working outside the bounds of my understanding here, but why can't that result be formed from a query that pulls from the user_history table and a subquery that pulls from the user_history table with different conditions on each one?

benbjohnson · on June 27, 2012

Self joining is a good thought. It is possible to write a query like this but handling n+ steps (A..B, A..B..C, A..B..C..n) is eventually going to make your query optimizer sh*t all over itself in all likeliness. You're not simply joining two tables but you also have a temporal relationship between each event. For example, if you're looking for users who performed events A then B then C, then you need to self join B to A making sure that all events in B are after A and then all events in C are after B.

Beyond that you have to worry about whether A, B and C are all within the same session. Trying to define a session such as "all events that occurred until there is 30 minutes of idle time" is going to be damn near impossible in the SQL query.

scottedwards200 · on June 27, 2012

Couldn't a recursive query help here? Oracle has had CONNECT BY for years, and SQL Server has had recursive queries since 2005 (not sure about open source).

And, if SQL isn't the best way, what is? Map-reduce seems hellish to write to me, requiring very advanced developer skills - maybe some Python code?

benbjohnson · on June 27, 2012

Oracle's CONNECT BY is a good idea. I forgot about that. CONNECT BY is technically for hierarchical queries but you could probably hack something together between connect by and windowing in Oracle. I'd be curious to see how the performance is.

Part of my behavioral database is a query language for event data. It's called EQL (Event Query Language). I don't mean to slam SQL to try to make my solution sound better. I wrangled event data with SQL at a company years ago and it was awful. Data was denormalized and stored in rows but we ran into issues of max column lengths and row chaining not to mention a ton of custom processing.

I've also tried using Redis as an event store but you really need your data processing on the same box as the data to get good performance. For example, Redis supports ~100K calls per second (depending on the command used) on a single box. You can retrieve multiple events at one time but you're still going to hit a network performance and CPU bottleneck serializing all that data, not to mention that you still need to process it. I built my behavioral database to compile EQL to optimized machine code using LLVM and then iterating over events in memory-mapped data files. As a comparison, I'm traversing about 50 million events per core per second.

I'm not trying to knock RDBMS systems or Redis. I used to be an Oracle DBA and I've used Redis on plenty of projects. Behavioral data is just a different beast though. It needs fundamentally different tools.

ozgune · on June 27, 2012

It's doable, but the paradigm doesn't naturally fit into SQL. Users typically need to use sub-selects and self-joins for this kind of "A", "B" & "C" analysis; and that introduces inefficiencies.

bkj123 · on June 27, 2012

This is where a good analytical (as opposed to operational) data model comes into play. Something less normalized that is intuitive and efficient to use.

benbjohnson · on June 27, 2012

Event data is difficult to denormalize within a relational database. You can group events by actor (e.g. the user performing the events) and stuff it all into a single row. You'll get great performance but you'll have to custom process your data which is in a custom format. SQL functions will become useless.

You can also store every event as a row and pull it out through Hadoop to process (as another commenter mentioned) but you're going to get huge performance bottlenecks just in extracting the data from the database, serializing it, transferring to another server for processing, and then deserializing it. Not to mention that MapReduce is batch-oriented so it's not going to be real-time.

bgilroy26 · on June 27, 2012

In a typical data environment for an event table, you can have a user_history table with the unique_primary_key, timestamp, user_id, attribute_column_name, Previous_value, and Subsequent_value.

If you want to know what users who did X and later did Y, couldn't you select all of the users who did X in a subquery and then find out how many of those user_id's match user_id's of people who did Y, where the timestamp on Y is between the X's timestamp and X's timestamp + some_predetermined_amount_of_time?

I am out of my depth, but isn't there some database best practice for tracking user session start and stop times?

Edit: I wrote this comment before I saw a different comment of yours above, which I think answers my question.

pella · on June 27, 2012

"Features Not in v1.0"

http://www.citusdata.com/documentation#missing-features

kermatt · on June 28, 2012

"Real-time insert, update, or deletes issued against the master node."

Is this a bulk/batch load only system then?

spathak · on June 29, 2012

That is correct. This is primarily a bulk-load system. There are setups (as mentioned in the smilliken's comment above), where it can be used for real-time inserts, but requires more hands-on setup and configuration.

edouard1234567 · on June 27, 2012

Congrats Umur and team. I look forward to trying it for ZeTrip analytics.

arrsingh · on June 28, 2012

Congratulations on the launch! I've been following this company for a while and its great to see the public launch!

emre · on June 27, 2012

Congrats to the Citus team! It looks like a great product, will definitely use it!

kolistivra · on June 27, 2012

Looks very promising, I will definitely give it a shot! Good job!

seboavalin · on June 27, 2012

Scalable data analytics accessible to anyone? Great!

pinarsezer · on June 27, 2012

Love the video btw!