Pulsar vs. Kafka (streamnative.io)
112 points by ceohockey60 on July 10, 2020 | hide | past | favorite | 99 comments



> Lower end-to-end latency helps enterprises gain business insights faster.

They lost me here. I can think of plenty of situations where reduced latency is beneficial, but not many situations where shaving a few milliseconds would make a difference to “business insight”!

Although I suppose it is strictly correct, in the tautological sense...


So I've developed an in-house fancy thoroughbred 'real time' data warehouse, a very rare beast indeed, and it's awesome.

Of course, our business is still running on nightly reports. But the tech is cool!

So I wanna say you're wrong, but I've got man-years invested in a system that hasn't been utilized to its full potential yet :(

Time will tell. Somewhere, some competitor will be using real-time insights to out-compete us.


Nightly to seconds is definitely great. Seconds to milliseconds is what is questionable I think.


> Nightly to seconds is definitely great

Why? Can a business mobilise in anything less than days? If a report is minutes out of date, is that any loss? Given that some largish proportion of reports are never used, perhaps better management is key.

Not disagreeing, but efficiency is not just a matter of quickness.


Nightly to seconds = developers can see the data generated within seconds.

This generally has implications on data quality, because you aren't fixing data quality issues once they occur in prod.

This also makes it far easier to develop ETL pipelines. You don't need complex tooling to see whether your ETL pipeline works.

You can technically fix data quality and dev velocity issues without improving data freshness, but a quick glance at the data engineering landscape tells you they aren't being solved enough.


I work in oil and gas with process optimization, and we are working on autonomous tuning of the production process in a way that is not possible with daily data updates and manual human intervention.

Quite simply, the end goal is increased oil production and reduced power consumption. Realtime data flows are a big deal when reading from thousands of sensors for just one process system and there are hundreds of tunable parameters. All this has to be safe also.


I agree that shaving dozens or hundreds of milliseconds of latency is hardly noticeable to end users. But latency is an indicator of how well the system can perform and scale. High latency under normal load can reveal design or implementation flaws in the software (assuming it runs on reasonably modern hardware). Ultimately you want a system that can scale up and deliver consistent latency, so low and consistent latency is a health meter for that. Within the same cluster (no cross-network I/O), Pulsar has a pretty impressive 5ms pub/sub latency for persistent topics (including writes acknowledged after hitting non-volatile disk).


I agree with everything you're saying about the value of reduced/smoothed out latencies and I suspect the parent agrees with that as well, but the parent was mostly teasing about the choice of "business insights" in lieu of all you've mentioned.


All those articles of kafka vs pulsar are always biased (this one is from a company selling pulsar). There are so many of them that I can't get an opinion on which one is good for what.


Pulsar is more flexible and fault-tolerant. For me the most important thing is that a client can request the log queue starting from a specific log by ID, and it has better retry mechanisms for logs that failed to be processed. But it has absurdly bad documentation. I had to learn many things about Pulsar by downloading the source of their Java library and just reading the code. Documentation on starting a bookie, ZooKeeper, Pulsar, and pulsar-proxy cluster was non-existent, and making it work like in their architectural diagram took a week of work and experiments, compared to literally 2 hours spent on Kafka.

BookKeeper has/had even worse documentation and setup. When I used it, a bookie couldn't use a mounted directory for log storage, so it's not really as persistent as they say. Before restarting, a bookie node had to be reformatted to allow the bookie instance to re-use logs saved on disk. The whole "logs are persisted" claim is true, but they don't say that you can't simply mount them in Docker and survive a restart of your PC.

Pulsar is good when you get it working. Documentation is really bad and it's hard to make it work. All the super-positive articles about pulsar are sponsored (streamnative and yahoo) and biased to make pulsar look much, much, MUCH SIMPLER than it actually is.


> For me the most important thing is client can request log queue starting from specific log by id

Kafka clients can start at a particular offset ID within a partition.


This is definitely an area where Pulsar is trying to improve; getting started is not easy. That said, the progress I have seen in the year since I got involved with it is really promising.


AWS should fork Pulsar and put out a v2 streaming product. Kinesis is kind of crappy (IMO) and doesn't seem to be improving much. If you look at the Pulsar architecture and feature set you can tell that it was designed very much with this in mind (something that large scale cloud providers can integrate with their infinitely scalable storage and compute systems).

It's not all hype either, according to this post https://jack-vanlightly.com/blog/2018/10/21/how-to-not-lose-... it seems like a solid piece of tech.


Instead of Kinesis we use AWS MSK (Managed Kafka), which is expensive but works quite well.


Calling MSK expensive is maybe not being clear enough. It's more than 2x the cost of the raw EC2 instances. Then on top of that, to get metrics at the per broker level there's an upcharge, and a further one for per topic metrics. These are things that are actually extremely important at scale, and it's absurd how expensive it gets.

Then you combine that with how immature the UX is, and it really just doesn't feel good to deal with. It's not that much work to run a Kafka cluster with the sort of design that MSK provides, and it can be done better than that without much cost.


How is Kinesis crappy?


I don't think the platform or the pricing model were well architected for a 1 producer, many (and growing) consumer use case which is IMO the most compelling use case for a streaming system.

All the "success cases", sample architectures and real in the wild systems I've seen built on top of Kinesis have 1-3 consumers max.

They added the enhanced fan-out to try to get around this but it seems like you have to (over)pay for a ton of provisioned capacity to ensure you get decent latency on a > 5 consumer use case. So much like Lambda, it's only "serverless" or non-provisioned for people who have super low expectations of what system software ought to be capable of. For everyone else, it's just a mediocre expensive provisioned solution.


With Pulsar vs Kafka, I don't see a huge argument between either one functionality wise as they have so much in common (distributed log, Java based, avoid copying memory, use Zookeeper). Because Kafka is more supported and well-known it seems Pulsar needs to be an order of magnitude more performant to capture developer mindshare.

I see the same with Spark vs Flink in that similarities outweigh differences. I wonder if this is some sort of emergent pattern in open source software.


There are real differences among them. Here are some painful aspects of Kafka:

1. A single partition is stored on one node (with replicas on other nodes). Because of this, introducing new nodes takes a very long time to replicate large partitions, since a partition can only be replicated from one node (the partition leader). In Pulsar, each segment of a partition is stored on a different BookKeeper node.

2. Because of 1, if two consumers read parts of a partition that are far from each other, they will compete over disk bandwidth. In Kafka a consumer cannot read from a replica node. If a topic is really popular and many consumers read from it (from different parts of the file, which makes the OS page cache useless), total consumption rate is limited to the disk bandwidth of a single node. But in Pulsar each consumer can read from different brokers. Catch-up consumers won't thrash streaming consumers in Pulsar.
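To make the contrast concrete, here is a toy sketch (not real client code) of the two storage layouts described above; the broker/bookie names are made up:

```python
# Illustrative sketch only: where a partition's segments live
# in each model. Node names are invented for illustration.

def kafka_style_placement(segments, leader):
    # Kafka: every segment of a partition lives on the leader
    # (plus full copies on each replica), so catch-up reads and
    # tailing reads of that partition hit the same node's disk.
    return {seg: leader for seg in segments}

def pulsar_style_placement(segments, bookies):
    # Pulsar/BookKeeper: each segment (ledger) can land on a
    # different bookie, so reads of different segments of the
    # same partition can hit different disks.
    return {seg: bookies[i % len(bookies)] for i, seg in enumerate(segments)}

segments = [f"segment-{i}" for i in range(6)]
kafka = kafka_style_placement(segments, "broker-1")
pulsar = pulsar_style_placement(segments, ["bookie-1", "bookie-2", "bookie-3"])

print(len(set(kafka.values())))   # 1 node serves all reads of the partition
print(len(set(pulsar.values())))  # 3 bookies share the read load
```

This is why a catch-up consumer and a tailing consumer contend for one disk in the first model but can spread across disks in the second.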

These are not problems that can be fixed easily. Additionally, in the realm of streaming, the difference between Flink and Spark is night and day. The low-watermark feature that Flink offers makes them behave fundamentally differently.


1. is true, but if you want that data to move to a new node, it still needs to be replicated. Kafka's approach is to use tiered storage (which I believe is close to completion).

2. Kafka can read from a replica node. It's relatively new but it's there.


That's true, but the limitation is still not fully resolved. In order to increase consumption rate, we need to add replicas. In Pulsar, brokers are merely cache nodes over BookKeeper; adding more brokers is trivial.


How does Pulsar get around the fact that when adding a new broker, data needs to be moved over before that broker can start serving it? This seems like a basic law-of-physics limitation to me.


Hey, I work on Pulsar, will try and answer this :)

Topics (actually bundles of topics, called bundles) are what is assigned to Brokers. Topic assignment is dynamic, so when a new broker is added, the system will try and shed load from the busiest brokers to even it out on the system.

But unlike Kafka, when a topic is assigned to a broker, it doesn't have much state to move, mostly it just gets metadata added to it and opens a new "ledger" (which is just a chunk of the topics data over a time window, only one ledger is ever open at once). When it needs to serve data, it pulls that from bookkeeper nodes from previous ledgers, so the process of re-distributing load is pretty quick, it also doesn't eagerly pull in a cache.
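A toy model of the reassignment described above: topic-bundle ownership is just metadata, so adding a broker rewrites assignments without copying message data. All names here are invented for illustration.

```python
def assign_bundles(bundles, brokers):
    # naive even spread, standing in for Pulsar's load shedding
    return {b: brokers[i % len(brokers)] for i, b in enumerate(bundles)}

bundles = [f"bundle-{i}" for i in range(8)]
before = assign_bundles(bundles, ["broker-1", "broker-2"])
after = assign_bundles(bundles, ["broker-1", "broker-2", "broker-3"])

# Only the ownership map changes; the topic data itself stays
# in BookKeeper and is fetched on demand by whoever owns the bundle.
moved = [b for b in bundles if before[b] != after[b]]
print(len(moved), "bundles reassigned; bytes of topic data copied: 0")
```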

Now, as far as the cache, that is primarily for "tailing reads", meaning, as writes occur, clients who are close to the tip of the recent data will just get it from the broker, without a need to pull it from bookkeeper. This is one of the key parts of how Pulsar has multiple tiers of storage that help it have such good consistent latency.

Beyond processing writes, the biggest thing brokers do is handle these "tailing reads", i.e., clients consuming right near the tip of the topic; this is the cache referred to. So a broker really serves three purposes:

1. Handling writes

2. Serving tailing reads from its cache

3. Serving catch-up reads by pulling older ledgers from BookKeeper


(copying this text from another comment of mine elsewhere)

Well, the Pulsar broker is (kinda) stateless, because they are essentially a caching layer in front of BookKeeper. But where's your data actually stored then? In BookKeeper bookies, which are stateful. Killing and replacing/restarting a Bookkeeper node requires the same redistribution of data as required in Kafka’s case. (Additionally, BookKeeper needs a separate data recovery daemon to be run and operated, https://bookkeeper.apache.org/archives/docs/r4.4.0/bookieRec...)

So the comparison of 'Pulsar broker' vs. 'Kafka broker' is very misleading because, despite identical names, the respective brokers provide very different functionality. It's an apples-to-oranges comparison, like if you'd compare memcached (Pulsar broker) vs. Postgres (Kafka broker).


Network is faster than disk. Once cached, then you are only bound by network IO for subsequent uses.


Sure- but how is this different than kafka's caching?


Pulsar is better for very large scale deployments provided you have people to manage it


Kafka is handling very large scale deployments just fine atm in all the big tech co's.

The only thing I can see that can make this true is Pulsar seems to have better elastic scalability. But it seems to score less on everything else. It has a much more complex storage system that ends up not matching Kafka's high-end throughput at large scale.

From what I recall, Twitter ended up abandoning BookKeeper due to storage scale concerns. Related: https://blog.twitter.com/engineering/en_us/topics/insights/2...


This is mostly due to the difficulties scaling DistributedLog more so than BookKeeper. DistributedLog basically had no contributors other than Twitter and was just too big of a mountain to climb alone. The blog post you linked goes somewhat into this but that is ultimately why the choice to transition away was made.

Pulsar likely would have been considered if it was more mature at the time and sported a community of comparable size to Kafka (it's still a long way from this).


Show me one


>it seems Pulsar needs to be an order of magnitude more performant to capture developer mindshare.

Just to add to this, ease of use/setup is also a huge factor. There are technologies I can just spin up with zero knowledge and learn as I go. These are huge factors in adoption especially with Golang and nodejs.


This smacks of being heavily one-product-focussed to me. Being a Kafka user it's hard enough managing and understanding one system, nevermind three or four joined together.

Maybe it's a bit faster or a bit more elastic, or whatever, who knows. What I really care about is whether I get called at 3am and in that regard the argument seems pretty weak. Kafka for all its woes is a solid system you know you can count on.

I'd much rather see someone come up with a truly innovative alternative that actually pushes the boundaries, rather than just copying what's there already, and adding a few window dressings.


What kind of (lower level) surrogate metrics would you be interested in that could translate to '3am phone calls' when comparing messaging systems?


Being used by at least one company of significant size that (a) i've heard of and (b) isn't directly connected to the project would be a good start.


'Directly connected' to a project might mean a user of - but I assume you mean a major contributor to (as even small time users of free software often contribute something - bug reports, feature requests, code contributions, money, etc).

The page: https://pulsar.apache.org/powered-by/ suggests there's quite some number of corporate users who are happy to confirm they use this suite. I don't know how many of those you've heard of, though.

I suspect many private & government agencies around the world would decline to formally attach their name to any list like this, lest it be (mis)interpreted as an endorsement.


I've heard of Comcast, but that's Yahoo. Not heard of the others.


I’ve enjoyed using Pulsar but ZooKeeper... arghh. It’s an excellent component but a pain to manage.

Looking forward to trying Kafka again when they finally remove ZK


What difficulties have you had with Zookeeper and Kafka? Zookeeper can be difficult mostly because developers don't understand it very well. But in the case of Pulsar/BookKeeper/Kafka the usage of Zookeeper is very minimal, so its main constraint (performance) is mostly mitigated. Availability- and management-wise, Zookeeper 3.5+ is actually pretty great. You do need to understand dynamic ensemble management, but it's really a small price to pay for its rock-solid nature. Stuff like etcd is getting close these days, but it took 3 protocol versions and tons of bugs, performance and scalability problems for it to get close to ZK.


I really like the architecture of Pulsar, it is very elegant with clear separation of duties between components.

That said, we have used Kafka for so long, and it works well enough at our scale that we have no reason to even test Pulsar. Kafka also has much more integrations with other tools due to its popularity.


Can anyone share their thoughts on whether, for a new project, it's worth starting with Pulsar instead of Kafka as a distributed log/pub-sub solution with guaranteed delivery? I've heard a lot of stories about Kafka's operational complexity, and TFA seems to say that Pulsar has lower operational upkeep (i.e. less manpower needed to keep it running).


Full disclosure - I am co-founder of https://kesque.com . We provide Pulsar as a managed service and different tiers of SaaS plan. My opinion can be biased.

For a new project, you should try to document different aspects of requirements. Is it data streaming, queuing, or both? What's the data retention policy? Message rate? How many consumers and producers? Any inbound or outbound integration with 3rd party destination (i.e. S3, Flink)? Both Kafka and Pulsar have so many features to offer. It is not a simple task to pick one vs another. If you ask for guaranteed delivery, both will satisfy that requirement. A level up question would be who can guarantee in-order delivery.

Managing Kafka and Pulsar requires knowledge. I do not think any of these durable messaging systems is maintenance-free (or the industry is not there yet). Any reliable distributed system is complex out of necessity. These systems more or less require a log consensus algorithm to achieve high availability. They all use either ZooKeeper or one of the Raft implementations, requiring multiple nodes to perform leader election. This is common in all distributed architectures (Kafka, Pulsar, Cockroach, etcd...). I would attest that Pulsar can be administratively simpler than Kafka, because of the separation of broker and bookkeeper (the data persistence layer). But this does not mean any dev-op without knowledge can proficiently manage the cluster. We use Kubernetes/Helm to manage all of our Pulsar clusters. I would not credit Pulsar alone with low operational upkeep; it is the combination of Kubernetes, Helm, in-house tools, and engineering knowledge that lowers the operating cost.


If you are worried about operational complexity and don't know enough to get a clear set of requirement and tradeoffs beforehand, just start off with a single rabbitmq node. By the time you outgrow it, you'll have hired someone who'll be able to make the call for you.


This is likely the best advice in practice as the complexity kafka introduces really only becomes apparent at scale. We ran managed services for our primary production in the 50k+ RPS range that needed a constant stream of tweaking. On the other hand, the isolated EU cluster was self hosted and ran without incident or intervention for 18 months at 1-3k RPS.


The good: to have 100% reliable Kafka, your data needs to be replicated by a factor of 3 (you will need to transfer and store 3 copies of all your data, both in transit and at rest). Pulsar will happily do with just 2x. Next, thanks to the "failover" subscription type and real-time producer deduplication, high-availability consumers are easier/cleaner to build on Pulsar. On Kafka, you have to fiddle with consumer groups and be very aware of the partition count of your topic.

Now with that said, these are the current downsides of Pulsar as recently perceived by me:

- More complex architecture (in the Kubernetes context): zookeeper + broker + bookie + proxy + autorecovery [+ bastion/prometheus/grafana] vs Kafka's zookeeper + broker (they are working on removing zookeeper now).

- Pulsar has extremely active development, but not that many active developers or that large a community, a large number of active bugs, and the project is far less mature than Kafka; the documentation body is much smaller and worse, and the respective Stack Overflow (etc.) knowledge base is much smaller than Kafka's.

- My experience with Pulsar's deployment to Kubernetes was that it wasn't ready. E.g. a lot of startup synchronization is done by k8s yaml-embedded startup scripts and is extremely brittle and in some cases broken (you have to manually restart the proxy pod on startup, etc.).

- Due to the metric exposure on Kafka, I find some critical devops scenarios (e.g. handling a full node loss) more transparent and somewhat more approachable than with Pulsar, which appears more complex and more of a black box to me in this respect at this time.

- One big advantage often brought up in the Pulsar vs. Kafka comparison is that Pulsar has active shard rebalancing, but to my knowledge Kafka now has something like that too, albeit maybe not as dynamic as Pulsar's; not sure.

So to summarize, Pulsar is great, but actively developed and quite complex given the size of its (serious) active user base and developer base. I think the active user base is the main driving force of development, so it's a chicken-and-egg issue which simply takes its own time to resolve. We have to realize that Kafka is a 10-year-old open source Apache product, thus very mature, and that's why I'd recommend it over Pulsar at this time for new projects (which need to reach production quickly).


> Pulsar will happily do with just 2x.

This is just wrong. Pulsar provides weaker guarantees than Kafka here; it's a quorum-based system. If you run with two replicas, Pulsar can't provide the F-1 guarantees which Kafka can.
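Simplified durability arithmetic behind this disagreement: in both systems an acknowledged write is guaranteed on some minimum number of copies, and losing that many nodes at once can lose acked data. This is a back-of-envelope model that ignores availability trade-offs, which differ between the two systems.

```python
def tolerated_permanent_losses(min_copies_of_acked_write):
    # acked data survives as long as one guaranteed copy remains
    return min_copies_of_acked_write - 1

# Kafka: replication.factor=3, acks=all, min.insync.replicas=2
print(tolerated_permanent_losses(2))  # 1

# Pulsar "2x": ensemble=2, write quorum=2, ack quorum=2
print(tolerated_permanent_losses(2))  # 1

# Kafka with min.insync.replicas=3 (RF=3) tolerates 2 losses, at the
# cost of rejecting writes whenever any one broker is down.
print(tolerated_permanent_losses(3))  # 2
```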


The claims in the article about lower operational upkeep are pretty disingenuous. They only talk about things that are easier in Pulsar than Kafka, not about the complete picture at all. I'd largely ignore the article and go look at how to run both independently. As another commenter posted, they both require ZooKeeper to run (there are efforts to remove ZK from Kafka, but until that's proven you should assume you need ZK), so the operational upkeep is going to be very similar.


Because Pulsar and Kafka both use Zookeeper, the operational complexity will largely be the same. The best thing you can do to lower op complexity is using managed Kafka by Confluent or AWS (something similar probably exists for pulsar).



Yeah, was just gonna suggest any one of AWS's queue services (SQS, Kinesis, or MSK), or GCP's (Cloud Pub/Sub, managed Kafka). A quick search yields Azure Queue Storage(?) as a potential hosted option.


On the issue of delivery ... can Pulsar users here talk about exactly once message delivery from the consumer side?

I use Kafka and needed a cache based on incoming events which were partitioned.

But if the consumer crashes, it's not easy to pick up from exactly where it left off without manually committing offsets, which hurts performance. There's also some hand-wavy Kafka gossip that it's hard to commit offsets right. Further, suppose a task consumes events from topic A partition 3 and produces events to topic B. Again, if the task crashes it's not clear what needs replaying in order not to lose messages (on write) or miss messages (on read).
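The resume-after-crash problem I mean looks roughly like this sketch: store the last applied offset atomically with the derived state, and skip already-applied events on replay. The dict here stands in for a transactional cache/DB; real code would commit both fields in one transaction.

```python
def apply_events(events, store):
    # store = {"offset": last applied offset, "state": running total}
    for offset, value in events:
        if offset <= store["offset"]:
            continue  # already applied before the crash; skip on replay
        store["state"] += value
        store["offset"] = offset
    return store

store = {"offset": -1, "state": 0}
apply_events([(0, 5), (1, 7)], store)           # crash after this point
apply_events([(0, 5), (1, 7), (2, 3)], store)   # replay from the start
print(store)  # {'offset': 2, 'state': 15} -- each event applied once
```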

Any insight here is greatly appreciated.


Unlike Kafka, and despite some unfortunately misleading wordings in articles and documentation pages, Pulsar doesn't actually support exactly-once aka 'effectively-once' semantics, because it lacks support for transactions. It only supports an idempotent producer combined with message deduplication. The current functionality only works when producing one message at a time, and to only one partition. For example, you cannot atomically produce multiple messages to one partition with Pulsar today, let alone to multiple partitions.

See this Dec 2019 presentation by Pulsar committers, where they explain all this in more detail, i.e., the lack of transactions and the resulting limitations, and the motivation for adding such transactions to Pulsar. The approach looks very similar to Kafka's. https://www.slideshare.net/streamnative/transaction-preview-... The original ETA for transactions was Pulsar v2.6 (June 2020), but as of today there's still quite some work to be done (https://github.com/apache/pulsar/issues/2664). The latest ETA seems to be around the end of the year.

The key difference for an end user is that Kafka released all the functionality in one go back in 2017 (idempotent producer, transactions; which fwiw also explains why designing+building+testing took the Kafka community that long) so it has been much easier to understand what is actually supported vs. what is not.


Thanks. I'm gonna recheck Kafka then.


PS: Not fully sure what could have caused your Kafka woes. Certainly all what you described is supported, and it also 'should' normally be easy to use as a user/developer.

For example, with Kafka Streams, any app you build with it just needs to set “processing.guarantee” to “exactly_once” in its configuration, and regardless of what happens to the app or its environment it will not lose messages (on write) or miss messages (on read) from Kafka.
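Concretely, a minimal Streams configuration with that setting might look like the fragment below; the application id and broker address are placeholders:

```properties
application.id=my-caching-app
bootstrap.servers=broker:9092
processing.guarantee=exactly_once
```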

Consider asking your question with a few more details in the Kafka user mailing list [1], or in the Confluent Community Slack [2] if you prefer chatting.

[1] https://kafka.apache.org/contact [2] https://launchpass.com/confluentcommunity


One very specific feature where Pulsar is shining is that you don't need to explicitly create topics [1]. It doesn't seem like much but is very powerful. At least for a CQRS subscription architectural pattern I'm working on at the moment.

Say you have a front-end dealing with the clients in a streaming manner (be it websockets or SSE). All front-end instances send messages to a topic on a messaging system. Processing is done with Flink or Spark, but now you need to get some answer back (or publish regular updates) to the client; so you push it to another topic on the messaging system. Works fine if you have a fixed and low number of front-ends; they pull everything and select messages for their clients. If you have more front-ends you want to have them pull only the messages destined to their clients. You might want to use Kafka partitions to do this, but it is kinda clumsy.

Furthermore if you need to scale the front-end, you'll have to reassign a partition scheme to all the front-end instances while they continue to cater to their specific clients. On top of restarting the Flink/Spark processing to fit the new partition scheme. I don't know of a simple way to do that with Kafka.

In Pulsar, the problem becomes _much simpler_: have each front-end choose a UUID that represents it, send it as part of the messages, and interpret it as a return address. The processing then pushes out to topics like: persistent://domain-x/app-y/back-to-clients-<uuid>. Done. No need for repartitioning or topic creation.
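A sketch of this return-address pattern; the tenant and namespace ("domain-x", "app-y") follow the example above, and the helper name is made up:

```python
import uuid

def reply_topic(instance_id, tenant="domain-x", namespace="app-y"):
    # build the per-instance reply topic name
    return f"persistent://{tenant}/{namespace}/back-to-clients-{instance_id}"

instance_id = uuid.uuid4()  # chosen once per front-end instance
print(reply_topic(instance_id))

# Each request message carries the UUID as its return address; the
# Flink/Spark job publishes answers to that topic, which Pulsar creates
# on first use -- no admin step, no repartitioning when instances scale.
```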

Other than that, the pros are: the Key_Shared messaging mode [2], worth looking at; and you also get some message acknowledgement features. The con is deployment, which is quite involved.

[1] https://pulsar.apache.org/docs/en/concepts-messaging/#no-nee...

[2] https://pulsar.apache.org/docs/en/concepts-messaging/#key_sh...


I don’t understand this, you don’t have to explicitly create Kafka topics unless you configure this to be a requirement.

I’ve implemented exactly what you talk about using dynamic topics in Kafka and it was trivial.

Maybe I’m missing something?


I'm lost too. Kafka auto creates topics by default. Maybe you're referring to being able to create more topics? But that seems to be unproven. Kafka's limit is metadata and Pulsar is more metadata dependent than Kafka.


Is there any sort of 'single node' version of these frameworks? I'm very interested in building event-driven solutions, but I don't need the scale offered by Kafka/Pulsar, and I really don't want all the complexity. Is there any reason nobody has made a smaller, less distributed event-centric DB?


Redis Streams is a pretty good lightweight replacement for Kafka. Of course, what Kafka gets you is durability, and being able to pick among at-least-once, at-most-once and exactly-once delivery of messages, among other things.


Thank you, I'll check it out


You can run kafka on a single node no problem.

Give this a try: https://www.digitalocean.com/community/tutorials/how-to-inst...


Single node Kafka still needs Zookeeper, though.


Oh cool, thank you!


Pulsar does offer a standalone mode. It has a vertical stack of Pulsar broker, bookkeeper, and zookeeper in one process. The standalone mode also comes in a single docker image.

However, the complexity you refer to is essential in a reliable messaging framework. Use of ZooKeeper or any log consensus algorithm requires multiple nodes (3 or more) to achieve the durability and high-availability goals. It is out of necessity; this is essential complexity.

There is another messaging framework called Nats.io. It is not persistent, so it is architecturally relatively simpler. You might want to investigate it.


Postgres can be used as an effective event centric database and pub/sub queue with NOTIFY and LISTEN


Pulsar offers a standalone mode, which runs everything inside a single JVM and you can just backup the data files. Pulsar also can be made to run with multiple components, but not distributed.


NATS Streaming is worth checking out, or if you don't care about the streaming part, nats/zeromq should be simple enough (simple pub/sub infra tools).


Redis?




Another service that "needs" zookeeper. Has anyone figured out a simple way to manage zookeeper for a small team?


I keep hearing this, and I've had this impression myself. But I've been running a small ZK ensemble for two years just to support a Kafka cluster. Between them, ZooKeeper has been much less trouble, almost zero problems.


What problems have you had with it?


It's just more complicated than it needs to be, especially since I have a hobby project with tiny usage.


Nope. It sucks. RIP.


Mentioning "Fortune 100 companies" in the first paragraph is a red flag of enterprise BS for me. Not sold.


> two of the most favored messaging systems on the market

Give me a break, I'd literally never even heard of Pulsar until this article popped up.

Of all messaging systems I would have thought Kafka vs SQS, or even RabbitMQ at the very least


For people currently interested in building an event based architecture, pulsar is definitely a very well known option.


I am squarely in the "people interested in building an event based architecture". Not a CS background, but know tech decently. I typically know the names of more of these apache projects than most people I've talked to IRL (though ostensibly I'm not part of the tech elite). Yet pulsar I only came across on HN a week back. And nothing changes due to the knowledge. It's yet another apache product which has great design but needs either a genius or an army of devops to deploy. Like even if I figure out what the hell this thing is, I'm then tasked with figuring out what the hell zookeeper is (like for real, I'll buy a beer for someone who can successfully ELI5 wth zookeeper is. And also pig. Or impala).


If you can't understand what Zookeeper is, I'd recommend reading Martin Kleppmann's book Designing Data-Intensive Applications (https://dataintensive.net/).

You don't need a CS degree to work in this field (I don't have one either!) but there are fundamental concepts you need to understand in order to make informed decisions when designing distributed systems.


You don't need a CS degree or Martin Kleppmann's book to work out it's a GPITA.


I am also not a fan of Zookeeper but I have come to respect it. In defense of Zookeeper, distributed systems are a GPITA. If you need to select some components why not go for ones that are solid [1].

[1] https://aphyr.com/posts/291-jepsen-zookeeper


I'm no expert, but my understanding is that pig is a combination of

- a language for specifying data transformations, and

- an engine to compile programs written in that language into mapreduce jobs to execute on a hadoop cluster

it was designed to easily map some common functional and SQL idioms (e.g. filter, group by w/ aggregation functions) to parallel execution for processing huge amounts of data.

Impala is another big data project that is an engine for planning and executing SQL on data stored in a hadoop cluster.

Zookeeper is... black magic??


> Yet pulsar I only came across on HN a week back.

Same - was it the comment saying nobody chooses Kafka anymore? I was surprised by it.


If you search you will find that there are always the same 4-5 HN users who comment on pulsar/kafka topics arguing that Kafka is not a viable solution anymore. Even though almost everyone is still using Kafka


Meanwhile the Confluent Kafka conference seems to double in size year after year. And AWS chooses to offer Kafka as a service, rather than Pulsar. These "<Solution> is dead / not viable anymore!" comments are pretty rampant in the Big Data space. Some are true though but it still makes me roll my eyes.

These people seem to be invested (quite literally) in Pulsar disrupting Kafka's position in the pub/sub & event streaming space. This isn't to say Pulsar can't someday be superior, though. Time will tell. But Kafka has such a massive lead and it fits the needs for most people who need to do data integration that I don't see the need for everyone to uproot their solutions in favor of Pulsar.


Almost everyone still uses Microsoft Windows.

But almost everyone has heard of GNU/Linux or Apple OSX.


pulsar was seeing quite a bit of hype back in ApacheCon 2019 already


Apache Pulsar should not be confused with another quite popular Pulsar:

Event driven concurrent framework for Python https://github.com/quantmind/pulsar


I love SQS but it isn't quite comparable to Kafka because it is poll-based. It's a solid queue but not suitable for streaming.


My thoughts exactly.


I still can't believe Kafka doesn't have a good open source GUI management tool.


This is great: https://github.com/tchiotludo/akhq, using this currently in our environment. There's also Kafka Manager which has been around for a while.

I still recommend managing Kafka through the CLI tools however.


What's wrong with CMAK?


How does it compare to https://docs.prefect.io/ ?


Prefect is a workflow (particularly dataflow) orchestrator. Pulsar and Kafka are general-purpose distributed streaming engines.


ah, yeah. totally forgot. sorry. i deal a lot with workflows and imagery mixes them up for me.


There are so many pubsubs. It seems like every company and framework has their own.


Kafka is the furthest thing from a library that I can even remotely think of.


I’d hesitate to call Kafka a library...



