Why Messaging Queues Suck

The word around the water cooler is that a queue has yet to be created that I don’t like. Whether it’s RabbitMQ, AWS SNS/SQS, or Google Cloud Pub/Sub, regardless of the implementation, I love queues to death, gobble, gobble...I’ll eat ‘em up. I mean, what’s not to like?

Not too long ago, at a due diligence review, I was presenting my idea for a mission-critical enterprise architecture. The Pub-Sub pattern played a critical role in my thinking. I did my dog and pony presentation, and things seemed to have gone swimmingly. Then, later that day, one of the attendees stopped by my desk and told me, “I like your thinking, but I gotta tell you, I hate queues. I think they suck.”

I was dumbstruck. My world shook. I felt as if I were a five-year-old who had just been told there was no Santa Claus, and I could not imagine a world without Santa Claus.

My immediate reaction was to flip the bozo bit and dismiss his comment as one made by a guy who had no idea what he was talking about. But I knew his background. He was no dope, and he had a boatload of experience. He’s worked in telecom for a very long time, on very large systems. Given his background and expertise, I’d be dumb not to consider his position. Going against every impulse I had to defend my ego, I said, “Oh, why?”

And he told me.

Using a Queue is Lazy

“Basically, using a message queue to facilitate interservice communication is lazy,” he said. “You should just have one service send an HTTP POST to the other service that wants the information. For a little more work, you get a lot more bang. Let me show you on the whiteboard.”

Figure 1 shows what he drew.

Figure 1: A typical Pub-Sub pattern using a message queue subscribed to a topic.

“In a typical Pub-Sub pattern you have a service that sends a message to a topic (1). If there are no queues subscribed to the topic, the messages accumulate, eating up storage resources. Yes, you can configure a topic to delete a message after a time, but still, the topic is responsible for storing the message.

“Luckily, in the case I’ve diagrammed, we have a queue subscribed to the topic. The topic could be one that fills quickly, such as real-time stock transactions for a brokerage firm. The topic will send a copy of the message to the subscribers it knows about (2), for example, the brokerage’s accounting system as well as another system belonging to the brokerage’s official auditor. Then, once all subscribers have received the message, the topic will flush it.

“Now we have the message sitting in the queue, waiting for the service bound to the queue to pull it (3). That’s a lot of work. Not only do we have to devote resources to getting the message from the publishing service to the ultimate consuming service(s), but the consuming service has to create the queue and then subscribe the queue to the topic.”
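To make the two sides of his argument concrete, here is a minimal sketch of both paths, assuming AWS SNS/SQS via boto3 and the `requests` library; every ARN, URL, and function name below is a placeholder for illustration, not anything from the actual system under review.

```python
import json

import boto3
import requests

sns = boto3.client("sns")
sqs = boto3.client("sqs")

# Hypothetical resources standing in for the topic and queue in Figure 1.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:stock-transactions"
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/accounting"


def publish_via_topic(transaction):
    """Pub-Sub path (1): publish once; the topic fans a copy out to every
    subscribed queue (accounting, auditor, and so on)."""
    sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps(transaction))


def consume_from_queue(process):
    """Pub-Sub path (3): the consuming service pulls from its own queue.
    Assumes raw message delivery is enabled on the subscription."""
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        process(json.loads(msg["Body"]))
        # Acknowledge by deleting; otherwise the message reappears after the
        # visibility timeout.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])


def post_direct(transaction, consumer_urls):
    """The colleague's alternative: POST the payload straight to each service
    that wants it. No topic, no queue, but the producer now owns fan-out,
    retries, and knowing who the consumers are."""
    for url in consumer_urls:
        requests.post(url, json=transaction, timeout=5).raise_for_status()
```

The rest of this page, including the comments below, is essentially an argument about which of these two shapes is cheaper to own.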

Bob Reselman Bob Reselman is a nationally known software developer, system architect and technical writer/journalist. Over a career that spans 30 years, Bob has worked for companies such as Gateway, Cap Gemini, The Los Angeles Weekly, Edmunds.com and the Academy of Recording Arts and Sciences, to name a few. Bob has written 4 books on computer programming and dozens of articles about topics related to software development technologies and techniques as well as the culture of software development.

Comments (24)

Akslp2080

This is an interesting article. I recently saw this approach implemented by another team in my organization. I personally questioned it: they were exposing an API to receive information and, for the case of processing it in parallel, they were using a queue.

I just wanted to highlight one more benefit of having a queue, which is that the consumer does not always have to be highly available and hence can consume the message as soon as it is able to.

How would we solve that problem? Or is it for us to decide on the trade-off between "the cost incurred by having a queue" versus "the importance of delivering the message"?

reselbob

Yours is an excellent point. I would have done well to mention the importance of load balancing when implementing a direct-to-API architecture. I have had situations in which message queues have been unavailable. It's rare, but it does happen. Nonetheless, when implementing a highly available API, proper load balancing is essential.

Thanks for the insightful comment.

fedemoya

What about performance? Let's suppose we have 1 producer and n consumers. In the pub/sub model the producer sends just 1 message. In the other model, the producer has to send n messages. If the producer is a web server, the time it takes to handle a request depends on the number of "subscribers" it has, doesn't it?

reselbob

Another excellent point.

To my thinking, it is true that in a pure pub-sub pattern the producer needs to send but one message to the topic, and the message management component, AWS SNS/SQS for example, takes care of copying the message and distributing it to the queues. So there is an apparent labor savings. However, there is added labor on the part of the producer to make sure that queues subscribing to a topic are indeed authorized to use that topic. So the labor you save by letting the topic manage message distribution is offset by the need for the producer to manage security on the topic.

In my real-world experience, where we did have a topic publishing to queues, some of which were late bound, we had to devote some labor to creating a way to register the late-bound subscription queues to the topic and then do the required verifications. Looking back, it would have been easier simply to publish to a URL declared by the party wanting the data.
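As an illustration of what that registration labor can look like, here is a minimal sketch of subscribing a late-bound queue to a topic, assuming AWS SNS/SQS via boto3; the ARN and URL arguments are placeholders, not values from the system I described.

```python
import json

import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")


def register_late_subscriber(topic_arn, queue_url):
    """Subscribe an existing queue to the topic and authorize the topic to
    deliver to it -- the 'required verifications' end up as an access policy."""
    queue_arn = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]

    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "sns.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
        }],
    }
    # Allow the topic to write to the queue...
    sqs.set_queue_attributes(QueueUrl=queue_url,
                             Attributes={"Policy": json.dumps(policy)})

    # ...then register the subscription itself.
    sub = sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)
    return sub["SubscriptionArn"]
```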

Still, you have a good point. Your assertion that at some point one might be reinventing a topic architecture is very valid.

Thanks for your insightful comment.

Hannes

Hello, just a short comment from my side: think of the pub service going down. After coming up it will be hammered with registration requests. That might not be a problem with a few applications. But I imagine that in enterprise situations the number of subscribers can be huge (there will be multiple instances of each consuming app).

reselbob

Good point, Hannes. As you wisely point out, when implementing a direct-to-API architecture, significant attention must be paid to fault tolerance, instance reconstitution, and subsequent consistency.

I admire your thoughtfulness.

nkuehn

From the cost and scale perspective of a financial services project it's actually a valid consideration to omit the communication infrastructure and move more logic into the publisher.  

But keep in mind that implementing a robust message publisher that has all the qualities a message queue has is no easy task.  

The discussion is assuming a 100% available consumer that never throws errors or fails or becomes unavailable or overloaded (=not enough instances).  

So the producer supposedly has no issues like exponential retry (which requires persistence on the sender side, with costs like those described in the article), dead-letter handling and logging, delivery-status persistence across multiple subscribers, and so on.

If every consumer uses a message queue or other expensive facilities (as described), that's okay to assume. Otherwise, well, you put a message queue in between or let the consumer pull/poll from the beginning.

You don't get around having a persistent buffer on one end or the other. So the cost argument your guy made is actually just "the consumer needs to pay the cost, not the producer." But in a distributed system someone has to pay the cost. And I doubt that implementing and maintaining the logic over and over in every service is more cost-effective than stateless services with many queues.

A relevant topic I would add to the discussion is that queues systematically guarantee "at least once" delivery. So to be safer transactionally, the consumer needs to be idempotent. A stateless "intermediary" consumer that is not a system of record can just pass the duplicate messages through, leaving idempotency to the first system of record in the chain. But at some point, again, you need storage.
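A minimal sketch of that idempotency point, assuming at-least-once delivery and using SQLite to stand in for whatever durable store the first system of record already has; all names here are illustrative.

```python
import sqlite3

conn = sqlite3.connect("processed.db")
conn.execute("CREATE TABLE IF NOT EXISTS processed (message_id TEXT PRIMARY KEY)")


def handle(message_id, payload):
    """Apply the message exactly once, even if the queue delivers it twice."""
    try:
        with conn:  # one transaction: record the ID and apply the change together
            conn.execute("INSERT INTO processed (message_id) VALUES (?)",
                         (message_id,))
            apply_to_system_of_record(payload)
    except sqlite3.IntegrityError:
        pass  # duplicate delivery: already processed, safe to drop


def apply_to_system_of_record(payload):
    print("recording", payload)  # placeholder for the real write
```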

J-S

First of all, it is refreshing to see the comment section in this blog. It's so civilized, it doesn't look like the Internet.

I'm still not convinced by the reasons explained here, and I have some questions:

- Why would the cost of securing access to the topic be higher than the cost of securing the service directly? From my understanding, each producer service would have to build its own security mechanism, is that right?

- With direct service calls, you have to make sure that the consumers are always ready to listen, incurring more cost for availability. In a queue scenario, you could have a single server for the accounting system, deploy new versions at any time during the day, and not lose any information.

In the direct-call scenario, you would have to add logic to each producer service to retry failed messages, or some similar mechanism.

- Using queues, you can have different throughput for each system, depending on its importance and needs. Why would the cost of the queue be higher than the cost of scaling all consumers to support the producer's throughput?

I agree that a queue might also be down, but I believe it is much easier to keep systems running that change less often (databases, queues) than systems that you want to change as fast as possible (your services).

Does it make sense?

reselbob

Yes, your assertion does make sense. The points you raise are valid and valuable.

Brock-Weaver

In my mind, where this consumer-oriented message queueing technique breaks down is when the consumer endpoint is unavailable.  If there must be guaranteed delivery (such as with "traditional" message queue services), you basically have to implement at least a half-hearted queue within your system to minimize the effect of the consumer's endpoint being unavailable.  And how often, if at all, do you retry those messages?  Where are they stored?  When are they purged?  etc.

The implicit assumption here is that the "traditional" queue service's downtime is minimal/negligible, that messages can always be delivered to it, and that it handles all the gory details of message retrying/aging/versioning/etc.

I would think it really boils down to requirements -- if guaranteed delivery is a requirement, this approach seems like a non-starter.

Brock-Weaver

This approach falls down when guaranteed delivery is a requirement (as it almost always is in business systems, at least); any time a consumer's endpoint is down, you would have to store that message for later retry, once per consumer endpoint that was down.

i.e. you would essentially be re-implementing a "traditional" message queue within your system to meet that requirement, e.g. storing/aging/purging/versioning/etc. This of course assumes a "traditional" message queue is reliable enough that you don't have to build this into your system in the first place.

I guess what I'm saying is -- there's no silver bullet that works best for everyone.  There is no "right" or "better" way that applies to everyone.  It all depends on what your app is trying to do, and your acceptable tolerance for data loss when the app inevitably fails at some point in time.
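For concreteness, here is a minimal sketch of the "half-hearted queue" described above: a producer-side outbox that stores, retries, and eventually purges messages when a consumer endpoint is down. SQLite, the table layout, and the purge policy are illustrative assumptions, not a prescription.

```python
import json
import sqlite3
import time

import requests

db = sqlite3.connect("outbox.db")
db.execute("""CREATE TABLE IF NOT EXISTS outbox (
    id INTEGER PRIMARY KEY, endpoint TEXT, body TEXT, created REAL)""")

MAX_AGE_SECONDS = 7 * 24 * 3600   # purge policy: when is a retry no longer useful?


def send_or_store(endpoint, payload):
    """Try the consumer's endpoint; if it is down, keep the message for later."""
    try:
        requests.post(endpoint, json=payload, timeout=5).raise_for_status()
    except requests.RequestException:
        with db:
            db.execute(
                "INSERT INTO outbox (endpoint, body, created) VALUES (?, ?, ?)",
                (endpoint, json.dumps(payload), time.time()),
            )


def retry_outbox():
    """Run periodically: redeliver what we can, purge what is too old."""
    now = time.time()
    rows = db.execute("SELECT id, endpoint, body, created FROM outbox").fetchall()
    for row_id, endpoint, body, created in rows:
        if now - created > MAX_AGE_SECONDS:
            with db:
                db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
            continue
        try:
            requests.post(endpoint, json=json.loads(body),
                          timeout=5).raise_for_status()
        except requests.RequestException:
            continue  # still down; try again on the next pass
        with db:
            db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
```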

vkgfx

Copying my comment from the reddit submission here, in case you don't follow reddit:

The main point seems to be that you're tying up your messages in a distributed queue and paying heavily for them. I don't think that's true. The article seems focused on AWS and cloud hosting providers in general ("Most enterprises that operate at large scale do not roll-their-own in terms of topic/queue hosting. They use a service provider such as AWS. Every topic in play costs money and the storage on those topics cost money."), so I'm going to address it from an "is this true in AWS" perspective. Most of the things I'm saying will probably be somewhat or equally applicable to providers like GCE, Azure, etc. Here's my take:

First of all, SQS costs $0 / GB for storage. So saying "and the storage on those topics cost money" is automatically not true.

SQS is between $0 and $0.09 per GB for transfer, depending on how much data you're moving in a month. I'll do some napkin math, assuming 1GB = 1000MB, 1MB = 1000KB, and 1KB = 1000 bytes, to keep the orders of magnitude simple. But let's say you get up to 10TB of data in there in a month. SQS charges $0.09/GB after 1GB up to 10TB. So that is (10TB - 1GB) * $0.09, or roughly 10TB * $0.09/GB = ~$900 for that month. How many messages is that? Well, it depends on the size of the messages. Let's say you use the maximum size per message of 256KB. 10TB / 256KB = 10,000,000,000KB / 256KB ≈ 40 million messages. Now let's say you're looking at 10KB messages, which are more realistic for small message pub/sub systems in my experience. Now you've handled 1 billion messages for that cost.

So you have handled **40 million to 1 billion messages** and you haven't even gone into the 10TB+ tiers that lower the cost.

**BUT WAIT THERE'S MORE**, all those costs I just described are data transfer *out of AWS*. It's not always true, but if he's building his whole system on an AWS stack, any competent dev will try to have EC2 instances in the same region as their connecting SQS queues. So how much do those 1 billion messages cost in transfer in this case? **$0**.

The last cost is API requests, priced per million. So let's say this system handles 1 billion messages in a month, as described earlier. They are handled at one request per 64KB chunk (4 per 256KB message), and SQS supports batching up to 10 messages as long as the batch is under 256KB total. So for 1 billion messages at 10KB apiece, that's 100KB per 10-message batch, chunked into 2 requests of <= 64KB each. So it is 2 * 1 billion / 10 = 200 million requests. Pricing for standard SQS queues is $0.40 per million, for a total of $80 that month. Now, if the 10TB were in messages of roughly 256KB, calculated at 40 million or so a month previously, that means you're charged for 40 million * 4 = 160 million requests, for a total of $64 that month.
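For anyone who wants to replay the arithmetic, here is a quick sanity check of the napkin math above, using the same decimal units and the prices quoted in the comment (the prices themselves are the commenter's figures, taken as given).

```python
KB, GB, TB = 1000, 1000 ** 3, 1000 ** 4

monthly_bytes = 10 * TB
transfer_out = (monthly_bytes - 1 * GB) / GB * 0.09   # ~$900, if the data leaves AWS
msgs_256k = monthly_bytes / (256 * KB)                 # ~39 million messages
msgs_10k = monthly_bytes / (10 * KB)                   # 1 billion messages

# Requests: 64KB chunks, batches of up to 10 messages, $0.40 per million requests.
reqs_10k = msgs_10k / 10 * 2    # 10 x 10KB = 100KB per batch -> 2 chunks each
reqs_256k = msgs_256k * 4       # each 256KB message -> 4 chunks

print(round(transfer_out))               # 900
print(round(reqs_10k * 0.40 / 1e6, 2))   # 80.0
print(round(reqs_256k * 0.40 / 1e6, 2))  # 62.5 (the comment rounds up to ~$64)
```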

Now don't get me wrong, a lot of AWS managed services do cost a premium, SQS included. I also didn't factor in SNS topic costs. The submitted article described "every topic in play costs money and the storage on those topics cost money. Thus, you are absorbing the storage costs of messages until all subscribers pull the messages down." I'm not sure if he was using "topic" to refer to queues or to the SNS topics that feed them. The "topics" in the original whiteboard diagram were equivalent to SNS ones, but topics don't have any storage anyway, they just process events as they arrive. But since I went over SQS costs, I'll briefly mention SNS costs. For SNS to deliver a message to an SQS queue *is free*. So if you have 1 million queues subscribed to your SNS topic, you don't pay a cent more than you would if you had 10 subscribed. Finally, publishing messages to an SNS topic is similar to the SQS cost ($0.50 / million) and splits the 256KB chunk into 64KB ones in the same way. While the article talks about the cost of registering consumers to your SNS topic, it's not really describing it correctly. Yes, you "pay the cost" of the subscription in the form of a handful of API requests to SNS (priced per million...), and yes, you have to manage who can subscribe. However, anyone competent in IAM policy creation can manage that subscribe policy easily and in a way that can be documented and automated.

If you want to host your own message queues on each consumer to effectively buffer messages as they come in to be processed, you face the following challenges:

1. Reliability. If one of those consumers goes down, they have the message and depending on what queue they're using and how it's configured, they will be able to recover, but until then those messages are just gone. A centralized SQS queue (or any other message queue) allows you to first and foremost get trivial fault tolerance. Spot request terminated? Catch it with the instance metadata store and change message visibility to release them back to the system. Didn't catch the termination, or maybe a rat in AWS's rack chewed through your power cable? Doesn't matter, your messages will be reprocessed within the visibility timeout. (It's up to you to ensure exactly-once processing via idempotent operations of course, but that's true for most queues...)

2. Back pressure from the consuming service. Oh shit, your cool deep learning face swap web app got posted to reddit and all of a sudden your image-processing service has 1000x more traffic. Of course it can't handle this and either runs out of storage (since apparently we're using a lot of storage in our queues per the submission's discussion) or runs out of memory trying to manage its queue locally. At the very least it starts dropping requests on the floor if it's smart.

3. Decoupling of publishers from subscribers. This was hinted at but not really addressed in any way that didn't feel like a strawman. Sure, I can have my event generation source just send POST requests to all the relevant consumers, who then squirrel it away in their local queues. But now I'm managing that fan-out either programmatically (ew) or via configuration (not much more fun). Even better, if I'm scaling up and down, that list of consumers needs to be synchronized via something like ZooKeeper or via service discovery so that I'm not sending messages to non-existent consumers. And now I experience a load increase in my producer service proportional to the number of consumers, as it will have to publish every message to every single consumer. Have 1000 queues subscribed for your 100 message / sec producer? Shit son, you have to send 100,000 POST requests a second. *Or* you could just send 100 POST requests a second (to AWS) and then have the reliable system built by the geeks at Amazon fan it out to the 1000 consumers for exactly $0.00000050 per message to you.

4. Some of these things can be addressed, but you have to burn developer hours on it. One of the points of AWS and other similar providers is that you pay a premium (in my previous example, <$100 a month for 1 billion req/month, or about 380 req/second) to avoid having to address all those previous issues. Sometimes you have enough technical debt that working around them in your particular system is going to be a substantial task. So the question is, is it cheaper to pay their $0.40 per million request charges, or to pay your developers to implement all these things? Remember, your devs or ops people have to manage the queue servers, deal with downtime, etc.

Anyways, that's my (sort of) quick thoughts on this. Anyone feel free to correct me if I'm way off base in my conclusions or napkin math.

reselbob

I am glad you copied your comment over from Reddit. The details of your rebuttal are useful to ponder. As I mentioned elsewhere, the conflict created by opposing ideas on a matter is necessary. Peter Senge calls it Creative Conflict, if I remember correctly. When we engage in Creative Conflict, a new, maybe better set of ideas emerges. There is little downside to growing an idea.

Thank you for your thoughtful, detailed comments.

shamsm

In the scenario where you have m producers and n consumers, the consumers have the additional burden of registering with each producer in your proposed solution. In addition, there now needs to be support for m x n connections between the producers and consumers, as opposed to m + n connections in the pub-sub model.

KevenTheOther

The proposed architecture does not bring that many changes: it simply implies that the "broker" component will be part of services A and B, and no longer an independent service.

Moreover, it implies stronger coupling between services: A and B must be online at the same time to work; if one is down for maintenance, the other is broken.

You can avoid the downsides by implementing strong reliability patterns (circuit breaker, retry, ...) which are not required in the broker architecture.

escohoido

You don't touch on client retries at all. Or the fact that without queues, you either create a synchronous dependency on your application for your clients, or, if you intend to enforce asynchronicity, you need a mechanism to store requests locally on your clients. Now what happens when a client host dies while storing queued requests? Do you have an auditing mechanism in place to catch this?

These all go away using queues.

escohoido

There is another major reason enterprises use queues: to eliminate synchronous dependencies, and to keep code paths for failure/success the same.

Without queues, what do your clients do when the HTTP server goes down? Do they just infinite retry? You have just become a synchronous dependency. Do the clients store the pending requests locally? What happens when a client dies with pending requests queued up? Do you have an auditing mechanism to keep track of which requests were sent vs which were supposed to be sent?

We have this argument constantly internally, and solving both those problems without queues has been fruitless. Down below you mentioned queue services going down. Besides us never having had an issue with this, to think that you can make an HTTP server with availability higher than a managed queue service is a pretty lofty goal. If you contact managed queue providers they can give you more specifics of their availability, but I assure you it is much higher than anything an HTTP server can hit realistically.

escohoido

How ironic that S3 goes down (and SQS with it) just an hour after posting this reply.

"Down below you mentioned queue services going down. Besides us never having an issue with this"

scumola

What about requeueing un-ack'd messages after disconnects?

Tony-Weston

A little late in the day now, but I'll add my 2c.

If the queue you are using is durable/reliable, then your system might become partition intolerant, which is really bad.

https://codahale.com/you-cant-sacrifice-partition-tolerance/

For example, your development team may have been coding away, thinking: we do not need to worry about the network going down. Our super fancy RabbitMQ or whatever (it doesn't matter what messaging system you use) will take care of partitions... Messages will not be lost. Yay!

Then your system is NOT built with partition tolerance in mind. And, at some point in time, something bad will happen. Servers will slow, millions of messages will queue up, and stuff will end up in a state where it might take hours, or days, to flush through all the messages.

At this point, a decision will be made to purge the queues, dumping all the messages. And at the moment the 'Purge' button is pressed, you have a partition. If your code hasn't been built to handle this, it will fail. You will end up in the shit, trying to restore backups, regenerate messages, or otherwise jump through hoops to get back online, because the developers thought they did not have to worry about partition tolerance, because the queuing system handled it for them.

Beware! Partition Tolerance is not cured by message queues.

alediaferia

What if Service B is down at the time Service A needs to push to it? How does Service B recover the message?

ymolists

I think one big omission you are making is the case of services that want to subscribe to a "queue" at a later date, like a batch or offline integration system. If I want to replay all the "messages" that happened on a particular topic, I am SOL now :((