- Avdi Grimm (twitter github blog book)
- Charles Max Wood (twitter github Teach Me To Code)
- David Brady (blog twitter github ADDcasts)
- James Edward Gray (blog twitter github)
- Josh Susser (twitter github blog)
Definition of Queuing and Background Processes:
- Queuing is about messaging. Typically first in first out (FIFO)
- Background Processes are processes that pull messages off the queue and do things later.
Systems that people have used
- Poor man’s queue or database-backed queue
What do you look for in your queue technology?
- How it takes failure
- It depends on what you’re doing
- Regular tasks vs. Immediate tasks
- One worker vs several workers on several servers
- Google’s Geocoding API – Rate-limiting with cron
- Amazon’s FPS – polling amazon to get updates on payments
- Structure jobs so they are simple input/output
- Isolate your jobs as much as possible from the database
- Decouple your application from your queue – Message Passing
- Distributed == Not Dependent – If the job dies and you are broken, you’re designed wrong
- Idempotence on the job
- Messaging Queue == State Machine
- Logging/Emailing when something is unprocessed for too long
What do you send to the background?
- Payment Processing
- PDF Generation
- Billing reconciliation
- External Web Services
- Collate events into an aggregate event
- Farming out multiple jobs to multiple queues
- Separate unusual resources onto other servers
- Service-Oriented Design with Ruby and Rails (Addison-Wesley Professional Ruby Series) (Dave)
- Enchantment: The Art of Changing Hearts, Minds, and Actions (Dave)
- If you gaze into nil, nil gazes also into you (Avdi)
- hacker monthly (Avdi)
- rubythere.com (Josh)
- code for america (Josh)
- Philosophy Gym (James)
- Philosophy Bites (James)
- Ask a Ninja (Chuck)
- Ruby 1.9 (Chuck)
- RVM (Chuck)
CHUCK: Hey everybody, and welcome back to the Ruby Rogues Podcast. This is your host, Charles Max Wood. Before we get started, I wanna point out a few changes that have gone on with the podcast. Last week was not necessarily Aaron’s last week, but he’s not going to be a regular panelist on the podcast; we’ll have him on whenever he’s available. He’s busy making Ruby and Rails awesome right now, so we told him that he could do that instead. Avdi Grimm will be joining us as a regular panelist.
CHUCK: Yeah, he’s a smart guy, so we’re happy to have him. We are going to go ahead and do the introductions a little bit differently this time. I’m just going to tell you who the panelists are, and then they are going to go ahead and tell you what’s going on with them. So since Avdi is now our new regular, we’ll go ahead and start with him.
AVDI: Hey, I’m Avdi, and I blog about software and Ruby type stuff at Avdi.org. And I also have a podcast about distributed teams at wideteams.com.
CHUCK: That is awesome. He also wrote Exceptional Ruby, which is an awesome book. Josh Susser.
JOSH: Hey, I’m Josh. I’m a co-organizer of Golden Gate Ruby Conference. I’m currently a self-entrepreneur, which means you won’t get to know what I’m up to. And an old Smalltalker from way back.
CHUCK: Yeah, we don’t even know what he’s up to, so don’t ask.
JOSH: Sometimes, I wonder if I know what I’m up to.
JAMES: So do we.
CHUCK: Yeah. Another panelist that we have today is James Edward Gray.
JAMES: Hi. I’m James and I’m super excited that there could be saltwater on Mars.
CHUCK: [Chuckles] All right. And David Brady.
DAVID: Oh crap, how am I going to top the saltwater thing? Okay, hi, I’m David Brady. I own Shiny Systems. I blog at heartmindcode.com, and that’s me.
CHUCK: All right, and I’m Charles Max Wood. I’m the host of Teach Me To Code. Teach Me to Code has a podcast and a screencast. And I do a bunch of other stuff in the community, so you can follow me on twitter if you wanna find out what that’s all about.
So today, we are going to talk about Queuing and Background Processes. It seems like it’s a little bit more Rails focused this week. Does anyone have anything they wanna go ahead and start with?
JAMES: Well, should we just talk about what queue systems we’ve used in the past?
CHUCK: Sounds good to me.
JOSH: Can we start with a definition?
JAMES: Sure. Go for it.
JOSH: I always like it when James gives a definition.
JAMES: Wait, why me? You are the one who always asks for the definition.
JOSH: [Chuckles] Well, you got to learn to step up first.
JAMES: Sorry. Wait, so I should ask for work for myself? This is getting confusing.
Okay, definition of queuing and background processing. So queuing is about messaging, right? It’s about putting messages in a queue (typically a first in, first out message system) and then pulling them back out.
And background processing is how queuing is most often used. So you run a Rails application, you need to do something long or potentially expensive, you throw it into a queue and say, “Do this later,” and then you have a separate process or something pulling those messages off of the queue and doing it. That’s my best definition.
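A minimal sketch of that definition in Python, assuming an in-process queue and a single worker thread standing in for the separate background process. The job names and the handle_request function are hypothetical, made up for illustration.

```python
import queue
import threading

jobs = queue.Queue()   # FIFO: the first message in is the first one out
results = []

def worker():
    """Pull messages off the queue and do the work later, off the request path."""
    while True:
        message = jobs.get()
        if message is None:          # sentinel value: no more work is coming
            break
        results.append(f"processed {message}")

def handle_request(name):
    """Stand-in for a web request: enqueue the slow work and return right away."""
    jobs.put(name)
    return "accepted"

background = threading.Thread(target=worker)
background.start()
for name in ["welcome_email", "pdf_report", "thumbnail"]:
    handle_request(name)
jobs.put(None)
background.join()
print(results)
```

With one worker the results come out in the order the jobs went in, which is the first in, first out behavior James mentions.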
CHUCK: So background processing is kind of like technical debt, except it belongs to your app instead of your coders?
JAMES: Right and it’s immediate.
DAVID: What would have been really awesome is if Josh had asked for the definition of a queue and if James had said, “Okay.” And then five minutes later, I had provided the definition of queuing and background processing.
JAMES: That would have been better. We need to rehearse that next time.
CHUCK: Yeah, well I’m actually successfully recording this time, so you know, otherwise I can work that in.
JAMES: [Chuckles] Although we did have an amazing pre-show this time and I’m still laughing hard about…
CHUCK: Yeah, next time I’m just going to hit record as soon as I make the call, and then we can adjust.
JAMES: So Josh, was my definition satisfactory?
JOSH: Sure. Sounded good to me.
CHUCK: All right, so what systems have people used?
JAMES: I’ve used that quite a bit too.
DAVID: I’ve used RabbitMQ, ZeroMQ, bunny — which are all basically kind of the same thing.
CHUCK: I’ve used Delayed Job and BackgroundRB.
JAMES: I’ve used Resque.
JOSH: I’ve used Resque quite a bit, Delayed Job, Background Job.
JAMES: Yeah, I used that one too. It was an older system. And I even had a fork of it for a while that I used on one project.
CHUCK: I’ve also used Cron jobs. I don’t know if that qualifies necessarily as the same thing.
JAMES: Sure it does. I should throw out a definition here as well. Is anybody familiar with the term, ‘poor man’s queue’?
JAMES: Okay, so what’s that?
DAVID: All right…
CHUCK: It’s the opposite of a rich man’s queue.
DAVID: It’s the opposite of a real queue. A poor man’s queue is… and this might be just regional slang up here, but a poor man’s queue is when you set up a database table and you store a record or a job in the table, and then you have something that comes along every minute or so and reads the table to pick up a job and dispatch it out. You basically build it in the database as some persistent data store that’s just going to collect a series of sequential data, and something comes along and picks it up.
AVDI: So, Delayed Job?
JAMES: Delayed job, that’s what I was going to say.
DAVID: Yeah, Delayed Job and BackgroundRB are both poor man’s queues. And you can see that if you go and look at your Rails logs: every five seconds or so, you can see BackgroundRB querying the jobs table.
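The table-plus-poller idea can be sketched in a few lines of Python with the standard sqlite3 module. This is an illustration only, not how Delayed Job or BackgroundRB are implemented; the jobs table, its columns and the function names are assumptions, and it assumes a single worker (several workers would need the claim step to be atomic).

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'ready')""")

def enqueue(payload):
    """Submitting a job is just an INSERT."""
    with db:
        db.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))

def claim_next():
    """Pick up the oldest ready job and flag it as taken (this is a write)."""
    with db:
        row = db.execute(
            "SELECT id, payload FROM jobs WHERE status = 'ready' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        db.execute("UPDATE jobs SET status = 'working' WHERE id = ?", (row[0],))
        return row

def poll_once(handler):
    """One polling pass; a real worker would repeat this every few seconds."""
    handled = 0
    while (job := claim_next()) is not None:
        handler(job[1])
        with db:
            db.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job[0],))
        handled += 1
    return handled

enqueue("send welcome email")
enqueue("generate invoice")
seen = []
print(poll_once(seen.append), seen)
```

Every poll is a query against the jobs table whether or not there is work, which is the log noise David describes.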
CHUCK: I have to wonder then, we have all these different systems, what are the tradeoffs between them? What’s the tradeoff between say an AMQP setup like Rabbit versus a poor man’s queue?
DAVID: Poor man’s queue is trivially easy to set up. They are dead simple. You can understand it just by looking at it. AMQP is a cast iron bitch to set up. I mean, you got to get Erlang to compile. You got to get the right version of Erlang to compile, then you got to figure out how to start the Erlang server, which is going to involve reading some Erlang crash dumps because you’ve set it up wrong. And you know, you’re chasing all these things and you have all these kinds of headaches. And then with AMQP, or with RabbitMQ, you have to set up the actual queuing directions. There’s a fan-out, or there’s a distribution, or sequential point-to-point where you put things in here and take them out there.
Poor man’s queue, there’s just a table. You just stick it in a table, somebody will come along later and read it from the table and work the job. And so, with simplicity comes difficulty of scale; you are hitting the database, you are hitting the master database, you have to write to the database every time you wanna submit a job. And getting a job back out of that queue, you’ve got to read…. Well, that can be done on a …, I guess. Actually I take it back. If you wanna pick up a job, you have to have write access to the database because you’ve got to flag that job as picked up and somebody’s working on it. Don’t do poor man’s queues; they’re a bad idea.
JOSH: I wanna take exception with that.
JOSH: You’ve described that from the perspective of a particular use case.
JOSH: And there are plenty of use cases where you are calling a poor man’s queue, and I would call the database-backed queue, it’s perfectly acceptable and it’s actually really good solution for your application.
DAVID: Okay, I’m willing to accept that poor man’s queue is an improper appellation for the database-backed queue.
JOSH: But database-backed queues are, well, like you said, very simple to set up for the most part. And it’s nice that you don’t have to put another piece of technology into your stack, and if your volume is low enough, then there’s no issue. If your database is set up right, then it can be just fine. Now I think the database that you are using can have a big impact on what you are doing. MySQL doesn’t have awesome locking functionality, so there can be a lot of overhead with that and it can take longer than it should. Something interesting that I haven’t had a chance to try yet, and that looks really cool, is Ryan Smith’s Queue Classic.
JAMES: I was just going to mention that.
JOSH: Well, you got to wait in line.
JAMES: Sorry. I’m queuing up.
CHUCK: Yeah, you haven’t been picked up by the worker yet.
JOSH: I got that locked on the table. So yeah, Queue Classic uses some of the awesome PostgreSQL locking features, and pub/sub semantics, to be able to do the storage for your queue.
JAMES: Sorry, I couldn’t wait…
DAVID: Priority queue!
JAMES: Priority queue, right. I actually hacked on Queue Classic with Ryan at Red Dirt Ruby Conf, and we added signal notification to the workers at the hack fest. So I know a little bit about it and have been playing with it. And it’s kind of an interesting hybrid because it’s kind of what you guys are calling a database-backed queue and there are just jobs in a table. But as Josh said, it uses PostgreSQL’s excellent locking to do it. So when you pull a job out, even though it’s just a job in a table, you do actually have a lock on that job. And it’s very interesting using PostgreSQL locking semantics. And then it uses PostgreSQL’s notification system, so you don’t have the typical polling, checking every so often. A worker can sleep and wait for a notification, in which case PostgreSQL will wake it up and say, “Hey, time to get busy.”
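The difference between polling and being woken up can be shown with an in-process analogy. This is not Queue Classic’s code and it does not talk to PostgreSQL; it is a sketch using Python’s threading.Condition, where the NotifyingQueue class and the job strings are hypothetical.

```python
import threading
from collections import deque

class NotifyingQueue:
    """Workers sleep in take() until put() notifies them; nobody polls."""

    def __init__(self):
        self._jobs = deque()
        self._cond = threading.Condition()

    def put(self, job):
        with self._cond:
            self._jobs.append(job)
            self._cond.notify()          # wake one sleeping worker

    def take(self):
        with self._cond:
            while not self._jobs:        # nothing to do: sleep until notified
                self._cond.wait()
            return self._jobs.popleft()

q = NotifyingQueue()
done = []

def worker():
    while (job := q.take()) != "stop":
        done.append(job)

t = threading.Thread(target=worker)
t.start()
q.put("geocode 42")
q.put("stop")
t.join()
print(done)
```

The worker spends its idle time blocked in wait() rather than issuing a query every few seconds.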
And so it’s interesting in that it kind of sits on the boundary between those two kinds of queues. It’s very sophisticated, at least for a database-backed queue. So I do like it. The interesting thing about database-backed queues, the good point of them in my opinion, is that getting information about the queue is so easy, right? Just write a SQL query and, “Oh yeah, these are all the jobs that are currently in this state,” or whatever.
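As a rough illustration of that kind of introspection, here is a sketch in Python with sqlite3. The jobs table, its columns and the sample rows are assumptions made up for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, kind TEXT, status TEXT)")
db.executemany(
    "INSERT INTO jobs (kind, status) VALUES (?, ?)",
    [("email", "ready"), ("email", "failed"), ("pdf", "ready"), ("pdf", "done")],
)

def jobs_by_status(conn):
    """One query answers 'what is the queue doing right now?'"""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM jobs GROUP BY status ORDER BY status"
    ).fetchall()
    return dict(rows)

print(jobs_by_status(db))
```

This only works because every job lives in one table on one server, which is the global view David cautions against relying on.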
DAVID: And that’s a dangerous trade off too… I’ll let you finish. You’ve got the priority queue. [Chuckles]
JAMES: [Chuckles] Well, I was just saying it was easy to get information. But I pretty much agree with David that as soon as your numbers start to get real, I think database queues start to suffer pretty bad. I mean, even if you have really good indexing on them, if you are moving jobs through at a big rate, then it’s probably not going to work out. But like Josh said, it’s very simple to set up. So if your needs are modest, I’m sure you could do it.
JOSH: And I’d just say, in your development life cycle, you probably wanna start with the database-backed queue, just because it’s simple to set up, and you don’t know when you are really going to need the performance from a high volume queue anyway. Twitter started their queuing with Starling. It’s this simple little queue system that Blaine wrote in Ruby, and it was really adequate for what they did for a long time, and then they eventually had to move off it because it couldn’t keep up. You just start with the simple stuff.
DAVID: One good thing to keep in mind is that… James mentioned that a feature of a database-backed queue is that you can go in and just say, “Hey, how many? Select count from the jobs table.” And you really wanna resist the urge to do that, because the fact that you can query how many jobs are in the jobs table means that you implicitly have global perspective. You can see the whole universe, which means the queue and all the jobs are in one table on one server. I found this out the hard way when we were using Amazon’s SQS, which is another queuing service that we forgot to mention. We had this queue and I wanted to know if the queue was backing up, and how much it was backing up. And so I went in and said, all right, where in the API do I ask it how many jobs are in the queue here? And there’s no API to do that. There’s no way.
And the reason for that is because the queue is on 17 different servers. A single queue might be on multiple servers. It might go halfway through one server and then get dropped off, picked up by another and then run through there. We’ll talk about reliability later on the podcast I hope, but once we’ve got a good grasp on reliability, on the fact that you shouldn’t ever trust your queuing service, that you need to be able to allow it to die and go away, then there’s a really fast optimization you can do. A really good optimization you can do with the database-backed queue, especially at low to mid volume, is to just change the engine for MySQL… make it the in-memory engine, and now you’ve basically got an in-memory cache.
JAMES: Yeah, but the only problem there being if you shove too many jobs and it kind of starts to melt down.
DAVID: We’ll talk about reliability coming up. But you should not depend on it. [Chuckles]
AVDI: I’m curious to hear what people think is a… what people look for in a queue system when they are choosing one over another. I’ll throw mine out, which is I think the biggest thing for me is how it handles failure. I mean, the most obvious thing is that the queue server itself doesn’t fall over and die when one of the jobs fails. But beyond that, being able to see which jobs have failed, to sort of dig into the context of where and how they failed, and then to be able to say, “Okay, I fixed this condition, now I want to re-queue the jobs that have failed.” That’s a big deal to me.
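The three things Avdi lists (the worker survives a failing job, the failure keeps its context, and failed jobs can be re-queued after a fix) can be sketched in Python. This is a toy in-memory model, not any particular queue’s API; the ready and failed containers, the charge handler and the rates table are hypothetical.

```python
import traceback
from collections import deque

ready = deque()
failed = []   # each entry keeps the job plus the context of the failure

def work_all(handler):
    """Run every ready job; a failing job is set aside and the loop keeps going."""
    while ready:
        job = ready.popleft()
        try:
            handler(job)
        except Exception as exc:
            failed.append({"job": job, "error": repr(exc),
                           "trace": traceback.format_exc()})

def requeue_failed():
    """After fixing the underlying condition, put the failed jobs back."""
    count = len(failed)
    while failed:
        ready.append(failed.pop(0)["job"])
    return count

rates = {}        # the missing rate makes the first attempt fail
charged = []

def charge(job):
    charged.append((job["user"], rates[job["plan"]]))

ready.append({"user": "ann", "plan": "pro"})
work_all(charge)
print(len(failed), failed[0]["error"])
rates["pro"] = 20                 # "I fixed this condition"
requeue_failed()
work_all(charge)
print(charged)
```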
JAMES: Yeah, I think that’s definitely one of the things I care about is being able to bump those things. Lately, I liked a little introspection in my queue. I know maybe that’s not traditionally a big part of it and it’s even kind of frowned on in some scenarios, like David Brady mentioned. But when I’m looking at problems and stuff, I just find that it’s… I don’t know, I just love being able to peek inside and see exactly what it is thinking right now.
Well, I used to be primarily a Beanstalk guy; I used to use that almost exclusively. And I do really like Beanstalk because it’s so simple and stuff. The two problems I’ve had with it: when jobs do get backed up and stuff and some of them fail and then you go fix something and you want to kick them back, it’s kind of a pain in the butt if you want to just, like, re-kick this one job instead of all of them or something like that. And Beanstalk’s introspection basically isn’t there, you know? So, I really don’t like that. So these days, I tend to be more of a Resque user just because I like the introspection of the web application that I can just peek inside of.
DAVID: James, have you seen my Beanstalk utilities?
JOSH: Nope, probably not.
DAVID: I probably ought to wrap this up in a gem. If you go to GitHub.com/dbrady/bin or maybe ‘binfiles’, no, just ‘bins’, there’s a bsput, a bsget and a bstubes, and bstubes will just dump you out a table showing you all the tubes that are currently in play, how many jobs are ready to go, how many jobs are kicked or what’s the… buried, that’s the word. So I recognize that I’m being a total hypocrite. I’m not saying I don’t love that stuff, I’m just saying you shouldn’t want it. [Chuckles] But yeah, with Beanstalk, you totally want that.
CHUCK: I’m going to jump in here and answer all these questions as well. As far as like which queue or which queuing technology I’m going to use, it really depends on what I’m trying to do and what it does well and what makes it easy for me. So in a lot of cases, what you are talking about is really all there is, because all I’m really looking to do is drop some work in there and have it happen; have some worker process or something pick it up. But you know, sometimes it turns out that I need a whole bunch of stuff done across several servers, and so I’m going to pick something that can fan out the server or has some library that allows me to do that, versus something that it’s just a queuing technology and I have to write my own workers that come and check the queue. I have to manage all of the consistency myself.
The other thing that I usually consider is the case when I need something to happen, like, every ten minutes or every hour or something. I mean, this is why I brought up Cron: it’s built into Unix, it’s been around for freakin’ ever, it works like a charm, and so all I have to do is set up a rake task and tell Cron to call it. And so depending on what problem I’m trying to solve and what kinds of things I’m trying to make happen, I may choose one feature over another. Does that make sense?
DAVID: Absolutely. When I worked at … we had to take a podcast and stitch audio ads on the beginning and the end of the podcast. And this involved ripping the MP3 apart and mixing in new ads every week or every night. And we had to process the entire inventory overnight, and so this was a nightly Cron job, and we would shard it off across 100 different workers to go through. And on the other side, it was either… oh, I’m going to get the name wrong, it was either Ryan Bates or it was Geoffrey Grosenbach that did the Beanstalk screencast, and he talks about how on his blog, he wanted to process all the new posts through a … to see if they were spam.
And he wanted to do it asynchronously, but he wanted it done as fast as possible. And so he used Beanstalk to do it just as fast as he could, and he actually found out that the Beanstalk worker could take the job off the queue, send it out to a … get it back and find out that it’s ham, not spam, so it’s valid, save it in the database and approve it, all before the user’s page had finished loading. And so they would actually click submit on their post and the web server would grind and grind and grind for 3-5 seconds and the page would load and the comment would be there and it would be approved. So yeah, two totally different use cases.
CHUCK: Hm-hm. That makes sense.
JAMES: I think Cron is definitely a valid tool. It’s very useful and I do use it as part of most of my workflows; if I wanna kick something off on the server regularly, then I wouldn’t necessarily kick jobs into the queue and run it that way, I’d usually run it through Cron. You do have to be aware of some things there. Like for example, Cron will just happily fire up multiple copies of the same process. So if you have a job that’s taking a long time and you are running it every hour, once you go over that hour mark, another one will kick in and start working while the other one is still there working. So you do need to stay aware of stuff like that.
DAVID: Yeah. The first thing the job has to do is check if it’s already running and, if so, shut itself down.
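One common way to do that check is a lock file that the job creates exclusively when it starts. The sketch below is in Python; the lock path and the run_exclusively name are hypothetical, and it makes no attempt to clean up a stale lock left behind by a job that was killed.

```python
import os
import tempfile

# Hypothetical lock location, for illustration only.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "nightly_job.example.lock")

def run_exclusively(task, lock_path=LOCK_PATH):
    """Run task unless another copy holds the lock; return True if it ran."""
    try:
        # O_EXCL makes the open fail if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False                 # a previous run is still going: bow out
    try:
        os.write(fd, str(os.getpid()).encode())   # record who holds the lock
        task()
        return True
    finally:
        os.close(fd)
        os.remove(lock_path)         # release the lock for the next run
```

A cron entry would call run_exclusively(nightly_task) each hour; if the previous run is still holding the lock, the new copy returns False and exits without doing any work.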
CHUCK: Yeah, but either way, you have to be aware of its tradeoffs.
JOSH: So I have a weird Frankenstein example, or a mutant creature, I don’t know. So I’ve built stuff that was a combination of a queue system and a Cron job. My particular example had to do with geocoding, and we were using a well-known provider’s API (can I say Google? [chuckles]) to do geocoding.
JAMES: Bleep that! Bleep that!
JOSH: And it was rate limited. We had a whole bunch of data points to geocode, so we built a system that used a queue; it queued up all the work that we had to do and we were incrementally adding to it, but we had a Cron job that would wake up every so often and do some of the geocoding, pulling the items off the work queue, and we were using the Cron job to rate limit it. So there are plenty of situations I think where you want to be feeding the queue from your application, but pulling things off at a uniform rate, because there’s sort of a rate limited thing going on.
JAMES: I have pretty much a similar scenario right now with a kind of complicated payment system, where I’m using Amazon’s FPS. I don’t know if you’ve used it, but FPS is so weird in that it’s kind of this layered payment system, and I kind of think of it as ‘circles of hell’. And it’s how far down you need to go for this specific feature you want. And luckily, the only feature I needed is on the very bottom circle of hell, so I had to go all the way through the circles.
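A sketch of that shape in Python: the application appends work whenever it likes, and a scheduled function drains at most a fixed number of items per run. The names pending, drain_batch and max_per_run are assumptions, and the handler here just records the item instead of calling a geocoding API.

```python
from collections import deque

pending = deque()          # the application appends here at whatever rate it likes

def drain_batch(work_queue, handler, max_per_run):
    """Called by the scheduler (cron, say); handles at most max_per_run items."""
    handled = 0
    while work_queue and handled < max_per_run:
        handler(work_queue.popleft())
        handled += 1
    return handled

pending.extend(f"address {n}" for n in range(5))
geocoded = []
first_tick = drain_batch(pending, geocoded.append, max_per_run=2)
second_tick = drain_batch(pending, geocoded.append, max_per_run=2)
print(first_tick, second_tick, len(pending))
```

The rate limit is then just max_per_run divided by the scheduling interval, independent of how fast the application enqueues.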
CHUCK: So did Satan put you in his mouth and chew?
JAMES: [Chuckles] It was bad. Yes. And I deserve that for the dull comment last time, right?
CHUCK: [Laughs] If you haven’t read Inferno, in the lowest level of hell, Satan’s there and he has Judas and two other people in his three mouths and he’s chewing them forever.
DAVID: What if a fourth person shows up? Do they have to time share?
JAMES: So anyways, I queue these jobs, these payments, but the way FPS works is submitting a payment is almost a no-op, it doesn’t really do much. You just say, “Here, do this.” And then it’s like, “Okay, I will.” And then you need to basically DoS Amazon to find out what happened to your payment. So anyways, I queue them all up and fire them off. So I have the queue fed from the application, I enter jobs into the queue and then I have a worker for it. And all it does is pull them off and throw them to Amazon and that’s it. And I have a Cron job that kicks in about every hour, and goes through and does an update on all the payments that don’t have any kind of a final status yet. So I’m trying to figure out what’s going on with them. So yeah, there’s definitely I think hybrid approaches and things you need to do that way.
AVDI: I’m kind of interested in what are some good best practices or better practices for working with queues. I have one that I can start with, which is kind of hard-won knowledge: if at all possible, structure your jobs so that they are simple input/output processes, where all of the input that they need is in the message in the queue. And then they put their output somewhere; either they put the output on another queue or they drop it into a database table which from their perspective is write-only, but there isn’t a lot of back and forth between a database and the job once it’s running. And that has a ton of advantages: it’s easier to test; if you decide you wanna move to a different queuing system, it’s usually a lot easier to make that move; and it’s way, way easier to debug failures in the jobs if they are just simple input/output machines.
JOSH: Avdi, I wanna expand on that just a little bit because that’s a great point, but the loading up the job with the data I think could use a little clarification. I’ve done it before where you’ve got a job and you say, “I need to email a user,” so what you put in the job record is the user id, and then you fire up a worker and the first thing the worker has to do is look at the user id, and then it goes and gets the user record and pulls all the data out of that record.
JAMES: And you already screwed up.
AVDI: Right, and that’s not what I was talking about. I mean, there are certain cases where you can’t get away from referencing something else. If you are processing a big old video file or something like that, it probably doesn’t make sense to stick the file in your queue. You are going to have to stick a reference to it somewhere. But yeah, not just the user id; actually serialize that user along with the other data that is going to be used, and have the job not try to drag that stuff in. The other advantage that I forgot about: I’ve seen systems where they went to queuing for performance reasons and it wasn’t helping, because the jobs were running into the same database bottleneck that the inline page processing was running into.
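To make the contrast concrete, here is a small Python sketch of a self-contained job message. The User class, the message layout and the function names are hypothetical; the point is only that the job reads everything from the message and never looks the user up.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class User:
    id: int
    name: str
    email: str

def build_welcome_message(user):
    """Serialize everything the job needs, not just the user id."""
    return json.dumps({"type": "welcome_email", "user": asdict(user)})

def welcome_email_job(message):
    """Plain input to output: no database lookup inside the job."""
    data = json.loads(message)
    user = data["user"]
    return {"to": user["email"], "subject": f"Welcome, {user['name']}!"}

msg = build_welcome_message(User(id=7, name="Ada", email="ada@example.com"))
print(welcome_email_job(msg))
```

Testing the job then needs nothing but a string, and the message can move to a different queuing system unchanged.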
JAMES: So that’s an excellent point. I think I can give a good example of it. We did Go Versus Go for last year’s Rails Rumble. I worked on that with Ryan Bates. So what we’re doing there is you’re making moves on a Go board, and then we need to go to GNU Go in the background and get the computer’s response if you are playing against a computer player. And so, we did exactly what Avdi said; when we pushed the job in, instead of just passing, like, a game id or something like that, we actually passed all the criteria you would need. So here’s the position, here’s whose move it is, blah, blah, blah. We queued that, and then when the worker gets the job, it just goes straight to GNU Go, no query to the database needed, and says, “Please generate a move for this scenario,” and then it packages up one query to shove that back to the database.
And we did that in such a way that the query, as it updates, will only update if certain conditions are met; like there’s a where clause on it that says, “If it’s still this move for this player, that kind of setup, then here’s that move.” So that way, even if something goes weird or gets out of whack, just that one query gets issued, instead of us doing something like, “Well, is it still his move? Okay, let’s go ahead and do it,” and all that. We don’t wanna chat with the database at all because in Go Versus Go, we need to move very fast. And in that case, we actually don’t load the Rails environment and stuff. We needed to be pretty lean, so we just load a database driver and do the query manually. We only have to issue that one query, and it really turned out to be a massive win. We used Beanstalk and, like I said, the database driver for the one query. So it was a great example, and that’s what we did.
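The guarded single query can be sketched like this in Python with sqlite3. The games table, its columns and the move strings are invented for the example and are not the Go Versus Go schema; what matters is that the check and the write happen in one UPDATE.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE games (
    id INTEGER PRIMARY KEY, move_number INTEGER, to_move TEXT, last_move TEXT)""")
db.execute("INSERT INTO games VALUES (1, 12, 'white', 'D4')")

def record_computer_move(conn, game_id, expected_move_number, player, move):
    """Apply the move only if the game is still in the state the job was queued for."""
    next_player = "black" if player == "white" else "white"
    with conn:
        cur = conn.execute(
            """UPDATE games
               SET last_move = ?, move_number = move_number + 1, to_move = ?
               WHERE id = ? AND move_number = ? AND to_move = ?""",
            (move, next_player, game_id, expected_move_number, player),
        )
    return cur.rowcount == 1      # 0 rows touched means the job was stale

first = record_computer_move(db, 1, 12, "white", "Q16")
second = record_computer_move(db, 1, 12, "white", "Q16")   # duplicate or late job
print(first, second)
```

The second call changes nothing because the WHERE clause no longer matches, so a duplicated or delayed job cannot overwrite a newer position.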
DAVID: I find that there’s a wonderful parallel between this and just solid programming design. Remember when we all learned that if at all possible… you remember ‘gazintas and gazadas’? What goes into the function and what goes out of the function? All the gazintas, if possible, should be the parameters to the function. There should be no other inputs to the function, if at all possible. If you have to, go ahead and use an instance variable on that object. If you have to, go ahead and access that database. If you have to, go ahead and access a singleton. If you have to, god help you, access a global variable. But if at all possible, your gazintas should be the parameters to that function.
And the gazadas are the same way; don’t write to the database, don’t write to an instance variable, don’t mutate state. Your output is just whatever your return value is. And queuing kind of has the same thing; if at all possible, what you send in to the workers are the arguments that they have to work with, and the output is what a worker puts on an output queue or, if it’s a terminal worker that just writes its result and is done, whatever its side effect is. I love the similarity in that, because the motivations for breaking those rules are very similar in both cases.
We’re processing PDFs here, and they are big documents and they are in a store that has to be validated and verified by, like an auditing process. And I mean, actual capital A auditors, like there’s another department in the building that has to come audit because they are contracts. And okay, you know what, this worker is going to have to talk to the document store — there’s just no way around it. He’s going to have a side effect. But okay, all right, in order to test that, then whenever we bring up this worker, we have to change where he thinks the document store is. Does that kind of make sense?
AVDI: Yeah, I think that some of the things that bother me a bit about things like Delayed Job and BackgroundRB are the kind of classic leaky abstractions, where they try to make it look like, “Oh, it’s just like calling a method. It’s going to be just like calling a method, except you are going to be calling it and it’s going to be run in the background, magically.” And Rails code in general has a lot of inputs besides the direct inputs to the methods. And so I see a lot of systems where the jobs are tremendously coupled to the Rails application and to the database.
And they do a whole lot of talking back and forth, because of this attempt to make an abstraction which is probably a little too leaky. It’s like that classic RPC leaky abstraction where, seriously, we’re trying to make a method call across a flaky network link; we can’t really pretend that that’s like a normal method call; we just have to bite the bullet and say, “This is something different from a method call, it’s message passing,” which is the sort of thing that all the folks that were doing RPC figured out, and they started moving towards messaging systems rather than RPC systems.
DAVID: That’s actually the best practice that I wanted to contribute to this particular around the horn, which is if you move to a message passing architecture or distributed architecture, the first thing you have to get just hammered through your skull is distributed == not dependent. If you put a job on the queue, and that job dies or it breaks or it’s mishandled, if you are now broken or if you have lost that job, you designed wrong. When you give that job away, it needs to be fire and forget. It needs to be fault tolerant. Because even having a big queue like AMQP or ZeroMQ or one of the big Erlang ones, they’ll run away from you.
And sometimes the only way to get the server’s attention is to pull the cord out of the back of it. And now you are just lost all of those jobs. Well, have you lost leads out of your database? Have you lost your business? Have you lost money? Because there was something critical in that queue. If there was, you did not design a fault tolerant system; you did not really design a truly distributed system; you designed a big ass rpc with this thing waiting on this really flaky and really untrustworthy system to eventually maybe come back and say, “Oh, here’s your job. I’m finally done with this .” And now you can be sane again. You have completed your job and your work goes.
So my best practice that I’ve come across — and this is not perfect in every case, and it’s also got it’s tradeoffs, but I try to strife over a rule that basically says that the queue is never ever allowed to be authoritative. If I’m in charge of getting some work done, and I’m going to use a queue to do it, I set up a job and I give it to a queue, and the workers go off and it does its thing. But I think I own that job and I’m responsible for getting it done. And if enough time goes by and I haven’t heard back from that queue, I’m going to send off… I’m going to put it on a queue again. I’m going to send off another worker to go do it.
Now the danger or the trick to this is you have to have idempotence on the workers. If a worker finishes a job, he has to check the database to see, “Is this job already done? Oh, it is. Okay, I’m going to die. I’m just going to throw myself on the floor. I was too slow.” But if the first worker finishes and he looks at the job and he says, “Oh look, it’s not done. Okay, cool. Locking the table. Yup, it’s not done. I’m marking it done, I’m done.” All right, you can see there’s a problem with resource hogging there, that I’m locking the database table to ensure that this work has been done. But the advantage of this now is if your queue ever blows up, if the colo blows a transformer up and your server goes away, like what happened in Rackspace back in 2009, if shit happens, no big deal. Just tell your system, “Hey, any jobs that aren’t done, start them over.” And it doesn’t matter, you can start them 100 times and okay, your workers are going to be busy, they’re going to burn up all the CPU time, but whoever finishes that job first is going to mark it finished. And when the delayed workers finish and show up, then they’ll say, “Oh, okay. We are done.”
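A minimal sketch of the idempotent finish David describes, assuming an in-memory store in place of the database. The `JobStore` class, `mark_done_once`, and `run_worker` are hypothetical names invented for illustration, and the `Mutex` stands in for the table lock he mentions:

```ruby
require "thread"

# Hypothetical in-memory stand-in for the database table of jobs.
class JobStore
  def initialize
    @done = {}
    @lock = Mutex.new # stands in for the table lock described above
  end

  # Atomically marks the job done. Returns true only for the first caller.
  def mark_done_once(job_id, result)
    @lock.synchronize do
      return false if @done.key?(job_id)
      @done[job_id] = result
      true
    end
  end

  def result_for(job_id)
    @lock.synchronize { @done[job_id] }
  end
end

# A worker does the work, then tries to claim completion.
def run_worker(store, job_id, worker_name)
  result = "geocoded by #{worker_name}" # pretend this is the expensive part
  store.mark_done_once(job_id, result) ? :completed : :too_slow
end

store = JobStore.new
# The same job was started three times; only the first finisher wins.
outcomes = %w[a b c].map { |name| run_worker(store, 42, name) }
```

Because late finishers simply give up, starting the same job again after a queue failure costs CPU time but cannot record the work twice.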
CHUCK: Is there a system out there that does that, Dave?
DAVID: Uhh… no. Well, yeah the system that we built at Public Engines, you and I, is built to do that for the geocoding and the processing of the data.
CHUCK: Right, I guess I was aiming for a little more public project.
DAVID: No, I don’t know if anybody sane has done it, no.
JAMES: Well, a lot of queues have… Beanstalk for example has where you can reserve the job and then work on it, and then mark it as ‘done’ or it automatically goes back to the queue — which is a part of what you were talking about – but not the whole thing.
DAVID: Yeah, if you look at Beanstalk’s documentation though, there’s a caveat in there that says that even if you turn on the persistent logging, you now have to set a variable that says, “How often do I flush to disk?” If I get shot in the head, and I’m set to flush to disk every 500 milliseconds, the last 500 milliseconds of jobs are gone. They are not flushed to disk. And you can set it to zero, but even still, if you are writing your code naively, and you are just expecting Beanstalk to accept the job, it is possible for your code to try and hand off a job, and Beanstalk to die while accepting that job, and it never got flushed to disk. And if that’s a $5,000 sale (a credit card process), your accountant is going to have a problem with that.
JAMES: Yeah, I very much agree with what Dave has been saying. I just want to add to it a little. Whenever I’m using a messaging queue, I think state machine. So I put an entry in my database and it’s in some state. And so I make sure I did it in the database, and it’s in some state, then I know, “Okay, I’m here. I’ve got it. We’re good.” At that point, after it’s in the database and I know it’s in some state, I kick it to a queue, right? And that’s another state. It’s waiting to whatever. And then the queue, when it’s done its work, its job is to set things the way they need to be and transition to another state, meaning done, right? And this job has been completed and that’s taken care of.
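The “think state machine” idea can be sketched roughly as below. This is only an illustration: the `Payment` class, the state names, and the transition table are assumptions, and in a real application the state would live in a database column rather than an instance variable.

```ruby
# Hypothetical payment record; in a real app this would be a database row.
class Payment
  # Which states may follow which. Names are illustrative assumptions.
  TRANSITIONS = {
    pending: [:queued],
    queued:  [:done, :failed, :queued], # :queued again allows a re-queue
    failed:  [:queued],
    done:    []
  }.freeze

  attr_reader :state

  def initialize
    @state = :pending # recorded before anything is handed to the queue
  end

  def transition_to(new_state)
    unless TRANSITIONS.fetch(@state).include?(new_state)
      raise ArgumentError, "cannot go from #{@state} to #{new_state}"
    end
    @state = new_state
  end
end

payment = Payment.new
payment.transition_to(:queued) # handed to the queue
payment.transition_to(:done)   # the worker reports completion
```

Keeping the state outside the queue is what makes the queue disposable: anything not yet `:done` can be found and queued again.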
And then like Dave, I always try to design where if somebody fires a machine gun into a computer that has my queue on it, I don’t care. It shouldn’t affect me, if at all possible, though there are some cases where that’s very difficult to do. But I wanna make it where the queue isn’t important to me. And generally, whenever I’m building a queue, I also add rake tasks that let me go in there and say, “Okay, re-queue all pending payments,” or whatever. You know, so that whenever something goes wrong, I just go in there and fire that one rake task and I don’t care anymore. It went through, it found all the pending ones and threw them back on the queue.
And then another thing I like to do just to take that one step further, I usually like to think, “What’s the reasonable amount of time I would expect something like this to sit in the queue, or how many times would I expect this to queue up?” I usually double that number and then write a cron job that goes through. And anything that’s been in that state, double the longer… the reasonable amount of time I thought of, I wanna get an email about it. I wanna know what happened.
DAVID: James, you and I are on exactly the same wavelength. We call it ‘the janitor process’. The janitor ran every night, and if he found anything that was unprocessed, he threw it back on the queue and said, “This needs to be cleaned up.” But then he sent a message to the management saying, “Did you know I had to clean up this job?”
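A rough sketch of such a janitor, combining James’s “double the reasonable time” rule with the re-queue step. The `Job` struct, `stale_jobs`, and `run_janitor` are hypothetical names, the queue is a plain array, and the returned strings stand in for the emails that would be sent:

```ruby
# Hypothetical job record; queued_at is when it was last put on the queue.
Job = Struct.new(:id, :state, :queued_at)

# Jobs that have sat unfinished for more than twice the expected time.
def stale_jobs(jobs, expected_seconds, now)
  cutoff = now - (2 * expected_seconds)
  jobs.select { |job| job.state != :done && job.queued_at < cutoff }
end

# Re-queues each stale job and returns the notices that would be emailed.
def run_janitor(jobs, queue, expected_seconds, now = Time.now)
  stale_jobs(jobs, expected_seconds, now).map do |job|
    queue << job.id
    job.queued_at = now
    "Job #{job.id} was stuck in #{job.state} and has been re-queued"
  end
end

now = Time.at(10_000)
jobs = [
  Job.new(1, :queued, now - 30),  # recent, left alone
  Job.new(2, :queued, now - 500), # stale, re-queued
  Job.new(3, :done,   now - 500)  # old but finished, left alone
]
queue = []
notices = run_janitor(jobs, queue, 60, now)
```

Run from cron or a rake task, something like this both repairs the backlog and tells a human that a repair was needed.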
JAMES: Right, I really think I can’t stress enough that… I try to design my systems where when they get into some scenario I didn’t imagine, the first thing they do is throw themselves in some kind of error state and start begging for help, right? Because the problem with a queue is that it all happens in the background, while nobody is watching, so I tend to miss things. “Oh, I never counted on the fact that that would go wrong. I didn’t think about that happening.” So usually, I find that if I can get them in that state immediately, especially if it’s an intermittent failure of some kind, if I can get them into that state and then get me looking into it while maybe the failure is still happening, it helps me figure out what’s going wrong and what I hadn’t been thinking of.
DAVID: Now, I’ve never had to do this with the queue, but it now occurs to me that it might be valuable also to notice how many times a job has been shot down the queue. And if it’s been shot down the queue 53 times, and it’s never come back, there’s a unit test there that needs to be written. You’ve got a job that cannot be finished.
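One way to notice a job that keeps getting sent down the queue is to count enqueues outside the queue itself. This is a sketch under assumptions: `AttemptTracker` and the `MAX_ATTEMPTS` threshold are made up for illustration, not taken from any queue library.

```ruby
MAX_ATTEMPTS = 5 # hypothetical threshold; pick one that suits the job

# Hypothetical bookkeeping kept outside the queue, keyed by job id.
class AttemptTracker
  def initialize
    @attempts = Hash.new(0)
  end

  # Records one more enqueue and reports whether to keep retrying.
  def record_enqueue(job_id)
    @attempts[job_id] += 1
    @attempts[job_id] <= MAX_ATTEMPTS ? :retry : :needs_investigation
  end

  def attempts_for(job_id)
    @attempts[job_id]
  end
end

tracker = AttemptTracker.new
verdicts = 6.times.map { tracker.record_enqueue(7) }
```

Once a job crosses the threshold, it is a candidate for the error state and the email rather than another trip through the workers.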
CHUCK: All right, I wanna jump in here. I wanna kind of change tracks for a second because I know that some people are wondering, “Well, what do you use background processes for?” And we’ve talked about payment processing and emails, and I know that there are a myriad of other things that we have used them for. What kinds of stuff have you guys used them for?
JOSH: PDF generation, billing reconciliation, geo coding stuff, sending emails…
JAMES: I would expand on geocoding stuff to be basically anything where you need to hit an external web service; that should almost always be done in the background. And then the other one you said, sending email, this one is interesting to me because it seems like everybody says, “Yeah, you should put that in the queue,” or whatever. I actually don’t agree that that should be in like Beanstalk or something like that. Sendmail is an email sending queue and it works great, with failover and all kinds of stuff. So what you really need to do is set up your box correctly and let sendmail take care of that like it’s supposed to.
JOSH: Okay, so you can’t do that all the time, especially when you are running on an EC2-based cloud system, because you can’t send mail from your box there, so you usually use something like SendGrid or what have you. An external service that sends your mail.
JAMES: That’s a good point.
AVDI: Wait, you can’t use sendmail?
JOSH: No, you can’t use sendmail out of EC2.
AVDI: Oh. What?
DAVID: Wait, what?
AVDI: Because I mean, I’m taking it back to like … days where we were running everything in the cloud, and I remember transitioning from doing direct SMTP sends to using the built-in Postfix or whatever it was on all the boxes.
DAVID: Let me just put the question back to Josh. Why not? Is that like blocked by policy or they are trying … on spammers?
JOSH: Yeah, it’s an anti-spam provision.
JOSH: And I don’t know all the details, but I know I’ve never been able to do it. And Amazon even rolled out a product earlier this year to do mailing from their cloud servers.
JAMES: Right. But SendGrid still does.
JOSH: I like SendGrid, it’s pretty good. So yeah, I think you are right James that if you can do sendmail on your machine and the latency is nice and low, that’s fine.
CHUCK: Yeah, that’s what I was going to point out: if you are making round trips to your email server and you are sending out one or two emails, you are going to have a slowdown. I had a project that I was working on for a client, and basically, what would happen is if somebody commented on an article on their website, then everybody else who commented was supposed to get an email saying, “So and so said this on this article.” And there was one article that got a ton of traffic and a ton of comments, and after a while, it was literally taking like 4 or 5 seconds for it to send out all the emails. And so yeah, we wound up backgrounding that thing, because otherwise, it would just take too long and people would actually put the same comment in again because they didn’t know that the system had already accepted it.
DAVID: Yeah, you had a flash bomb on your server. So sendmail, I wanna say, just uses flat files on a disk, so writing an email to the database is almost certainly going to take longer than it does to give it to sendmail. On the other hand however, if you flash bomb the mail server, you run into problems like the Linux file system running out of space or the disk quota for the sendmail user and that kind of thing. That’s technical stuff. Who cares, right?
CHUCK: Yeah, I think it just comes back to a lot of what we are talking about in that we have this tradeoff, right? If something can be done quickly, serially, in our app, then that’s what we do. But if the tradeoff is “Gee, this takes way too long,” or “This process really shouldn’t worry about whether or not this other stuff is going on,” then we push it off to the background process because the tradeoffs are worth it. And it’s the same thing when we are choosing our background processor or queue system or whatever, or writing our workers: we are making decisions as far as the tradeoffs in how we handle the processes or the jobs that we’re getting, and how they behave and how fault tolerant we need to be and all that stuff. We’re making tradeoffs in the amount of work we do, and in the way that it’s all handled, so that it meets our needs.
JOSH: I have one more use case that I wanna add; and that’s you can use a queue to collate or collapse, combine, a number of events into an aggregate event. And so like if you are on Facebook and you do a status update and five of your friends comment on it all within a minute, you don’t wanna get five emails saying, “All these different people commented.” You wanna wait a couple of minutes and then send an email saying, “Oh, a, b and c all commented on your post.” And so it’s great that you can go and say, “Okay, I’m going to throw something at the queue and then when it comes out of the queue a minute later, I’ll go and I’ll send out something that combines all of that information.”
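The collating step Josh describes might look something like this once the delayed job comes off the queue. The `Comment` struct and `collate` method are hypothetical, and the array named `window` stands in for whatever events piled up during the wait:

```ruby
# Hypothetical event shape: who commented on which post.
Comment = Struct.new(:post_id, :author)

# Collapses the comments gathered during the waiting window into
# one notification per post.
def collate(comments)
  comments.group_by(&:post_id).map do |post_id, group|
    names = group.map(&:author).uniq
    "#{names.join(', ')} commented on post #{post_id}"
  end
end

window = [
  Comment.new(1, "a"),
  Comment.new(1, "b"),
  Comment.new(2, "x"),
  Comment.new(1, "c")
]
notifications = collate(window)
```

Four events become two emails here, one per post, instead of one email per comment.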
DAVID: That’s interesting. Because the example I was going to give is actually that example, only run backwards in time. Which is, we’ve talked a lot about queues: you wanna use a queue anytime stuff is going to be too slow. You also wanna use a queue when using a queue will make things go much, much faster. And by that, what I mean is let’s say you go and you update your status on your web service and it needs to update Twitter, it needs to update Facebook, it needs to update Google+. And if you do these sequentially, you’ve got to wait for all three of these services to come back.
But if you throw it on a worker queue and give it to some supervisor process, he’s going to farm it out to three separate queues; like Twitter queue, Facebook queue and a Google+ queue. And so basically, “Threading is hard, let’s use queues.” God help you if you really think that’s a smart solution to the threading problem. But, if you wanna get off process and get on to multiple servers, this makes it so that if Twitter is down again, you can get posted to Facebook and to Google+ right away and your response came back immediately because as soon as the supervisor accepted the job to post to all three services, your webpage is done. And you are like, “Oh good, we’ve sent your message.”
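A small in-process sketch of that fan-out, using Ruby’s built-in `Queue` and threads in place of a real queue server. The service names come from David’s example; the `accept` method, the simulated Twitter outage, and the `:stop` sentinel are assumptions made for illustration:

```ruby
require "thread"

SERVICES = %i[twitter facebook google_plus].freeze

# One in-process queue per service; a real system would use a queue server.
queues = SERVICES.each_with_object({}) { |name, h| h[name] = Queue.new }
delivered = Queue.new

# Hypothetical per-service workers. The Twitter one simulates an outage.
workers = SERVICES.map do |name|
  Thread.new do
    while (status = queues[name].pop) != :stop
      next if name == :twitter # pretend Twitter is down again
      delivered << [name, status]
    end
  end
end

# The supervisor only fans the job out; it never waits on a service.
def accept(queues, status)
  queues.each_value { |q| q << status }
  :accepted
end

reply = accept(queues, "hello world") # the web request can return here

# Shut the workers down so the example terminates.
queues.each_value { |q| q << :stop }
workers.each(&:join)

posted = []
posted << delivered.pop until delivered.empty?
```

The reply is available as soon as the job is handed off, and the outage on one service does not hold up the other two.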
JOSH: So when you have queue worker processes that take their input off of one queue and put their output on another queue, and then someone else does the same thing, when does that turn into MapReduce?
JAMES: That’s a good question. Maybe a whole nother episode.
DAVID: Yeah. There’s good argument that that is the definition of map.
JAMES: Of map, yeah. I have done some of that kind of pipeline processing before, and I just wanted to speak to David’s point about splitting up into multiple workers. Like I had a scenario where I needed one worker that needed a very specialized piece of software, and it only ran in Java, so I just put that worker on a different server that ran JRuby and used JRuby’s ability to call into Java and get done what I needed done…
DAVID: Here’s a queue because I need a different Ruby.
JAMES: Right. Just so I could put it somewhere else and I wouldn’t have to worry. And that server had a very specific install requirement, you know, that was a bunch of pain I didn’t want to deal with in my normal app server and stuff, so let me just separate it out.
JOSH: I’ve totally done that too. [Chuckles]
JAMES: Yeah, it’s a great trick.
DAVID: We have a PDF; when the customer fills it out, we fill in all the fields. We built a dynamic form-fill PDF and we are using the iText library, which is Java-based, but our Ruby app is MRI 1.0, and that’s not JRuby. And so yeah, we do the same thing; we send it off to a contract signing service and that’s all running under JRuby and it flattens… using the Java library, it flattens the contract, conditionally signs it and puts it on the document store. And then it goes back to the… the MRI app can then say, “Oh yeah, look, it’s done!”
CHUCK: Okay. Well I’m going to wrap it up there. We need to get on to the picks. Just real quick, if you are a new listener, the picks are basically anything that we have, that we wanna recommend that we liked or that we’ve used. It can be anything from technical stuff that makes our workflow easier, to I mean, we’ve had TV shows and movies, Legos, toys, any kind of thing like that picked as well. So let’s go ahead and start with Dave. Dave, what’s your pick this week?
DAVID: Are we doing one or two picks today?
CHUCK: You can do as many as you want.
DAVID: I’m going to do two picks. The first one, which is relevant to our interests, is Service-Oriented Design with Ruby and Rails by Paul Dix. You can get that on Amazon for about $35, and it’s absolutely brilliant. He did not adhere to a specific, like, religious RESTful interaction versus true SOA, yada, yada, yada. He basically says, “No, you know what, here’s the corners you can cut, here’s why you shouldn’t cut them, but here’s when you should and here’s how we are going to do it.” It’s a fantastic book. It’s from Addison-Wesley, so it’s a red and black book. And I’m enjoying that immensely. I’m getting a lot of really good ideas from it. One great idea from it is that if you need to map or reduce or have a composite job, you should have a supervisor worker who does no work other than supervise other workers. And if you have a worker worker, he should not be supervising anything. And so he ends up building a nice tree where there is no data stored in the upper nodes but everything is done down the work… everything is done down in the worker leaves… so that’s my first pick.
My second pick is it’s just kind of an ADD-inspired purchase. I was at a bookstore buying a different book and I happen to see Enchantment by Guy Kawasaki. The subtitle is the Art of Changing Hearts, Minds, and Actions. And it’s a fantastic book on basically how not to be an asshole. He talks about kind of what enchantment is, what it means to kind of get motivated, how to get people to… and he gives really obvious things like be accepting of others, don’t judge other peoples values. And then he has really some controversial or interesting advice. He gives three pages on the advice. He says, “You should swear.” And I don’t like to swear. I try not to swear, but he gives three pages on where to swear and how to swear, as a way of increasing the amount of persuasion ability that you use in a room. I don’t wanna call out the entire book based on that one thing, but I thought that was really interesting. The whole book is basically just how to get people engaged and involved, and get them to want to work with you and that’s my other pick.
CHUCK: All right, sounds good. Let’s go ahead and have Avdi go next.
AVDI: All right, so first of all, there’s an article that just came out on the thoughtbot blog, the Giant Robots blog, by Joe Ferris entitled, “If you gaze into nil, nil gazes also into you.” And this was making the rounds a bit today, but I really liked it. If anybody has seen one of my talks, they know that I’m big on eliminating nil where we find it, and replacing it with something more meaningful than nil. And so he gives several techniques for replacing nil with something more meaningful and useful, and I completely endorse all of the techniques that he goes over.
Another thing that I’ve been getting some value from: I don’t read Hacker News, but there’s a service called Hacker Monthly, where they take a few of the top stories from Hacker News over the course of the month and they format them really nicely and put them together into a PDF magazine format, just like 4-5 articles. And they also publish it in EPUB and MOBI, and stuff like that. And it’s a nice read, and it’s usually pretty good picks, so that’s a nice way of going over some of the top articles in programming over the past month.
CHUCK: All right, that sounds interesting. There are a lot of aggregators like that I would love to read up more on and Hacker News is one of them, but I just don’t have time, so that sounds pretty nice. Josh, go ahead.
JOSH: All right, okay so I think it’s been mentioned in the podcast before in passing, but not as a pick, and I wanna put in a plug for RubyThere.com. That’s a site about Ruby regional conferences, and it talks about what conferences are coming up. And also, which conferences have open CFPs or ‘calls for participation’. So if you wanna speak at a conference, it’s a good place to go and find out where you can submit your proposals to speak. So wanna go to a conference, wanna speak at a conference, rubythere.com. And I looked at it a few minutes ago and the site was down, so I hope it’s up again by the time this comes out. So that’s one.
The other one is Code for America. And this is a public service organization; it’s programmers doing projects that contribute to good, and they help with things around visibility of government information, transforming how people vote, education system improvements, things about just city governments, things like that. So codeforamerica.org, and Code for America has these fellowship programs where you can go and work for them for a year, and get paid a reasonable amount of money to do something that contributes to the specific good. And the projects are also open source, and many of them are being done in Rails or Ruby, some in Python and other languages. So they have all these projects and you can actually get involved and contribute to the projects without having to stop your life for a year and you’ll be a good fellow working for them. So there’s plenty of ways that you can contribute just as part of your ordinary open source hacking. So that’s codeforamerica.org.
CHUCK: Was code for America the organization that was on the Change Log podcast last week?
JAMES: Yes it was.
JOSH: Oh, it was? I haven’t even heard that podcast yet. There hasn’t been a Change Log podcast in so many weeks. I’d given up on it.
CHUCK: [Chuckles] Yeah, they had quite a good discussion. Kind of made you think about getting involved.
JOSH: Oh wow. Damn, stole my thunder.
JOSH: [Chuckles] It’s still a good organization to check out and encourage people to get involved.
CHUCK: Yeah, definitely. James, go ahead.
JAMES: So first I just wanted to add an extra recommendation for that article Avdi mentioned. That’s like one of those big things, like you have all these things in computing where you feel like you level up as soon as you learn it, you get the extra experience and you go up a level. And for me, learning that nil was usually an extra case you don’t need was definitely one of those scenarios. Like some of my favorite examples to look at are where you’ll find methods that either return nil or they return an array of elements to operate on, right? So usually in that method, you have like an if-else, like if there’s nothing, then return nil; else, return this array of objects.
And then if you look at the code that calls that, it usually has an if-else. And it’s like, if it returns nil, then I need to do this, or otherwise I got the array and I’ll operate on all these elements. And it turns out if you just go in and remove the nil part in the first method, you can also remove the nil part in the next method, most of the time. You just return an array of elements that you need to operate on and remember that that array may be empty. So you get an empty array, and then you operate on nothing and nothing is nothing, so you know, it generally just turns out you don’t even need the special case… so for me, that was one of those areas where I felt like I leveled up. I read that article after Avdi tweeted about it earlier today, and it’s a great article so you should go check that out. I agree.
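The before and after that James describes can be sketched as follows. The invoice data and all four method names are hypothetical, chosen only to show the shape of the refactoring:

```ruby
# Hypothetical lookup that signals "nothing found" with nil.
def overdue_invoices_or_nil(invoices)
  found = invoices.select { |inv| inv[:overdue] }
  found.empty? ? nil : found
end

# The caller then has to special-case nil.
def remind_with_nil_check(invoices)
  found = overdue_invoices_or_nil(invoices)
  if found.nil?
    []
  else
    found.map { |inv| "Reminder for invoice #{inv[:id]}" }
  end
end

# Same lookup, always returning an array (possibly empty).
def overdue_invoices(invoices)
  invoices.select { |inv| inv[:overdue] }
end

# No branch needed: mapping over an empty array does nothing.
def remind(invoices)
  overdue_invoices(invoices).map { |inv| "Reminder for invoice #{inv[:id]}" }
end

invoices = [{ id: 1, overdue: true }, { id: 2, overdue: false }]
```

Both versions produce the same reminders, but the second drops one conditional in the lookup and another in the caller.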
And I have to do that because I’m going to recommend some non-code stuff, so you know, that’s me. I always recommend non code stuff, right? That because maybe this is on… I should probably shut up now. I always find like the non-programming side of our lives interesting. Like for example, if you are a programmer and that’s all you know, then you don’t really know anything, right? Programming is useless by itself, right? It’s only when you combine it with other knowledge in other areas does it become a significant skill, right?
DAVID: Hands down the best programmers I’ve ever worked with are guys who had physics or engineering degrees, and then learned to program so that they can apply their physics and engineering knowledge.
JAMES: Right, exactly. If all you know is programming, I mean, what are you going to do? There’s only so many text editors you can build, right? So, anyways, my recommendations this week are for philosophy. I have a bunch of friends who are into philosophy, and they are always nagging me because I’m like the philosophy idiot. I’ll confess that I used to find it very boring and semantic, so I used to get totally turned off by those arguments. So I’ve been on like the eternal search for years to find sources for learning philosophy that I could stomach, and I finally found two that I am enjoying.
So my recommendations are, first, the book: it’s The Philosophy Gym and it’s by Stephen Law. And it’s basically just 25 thought experiments, is how I would describe it. So you know, they give you a chapter and it’s like you’re sitting on a couch, you are doing your thing and an alien pops into your head and says, “Oh, by the way, I just thought you should know, you are not really there on Earth anymore; you are actually a brain in a vat in my laboratory and I’ll prove it to you.” And how does that affect your world, kind of thing. And it talks around the different issues and stuff like that. So I find it very easy to get into. Like, I go to sleep as soon as I start reading definitions about moral relativism and stuff like that, but this book does it in a much more digestible way, so I’ve been enjoying that a lot.
And then the podcast is Philosophy Bites, which is a similar thing. It’s generally about a 12-minute podcast where they go through one topic. So one I listened to just yesterday, I think it was on animal rights, using animals for our food and what are the moral implications of that and stuff like that. So it’s simple, it’s sweet, it’s short, and usually short enough that I don’t lose interest in it. There are definitely some episodes I found boring. They’re a bunch of kind of stuffy English guys, so sometimes they get a little hung up on what did Plato think of Socrates’ version of whatever. It can get kind of boring, but look through the episode list and you’ll be able to pretty much tell: if you can follow the logic in the title, then it’s probably something that will be worth listening to, you know? Those are my recommendations. Those are what finally turned me on to philosophy.
DAVID: Very cool.
CHUCK: Yeah, very nice. So my picks, my first pick is actually a podcast; it’s a series of videos called Ask A Ninja. I don’t know if you guys have watched that at all.
CHUCK: It is hilarious. It is just funny. And I’ve really, really been enjoying it. Every time they put out a new video, I just wind up laughing my head off. And so, that’s one pick that I have. And the other pick that I have is Ruby 1.9. And I have to admit that I’ve been lazy, and I just kind of left everything on Ruby 1.8 forever and ever. And I discovered that you can do rvm use 1.9.2. I think the stable release according to ruby-lang.org is p290. And so I did an rvm install and then an rvm use, and then I did --default, and so now my default Ruby is Ruby 1.9. The other thing that’s nice about RVM is that when you do your gem installs, you don’t have to run sudo. And I’ve been using gemsets for all of my… so I guess this is a pick for RVM as well, because I’ve been using gemsets for all of my projects.
But the other nice thing is it just turns out that a lot of the stuff that I would have to include or need by default in 1.8.7 as a gem is already in the core for Ruby 1.9.2. And the most recent example of that is actually FasterCSV. Thanks, James. And I had to do CSV import for a client, and I went and I installed FasterCSV and then I got the error message and uninstalled FasterCSV, and figured out that it was already there and that I could already use it — which I kind of knew — but I didn’t know how to get to it until I did a little bit of Google work. But it’s been really nice, and it’s a lot faster and there are just some excellent features in there that I’m starting to get to know a little bit better.
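For anyone in the same spot: in Ruby 1.9 the FasterCSV code became the standard `csv` library, so no gem install is needed. A minimal sketch, with made-up sample data:

```ruby
require "csv" # standard library in Ruby 1.9 and later

# Made-up sample data for illustration.
data = "name,pick\nChuck,Ruby 1.9\nJames,Philosophy Bites\n"

# headers: true treats the first line as column names.
rows = CSV.parse(data, headers: true).map do |row|
  { name: row["name"], pick: row["pick"] }
end
```

Code written against the FasterCSV gem generally needs its `require` line and the `FasterCSV` constant switched over to `CSV`.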
DAVID: I had a production server break and a customer called me because I used FasterCSV, and I upgraded it to Ruby 1.9, so thanks, James. You can turn that right back on me and say, “Why didn’t you have a unit test on that?” So, I’m the idiot here.
DAVID: [Chuckles] You guys are silent. Yup.
JAMES: No comment.
CHUCK: Yeah, Dave, you are such an idiot. Geez.
CHUCK: All right, we’re going to go ahead and wrap this episode up. We’re just over an hour. I wanna thank our panelists for coming.
DAVID: You guys are so awesome.
CHUCK: Just real quick, we’ll let you know who they are. We have in no particular order, David Brady.
DAVID: No particular order.
DAVID: But I’m first.
CHUCK: Yeah. Avdi Grimm.
AVDI: Happy [inaudible].
CHUCK: James Edward Gray.
JAMES: I love you guys. Group hug!
CHUCK: And Josh Susser.
JOSH: It’s fun as always.
CHUCK: And I’m Charles Max Wood. Now, there are a few things that you are probably going to wanna know. First off, there are links to most of the things we’ve talked about at rubyrogues.com. So if you are trying to figure out what’s going on there, you can go to rubyrogues.com, look at the show notes and get all of the information.
We also had Derek Prior, I think his name is; he actually compiled all of our picks into a gist on GitHub and so I’ll put a link to that in the show notes as well. So if you don’t wanna page through the different picks, then you can just go look at his list. We may add a list like that to the site, and I’ll probably talk to him about that.
JAMES: We’ve got a fan! How freaking cool is that?
CHUCK: Yeah, it is freakin cool. You can also get this on iTunes, so just go to iTunes, do a search for Ruby Rogues, we come right up to the top. And leave us a review if you are enjoying the podcast. I’m getting a lot of emails and tweets about the podcast, and so by all means, you can keep those coming. And I need to share more of those with the panelists, but it’s just a lot of fun to hear about that.
And I’ve also had a lot of people adding me on Google+ and I don’t know if you guys have as well, but there are people that I don’t know that I’m pretty sure are listening to this podcast. So, drop me a note or drop the other panelist a note and just let them know, “Hey I’m following you or I’m adding you to my circle and I’m enjoying the podcast.” I think we’d all appreciate that.
DAVID: Agreed. Agreed. Oh, oh! Because I have ADD, I forgot to mention at the beginning of the show that I also have the ADDCasts with Pat Maddox. I’m just throwing that out there so that Pat doesn’t go, “Dude, are we still doing that?” I just forgot to mention.
CHUCK: Yeah, and that is actually a fun show to listen to. About half of it is about code and the other half is about whatever shiny thing floats by them and so…
DAVID: [Laughs] Yeah, half of it is crap.
CHUCK: But yeah, anyway so just thanks again for listening. We’ll catch you next week! It will either be a live show from Lonestar Ruby Conf — in which some of us will be there, some of us won’t — or if not that, then we’ll be doing an episode on becoming a developer. So, look forward to that, and we’ll catch you next week.
JAMES: Bye everybody!