Web Operations 101 For Developers

Posted on 25 Jul 2011 by Mathias Meyer

This post is not about devops, it's not about lean startups, it's not about webscale, it's not about the cloud, and it's not about continuous deployment. This post is about you, the developer whose main purpose in life has always been to build great web applications. In a pretty traditional world you write code, you write tests for it, you deploy, and you go home. Until now.

To tell you the truth, that world has never existed for me. In all of my developer life I had to deal with all aspects of deployment, not just putting build artifacts on servers, but dealing with network outages, faulty network drivers, crashing hard disks, sudden latency spikes, and analyzing errors coming from those pesky crawling bots on that evil internet of yours. I take a lot of this for granted, but working in infrastructure and closely with developers trying to get applications and infrastructure up and running on EC2 has taught me some valuable lessons, above all to assume the worst. Not because developers are stupid, but because they like to focus on code, not infrastructure.

But here's the deal: your code and all your full-stack and unit tests are worth squat if they're not running out there on some server or infrastructure stack like Google App Engine or Heroku. Without running somewhere in production, your code doesn't generate any business value, it's just a big pile of ASCII or UTF-8 characters that cost a lot of money to create, but didn't offer any return on investment yet.

Love Thy Infrastructure

Operations isn't hard, but necessary. You don't need to know everything about operations to become fluent in it, you just have to know enough to start and know how to use Google.

This is my collective dump from the last years of working both as a developer and that guy who does deployments and manages servers too. Most are lessons I learned the hard way, others just seemed logical to me when I learned about them the first time around.

Between you and me, having this skill set at hand makes you a much more valuable developer. Being able to analyze any problem in production and at least having a basic skill set to deal with it makes you a great asset for companies and clients to hold on to. Thought you should know, but I digress.

The most important lesson I can tell you right up front: love your infrastructure, it's the muscles and bones of your application, whereas your code running on it is nothing more than the skin.

Without Infrastructure, No-one Will Use Your Application

Big surprise. For users to be able to enjoy your precious code, it needs to run somewhere. It needs to run on some sort of infrastructure, and it doesn't matter if you're managing it, or if you're paying another company to take care of it for you.

Everything Is Infrastructure

Every little piece of software and hardware that's necessary to make your application available to users is infrastructure. The application server serving and executing your code, the web server, your email delivery provider, the service that tracks errors and application metrics, the servers or virtual machines your services are running on.

Every little piece of it can break at any time, can stall at any time. The more pieces you have in your application puzzle, the more breaking points you have. And everything that can break, will break. Usually not all at once, but most certainly when it's the least expected, or just when you really need your application to be available.

On Day One, You Build The Hardware

Everything starts with a bare metal server, even that cloud you've heard so much about. Knowing your way around everything that's related to setting up a full rack of servers on a single day, including network storage, a fully configured switch with two virtual LANs, and a master-slave database setup using a RAID 10 with a bunch of SAS drives, might not be something you need every day, but it sure comes in handy.

The good news is the internet is here for you. You don't need to know everything about every piece of hardware out there, but you should be able to investigate strengths and weaknesses, when an SSD is an appropriate tool to use, and when SAS drives will kick butt.

Learn to distinguish the different levels of RAID, why having an additional file system buffer on top of a RAID that doesn't have a backup battery for its own internal write buffer is a bad idea. That's a pretty good start, and will make decisions much easier.

The System

Do you know what swap space is? Do you know what happens when it's used by the operating system, and why it's actually a terrible thing and gives a false sense of security? Do you know what happens when all available memory is exhausted?

Let me tell you:

  • When all available memory is allocated, the operating system starts swapping out memory pages to swap space, which is located on disk, a very slow disk, slow like a snail compared to fast memory.
  • When lots of stuff is written to and read from swap space on disk, I/O wait goes through the roof, and processes start to pile up waiting for their memory pages to be swapped out to or read from disk, which in turn increases load average, and almost brings the system to a screeching halt, but only almost.
  • Swap is terrible because it gives you a false sense of having additional resources beyond the available memory, while what it really does is slowing down performance in a way that makes it almost impossible for you to log into the affected system and properly analyze the problem.
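
If you want to see whether a box is already dipping into swap, the numbers are right there in the kernel's accounting. Here's a minimal sketch, assuming a Linux system with /proc/meminfo; the zero threshold is just for illustration:

    # Report swap usage on a Linux box; values in /proc/meminfo are in kB.
    meminfo = Hash[File.readlines("/proc/meminfo").map do |line|
      key, value = line.split(":")
      [key, value.to_i]
    end]

    swap_used_kb = meminfo["SwapTotal"] - meminfo["SwapFree"]
    puts "Swap in use: #{swap_used_kb} kB"
    warn "This box is swapping, expect I/O wait to climb." if swap_used_kb > 0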

This is basic operations knowledge on the operating system level. It's not much you need to know here, but in my opinion it's essential. Learn about the most important aspects of a Unix or Linux system. You don't need to know everything, you don't need to know the specifics of Linux' process scheduler or the underlying data structure used for virtual memory. But the more you know, the more informed your decisions will be when the rubber hits the road.

And yes, I think enabling swap on servers is a terrible idea. Let processes crash when they don't have any resources left. That at least will allow you to analyze and fix.

Production Problems Don't Solve Themselves

Granted, sometimes they do, but you shouldn't be happy about that. You should be willing to dig into whatever data you have after the fact to find whatever went wrong, whatever caused a strange latency spike in database queries, or caused an unusually high amount of errors in your application.

When a problem doesn't solve itself though, which is certainly the common case, someone needs to solve it. Someone needs to look at all the available data to find out what's wrong with your application, your servers or the network.

This person is not the unlucky operations guy who's currently on call, because let's face it, smaller startups just don't have an operations team.

That person is you.

Solve Deployment First

When the first line of code is written, and the first piece of your application is ready to be pushed on a server for someone to see, solve the problem of deployment. This has never been easier than it is today, and being able to push incremental updates from then on speeds up development and the customer feedback cycle considerably.

As soon as you can, build that Capfile, Ant file, or whatever build and deployment tools you're using, set up servers, or set up your project on an infrastructure platform like Scalarium, Heroku, Google App Engine, or dotCloud. The sooner you solve this problem, the easier it will be to finally push that code of yours into production for everyone to use. I consider application deployment a solved problem. There's no reason why you shouldn't have it in place even in the earliest stages of a project.
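
To give you an idea of how little is needed to get going, here's a minimal sketch of a Capistrano-style config/deploy.rb; the hostnames, repository and paths are made up, and your stack of choice may look entirely different:

    # config/deploy.rb -- enough for `cap deploy` to push code to two app servers.
    set :application, "myapp"
    set :repository,  "git@example.com:myapp.git"
    set :scm,         :git
    set :deploy_to,   "/var/www/myapp"

    role :web, "app1.example.com", "app2.example.com"
    role :app, "app1.example.com", "app2.example.com"
    role :db,  "db1.example.com", :primary => true

Once that file exists, every deployment is one command instead of a manual checklist.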

The more complex a project gets, even over just its initial lifecycle, the easier it will be to add more functionality to an existing deployment setup instead of having to build everything from scratch.

Automate, Automate, Automate

Everything you do by hand, you should only be doing once. If there's any chance that particular action will be repeated at some point, invest the time to turn it into a script. It doesn't matter if it's a shell, a Ruby, a Perl, or a Python script. Just make it reusable. Typing things into a shell manually, or updating configuration files with an editor on every single server is tedious work, work that you shouldn't be doing manually more than once.

When you automate something once, it not only greatly increases execution speed the second and third time around, it reduces the chance of failure, of missing that one important step.
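
Even a tiny script pays off quickly. A minimal sketch, assuming SSH key-based access and made-up hostnames, that runs the same command on a handful of servers instead of typing it into each shell by hand:

    #!/usr/bin/env ruby
    # run_on_all: execute one command on every server in the list.
    SERVERS = %w[app1.example.com app2.example.com db1.example.com]

    command = ARGV.join(" ")
    abort "usage: run_on_all <command>" if command.empty?

    SERVERS.each do |host|
      puts "== #{host}"
      system("ssh", host, command) or warn "failed on #{host}"
    end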

There's an abundance of tools available to automate infrastructure, hand-written scripts are only the simplest part of it. Once you go beyond managing just one or two servers, tools like Chef, Puppet and MCollective come in very handy to automate everything from setting up bare servers to pushing out configuration changes from a single point, to deploying code. Everything should be properly automated with some tool. Ideally you only use one, but looking at Chef and Puppet, both have their strengths and weaknesses.

Changes in Chef aren't instant, unless you use the command line tool knife, which assumes SSH access to all servers you're managing. The bigger your organization, the less chance you'll have of being able to access all machines via SSH. Tools like MCollective that work based on a push agent system are much better for these instant kinds of activities.

It's not important what kind of tool you use to automate, what's important is that you do it in the first place.

By the way, if your operations team restricts SSH access to machines for developers, fix that. Developers need to be able to analyze and fix incidents just like the operations folks do. There's no valid point in denying SSH access to developers. Period.

Introduce New Infrastructure Carefully

Whenever you add a new component, a new feature to an application, you add a new point of failure. Be it a background task scheduler, a messaging queue, an image processing chain or asynchronous mail delivery, it can and it will fail.

It's always tempting to add shiny new tools to the mix. Developers are prone to trying out new tools even though they've not yet fully proven themselves in production, or experience running them is still sparse. It's a good thing in one way, because without people daring to use new tools, everyone else won't be able to learn from their experiences (you do share those experiences, don't you?).

But on the other hand, you'll live the curse of the early adopter. Instead of benefiting from existing knowledge, you're the one bringing the knowledge into existence. You'll experience all the bugs that are still lurking in the darker corners of that shiny new database or message queue system. You'll spend time developing tools and libraries to work with the new stuff, time you could just as well be spending working on generating new business value by using existing tools that do the job similarly well. If you do decide on a new tool, be prepared to fall back to other tools in the case of failure.

No matter if old or new, adding more infrastructure always has the potential for more things to break. Whenever you add something, be sure to know what you're getting yourself into, be sure to have fallback procedures in place, be sure everyone knows about the risks and the benefits. When something that's still pretty new breaks, you're usually on your own.

Make Activities Repeatable

Every activity in your application that causes other, dependent activities to be executed needs to be repeatable, either by the user, or through some sort of administrative interface, or automatically if feasible. Think user confirmation emails, generating monthly reports, background tasks like processing uploads. Every activity that's out of the normal cycle of fetching records from a datasource and updating them is bound to fail. Heck, even that cycle will fail at some point due to some odd error that only comes up every once in a blue moon.

When an activity is repeatable, it's much easier to deal with outages of single components. When it comes back up, simply re-execute the tasks that got stuck.

This, however, requires one important thing: every activity must be idempotent. It must have the same outcome no matter how often it's run. It must know what steps were already taken before it broke the last time around. Whatever's already been done, it shouldn't be done again. It should just pick up where it left off.
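
A minimal sketch of what that looks like in practice, assuming a Rails-style model with a confirmation_sent_at column and a made-up mailer; re-running the task after a crash won't send the email twice:

    def send_confirmation_email(user)
      # Already done before the last crash? Then there's nothing left to do.
      return if user.confirmation_sent_at

      UserMailer.confirmation(user).deliver
      user.update_attribute(:confirmation_sent_at, Time.now)
    end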

Yes, this requires a lot of work and care for state in your application. But trust me, it'll be worth it.

Use Feature Flips

New features can cause joy and more headaches. Flickr was one of the first to add something called feature flips, a simple way to enable and disable features for all or only specific users. This way you can throw new features onto your production systems without accidentally enabling them for all users, you can simply allow a small set of users or just your customer to use them and to play with them.

What's more important though, when a feature breaks in production for some reason, you can simply switch it off, disabling traffic on the systems involved, allowing you to take a breather and analyze the problem.

Feature flips come in many flavors, the simplest approach is to just use a configuration file to enable or disable them. Other approaches use a centralized database like Redis for that purpose, which has an added benefit for other parts of your application, but also adds new infrastructure components and therefore, more complexity and more points of failure.
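
The configuration file flavor can be as simple as this sketch, assuming a features.yml that ships with every deploy; the flag names are made up:

    # config/features.yml contains, for example:
    #   new_search: false
    #   image_uploads: true
    require "yaml"

    FEATURES = YAML.load_file("config/features.yml")

    def feature_enabled?(name)
      FEATURES.fetch(name.to_s, false)
    end

    if feature_enabled?(:new_search)
      # render the shiny new search
    else
      # stick to the old, boring, working code path
    end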

Fail And Degrade Gracefully

What happens when you unplug your database server? Does your application throw in the towel by showing a 500 error, or is it able to deal with the situation and show a temporary page informing the user of what's wrong? You should try it and see what happens.

Whenever something non-critical breaks, your application should be able to deal with it without anything else breaking. This sounds like an impossible thing to do, but it's really not. It just requires care, care your standard unit tests won't be able to deliver, and thinking about where you want a breakage to leak to the user, or where you just ignore it, picking up work again as soon as the failed component becomes available again.
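
As a minimal sketch, assuming a made-up wrapper around the Twitter API backing a non-critical sidebar widget: catch the failure close to the call, log it, and render the page without the widget instead of a 500:

    def recent_tweets(user)
      TwitterClient.timeline(user)   # hypothetical wrapper around the external call
    rescue StandardError => e
      warn "twitter fetch failed: #{e.message}"
      []                             # an empty sidebar beats an error page
    end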

Failing gracefully can mean a lot of things. There are things that directly affect user experience, a database failure comes to mind, and things that the user will notice only indirectly, e.g. through delays in delivering emails or fetching data from an external service like Twitter, RSS feeds and so on.

When a major component in your application fails, a user will most likely be unable to use your application at all. When your database latency increases manifold, you have two options. Try to squeeze through as much as you can, accepting long waits on your user's side, or let him know that it's currently impossible to serve him in an acceptable time frame, and that you're actively working on fixing or improving the situation. Which you should, either way.

Delays in external services or asynchronous tasks are much harder for a user to notice. If fetching data from an external source, like an API, directly affects your site's latency, there's your problem.

Noticing problems in external services requires two things: monitoring and metrics. Only by tracking queue sizes, latency for calls to external services, mail queues and all things related to asynchronous tasks will you be able to tell when your users are indirectly affected by a problem in your infrastructure.

After all, knowing is half the battle.

Monitoring Sucks, You Need It Anyway

I've written in abundance on the virtues of monitoring, metrics and alerting. I can't say often enough how important it is to have a proper monitoring and metrics gathering system in place. It should be by your side from day one of any testing deployment.

Set up alerts for thresholds that seem like a reasonable place to start. Don't ignore alerting notifications; once you get into that habit, you'll miss that one important notification that's real. Instead, learn about your system and its thresholds over time.

You'll never get alerting and thresholds right the first time, you'll adapt over time, identifying false negatives and false positives, but if you don't have a system in place at all, you'll never know what hit your application or your servers.

If you're not using a tool to gather metrics like Munin, Ganglia, New Relic, or collectd, you'll be in for a big surprise once your application becomes unresponsive for some reason. You'll simply never find out what the reason was in the first place.

While Munin has basic built-in alerting capabilities, chances are you'll add something like Nagios or PagerDuty to the mix for alerting.

Most monitoring tools suck, you'll need them anyway.

Supervise Everything

Any process that's required to be running at all times needs to be supervised. When something crashes, be sure there's an automated procedure in place that will either restart the process or notify you when it can't do so, degrading gracefully. Monit, God, bluepill, supervisord, runit, the number of tools available to you is endless.
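
For the Rubyists, a minimal sketch of what supervision looks like with the god gem; the process name and start command are made up, and the other tools have equivalent configs:

    # myapp-worker.god
    God.watch do |w|
      w.name     = "myapp-worker"
      w.start    = "bundle exec ruby worker.rb"
      w.interval = 30.seconds

      # Restart the process whenever it's not running anymore.
      w.start_if do |start|
        start.condition(:process_running) do |c|
          c.running = false
        end
      end
    end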

Micromanaging people is wrong, but processes need that extra set of eyes on them at all times.

Don't Guess, Measure!

Whatever directly affects your users' experience affects your business. When your site is slow, users will shy away from using it, from generating revenue and therefore (usually) profit.

Whenever a user has to wait for anything, they're not willing to wait forever. If an uploaded video takes hours to process, they'll go to the next video hosting site. When a confirmation email takes hours to be delivered, they'll check out your competitor, taking the money with them.

How do you know that users have to wait? Simple, you track how long things in your application take, how many tasks are currently stuck in your processing queue, how long it took to process an upload. You stick metrics on anything that's directly or indirectly responsible for generating business value.
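
A minimal sketch of what that instrumentation can look like; Metrics here is a stand-in for whatever statsd-style client or library you end up using, not a specific API:

    def process_upload(upload)
      started = Time.now
      transcode_video(upload)        # placeholder for the actual work
      Metrics.increment("uploads.processed")
    ensure
      elapsed_ms = ((Time.now - started) * 1000).round
      Metrics.timing("uploads.processing_time_ms", elapsed_ms)
    end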

Without having a proper system to collect metrics in place, you'll be blind. You'll have no idea what's going on inside your application at any given time. Since Coda Hale's talk "Metrics Everywhere" at CodeConf and the release of his metrics library for Scala, an abundance of libraries for different languages has popped up left and right. They make it easy to include timers, counters, and other types of metrics in your application, allowing you to instrument code where you see fit. Independently, Twitter has led the way by releasing Ostrich, their own Scala library to collect metrics. The tools are here for you. Use them.

The most important metrics should be easily accessible on some sort of dashboard. You don't need a big fancy screen in your office right away; a canonical place, e.g. a website including the most important graphs and numbers, where everyone can go and see what's going on at a glance is a good start. Once you have that in place, the next step towards a company-visible dashboard is simply buying a big-ass screen.

All metrics should be collected in a tool like Ganglia, Munin or something else. These tools make analysis of historical data easy, they allow you to make predictions or correlate the metrics gathered in your applications to other statistics like CPU, memory usage, I/O waits, and so on.

The importance of monitoring and metrics cannot be stressed enough. There's no reason why you shouldn't have it in place. Setting up Munin is easy enough, setting up collection using an external service like New Relic or Scout is usually even easier.

Use Timeouts Everywhere

Latency is your biggest enemy in any networked environment. It creeps up on you like the shadow of the setting sun. There's a whole bunch of reasons why, e.g. database queries will suddenly see a spike in execution time, or external services suddenly take forever to answer even the simplest requests.

If your code doesn't have appropriate timeouts, requests will pile up and maybe never return, exhausting available resources (think connection pools) faster than Vettel does a lap in Monte Carlo.

Amazon, for example, has internal contracts. Going to their home page involves dozens of requests to internal services. If any one of them doesn't respond in a timely manner, say 300 ms, the application serving the page will render a static snippet instead, thereby decreasing the chance of selling something, directly affecting business value.

You need to treat every call to an external resource as something that can take forever, something that potentially blocks an application server process forever. When an application server process or thread is blocked, it can't serve any other client. When all processes and threads lock up waiting for a resource, your website is dead.

Timeouts make sure that resources are freed and made available again after a grace period. When a database query takes longer than usual, not only does your application need to know how to handle that case, but your database does too. If your application has a timeout but your database happily keeps sorting those millions of records in a temp file on disk, you didn't gain a lot. If two dependent resources are within your hands, both need to be aware of contracts and timeouts, and both need to properly free resources when the request couldn't be served in a timely manner.
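
A minimal sketch using Ruby's standard Net::HTTP; the hostname and path are made up, and the exact timeout error classes vary a bit between Ruby versions:

    require "net/http"
    require "timeout"

    def fetch_profile(user_id)
      http = Net::HTTP.new("api.example.com", 443)
      http.use_ssl      = true
      http.open_timeout = 2   # seconds to establish the connection
      http.read_timeout = 2   # seconds to wait for the response
      http.get("/users/#{user_id}")
    rescue Timeout::Error, Errno::ECONNREFUSED => e
      nil   # the caller decides what to show when the data isn't there in time
    end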

Use timeouts everywhere, but know how to handle them when they occur, know what to tell the user when his request didn't return quickly enough. There is no golden rule for what to do with a timeout, it depends not just on your application, but on the specific use case.

Don't Rely on SLAs

The best service fails at some point. It will fail in the most epic way possible, not allowing any user to do anything. This doesn't have to be your service. It can be any service you directly or indirectly rely on.

Say your code runs on Heroku. Heroku's infrastructure runs on Amazon's EC2. Therefore Heroku is prone to problems with EC2. If a provider like Heroku tells you they have a service level agreement in place that guarantees a minimum amount of availability per month or per year, that's worth squat to you, because they in turn rely on other external services that may or may not offer different SLAs. This is not specific to Heroku, it's just an obvious example. Just because you outsourced infrastructure doesn't mean you're allowed to stop caring.

If your application runs directly on EC2, you're bound by the same problem. The same is true for any infrastructure provider you rely on, even a big hosting company where your own server hardware is colocated.

They all have some sort of SLA in place, and they all will screw you over with the terms of said SLA. When stuff breaks on their end, that SLA is not worth a single dime to you, even when you were promised to get your money back. It will never make up for lost revenue, for lost users and decreased uptime on your end. You might as well stop thinking about them in the first place.

What matters is what procedures any provider you rely on has in place in case of a failure. The important thing for you as one of their users is to not be left standing in the rain when your hosting world is coming close to an end. A communicative provider is more valuable than one that guarantees an impossible amount of availability. Things will break, not just for you. SLAs give you that false sense of security, the sense that you can blame an outage on someone else.

For more on this topic, Ben Black has written a two-part series aptly named "Service Level Disagreements".

Know Your Database

You should know what happens inside your database when you execute any query. Period. You should know where to look when a query takes too long, and you should know what commands to use to analyze why it takes too long.

Do you know how an index is built? How and why your database picks one index over another? Why selecting a random record based on the wrong criteria will kill your database?

You should know these things. You should read "High Performance MySQL", or "Oracle Internals", or "PostgreSQL 9.0 High Performance". Sorry, I didn't mean to say you should, I meant you must read them.
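
The tools for that are built right into the database. A minimal sketch, assuming a Rails app on MySQL with hypothetical table and column names; EXPLAIN tells you whether an index is used or the whole table gets scanned:

    result = ActiveRecord::Base.connection.select_all(
      "EXPLAIN SELECT * FROM users WHERE email = 'someone@example.com'"
    )
    result.each { |row| puts row.inspect }   # look at the key and rows columns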

Love Your Log Files

In case of an emergency, a good set of log files will mean the world to you. This doesn't just include the standard set of log files available on a Unix system. It includes your application and all services involved too.

Your application should log important events, anything that may seem useful to analyze an incident. Again, you'll never get this right the first time around, you'll never know up front all the details you may be interested in later. Adapt and improve, add more logging as needed. Your application should also allow you to tune the log verbosity at runtime, either by using a feature switch or by accepting a Unix signal.

Separate request logging from application logging. Data on HTTP requests is just as important as application logs, but it's easier if you can sift through them independently. They're also a lot easier to aggregate for services like Syslog or Loggly when they're on their own.

For you Rails developers out there: using Rails.logger is not an acceptable logging mechanism. All your logged statements will be intermingled with Rails' next to unusable request logging output. Use a separate log file for anything that's important to your application.
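
A minimal sketch of a dedicated application log using Ruby's standard Logger, with the verbosity tunable at runtime through a Unix signal; the file path and the logged event are made up:

    require "logger"

    APP_LOGGER = Logger.new("log/app_events.log")
    APP_LOGGER.level = Logger::INFO

    # `kill -USR1 <pid>` toggles debug logging without restarting the process.
    trap("USR1") do
      APP_LOGGER.level =
        APP_LOGGER.level == Logger::DEBUG ? Logger::INFO : Logger::DEBUG
    end

    APP_LOGGER.info("payment captured order_id=1234 amount_cents=4200")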

Just like you should stick metrics on all things that are important to your business, log additional information when things get out of hand. Correlating log files with metrics gathered on your servers and in your application is an incredibly powerful way of analyzing incidents, even long after they occurred.

Learn the Unix Command Line

In case of a failure, the command line will be your best friend. Knowing the right tools to quickly sift through a set of log files, being able to find and set certain kernel parameters to adjust TCP settings, knowing how to get the most important system statistics with just a few commands, and knowing where to look for a specific service's configuration. All these things are incredibly valuable in case of a failure.

Knowing your way around a Unix or Linux system, even with just a basic toolset, is something that will make your life much easier, not just in operations, but also as a developer. The more tools you have at your disposal, the easier it will be for you to automate tasks, to not be scared of operations in general.

In times of an emergency, you can't afford to argue that your favorite editor is not installed on a system, you use what's available.

At Scale, Everything Breaks

Working at large scale is nothing anyone should strive for, it's a terrible burden, but an incredibly fascinating one. The need for scalability evolves over time, it's nothing you can easily predict or assume without knowing all the details, parameters and the future. Out of all three, at least one is 100% guesswork.

The larger your infrastructure setup gets, the more things will break. The more servers you have, the larger the number of servers that are unavailable at any given time. That's nothing you need to deal with right from the get-go, but it's something to keep in mind.

No service that's working at a larger scale was originally designed for it. The code and infrastructure were adapted, the services grew over time, and they failed a lot. Something to think about when you reach for that awesome scalable database before even having any running code.

Embrace Failure

The bottom line of everything is, stuff breaks, everything breaks at different scale. Embrace breakage and failure, it will help you learn and improve your knowledge and skill set over time. Analyze incidents using the data available to you, fix the problem, learn your lesson, and move on.

Don't embrace one thing though: never let a failure happen again if you know what caused it the first time around.

Web operations is not solely related to servers and installing software packages. Web operations involves everything required to keep an application available, and your code needs to play along.

Required Reading

As 101s go, this is a short overview of what I think makes up a good starter set of operations skills. If you don't believe or trust me (which is a good thing), here's a list of further reading for you. By now, I consider most of these required reading even. The list isn't long, mind you. The truth as of today is still that you learn the most from personal experience on production systems. Both require one basic skill though: you have to want to learn.

  • Release It! - A must read, that's all I can say. It's an incredible resource stemming from years of production experience. A must read, no excuses.
  • Web Operations: Keeping the Data on Time - The best summary on all things operations available today. If you read one book, read this, and the previous one (that's two books, I know)
  • Pragmatic Project Automation (oldie, but goldie, this book was an eye-opener to me)
  • The Art of Capacity Planning
  • Building Scalable Websites
  • High Performance MySQL, 2nd. Edition
  • How Complex Systems Fail
  • On Designing and Deploying Internet-Scale Services