
Twitter's New Search Architecture

If we have done a good job then most of you shouldn’t have noticed that we launched a new backend for search on twitter.com during the last few weeks! One of our main goals, but also biggest challenges, was a smooth switch from the old architecture to the new one, without any downtime or inconsistencies in search results. Read on to find out what we changed and why.

Twitter’s real-time search engine was, until very recently, based on the technology that Summize originally developed. This is quite amazing, considering the explosive growth that Twitter has experienced since the Summize acquisition. However, scaling the old MySQL-based system had become increasingly challenging.

The new technology

About 6 months ago, we decided to develop a new, modern search architecture that is based on a highly efficient inverted index instead of a relational database. Since we love Open Source here at Twitter we chose Lucene, a search engine library written in Java, as a starting point.

Our demands on the new system are immense: with over 1,000 TPS (Tweets/sec) and 12,000 QPS (queries/sec) = over 1 billion queries per day (!), we already put a very high load on our machines. As we want the new system to last for several years, the goal was to support at least an order of magnitude more load.
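The QPS-to-daily-volume figure above is just the per-second rate multiplied out over a day; a quick sanity check:

```java
// Sanity check of the query volume quoted above: 12,000 queries/sec
// sustained over a full day crosses the one-billion-queries mark.
public class QueryVolume {
    public static void main(String[] args) {
        long qps = 12_000;                   // queries per second, from the post
        long secondsPerDay = 24L * 60 * 60;  // 86,400 seconds in a day
        long queriesPerDay = qps * secondsPerDay;
        System.out.println(queriesPerDay);   // 1036800000, just over a billion
    }
}
```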

Twitter is real-time, so our search engine must be too. In addition to these scalability requirements, we also need to support extremely low indexing latencies (the time it takes between when a Tweet is tweeted and when it becomes searchable) of less than 10 seconds. Since the indexer is only one part of the pipeline a Tweet has to make it through, we needed the indexer itself to have sub-second latency. Yes, we do like challenges here at Twitter! (btw, if you do too: @JoinTheFlock!)

Modified Lucene

Lucene is great, but in its current form it has several shortcomings for real-time search. That’s why we rewrote big parts of the core in-memory data structures, especially the posting lists, while still supporting Lucene’s standard APIs. This allows us to use Lucene’s search layer almost unmodified. Some of the highlights of our changes include:

  • significantly improved garbage collection performance
  • lock-free data structures and algorithms
  • posting lists that are traversable in reverse order
  • efficient early query termination
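To make the last two bullets concrete, here is a minimal sketch (not Twitter's actual code; the class and method names are hypothetical) of a posting list that can be walked in reverse insertion order. Each posting keeps a link to the posting appended before it, so a reader can start at the newest entry and walk backwards, stopping as soon as it has enough hits:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a reverse-traversable posting list.
// Postings live in an append-only array; each posting stores the index of the
// previously appended posting, forming a backwards-linked chain from newest
// to oldest.
class ReversePostingList {
    private final List<int[]> postings = new ArrayList<>(); // {docId, prevIndex}
    private int newest = -1; // index of the most recently appended posting

    // A single writer appends new documents in arrival order.
    public void add(int docId) {
        postings.add(new int[] { docId, newest });
        newest = postings.size() - 1;
    }

    // Collect up to 'limit' doc IDs, newest first, terminating early once the
    // limit is reached instead of scanning the entire list.
    public List<Integer> newestFirst(int limit) {
        List<Integer> result = new ArrayList<>();
        for (int i = newest; i != -1 && result.size() < limit; i = postings.get(i)[1]) {
            result.add(postings.get(i)[0]);
        }
        return result;
    }
}
```

Since Tweets are appended roughly in chronological order, reverse traversal yields the most recent matches first, which is exactly the order a real-time search wants; a query for the newest N results can then terminate after N hits rather than scoring the whole list.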

We believe that the architecture behind these changes involves several interesting topics that pertain to software engineering in general (not only search). We hope to continue to share more on these improvements.

And, before you ask, we’re planning on contributing all these changes back to Lucene; some of them have already made it into Lucene’s trunk and its new realtime branch.

Benefits

Now that the system is up and running, we are very excited about the results. We estimate that we’re only using about 5% of the available backend resources, which means we have a lot of headroom. Our new indexer could also index roughly 50 times more Tweets per second than we currently get! And the new system runs extremely smoothly, without any major problems or instabilities (knock on wood).

But you might wonder: fine, it’s faster, and you guys can scale it further, but will there be any benefits for the users? The answer is definitely yes! The first difference you might notice is the bigger index, which now reaches twice as far back in time -- without making searches any slower. And, maybe most importantly, the new system is extremely versatile and extensible, which will allow us to build cool new features faster and better. Stay tuned!

The engineers who implemented the search engine are: Michael Busch, Krishna Gade, Mike Hayes, Abhi Khune, Brian Larson, Patrick Lok, Samuel Luckenbill, Jake Mannix, Jonathan Reichhold.