Twitter, an Evolving Architecture

Posted by Abel Avram on Jun 26, 2009

Evan Weaver, Lead Engineer in the Services Team at Twitter, whose primary job is optimization and scalability, talked at QCon London 2009 about Twitter’s architecture, and especially about the optimizations performed over the last year to improve the web site.

Most of the tools used by Twitter are open source. The stack is made up of Rails for the front end, C, Scala and Java for the middle business layer, and MySQL for storing data. Everything is kept in RAM and the database is just a backup. The Rails front end handles rendering, cache composition, DB querying and synchronous inserts. This front end mostly glues together several client services, many written in C: a MySQL client, a Memcached client, a JSON one, and others.

The middleware uses Memcached, Varnish for page caching, and Kestrel, an MQ written in Scala; a Comet server is in the works, also written in Scala, to be used by clients that want to track a large number of tweets.

Twitter started as a “content management platform not a messaging platform”, so many optimizations were needed to move from the initial model, based on aggregated reads, to the current messaging model, where all users need to be updated with the latest tweets. The changes were made in three areas: cache, MQ, and the Memcached client.

Cache

Each tweet is tracked on average by 126 users, so there is clearly a need for caching. In the original configuration, only the API had a page cache, which was invalidated each time a tweet came in from a user; the rest of the application was cacheless:

[Image: the original architecture, with a page cache in front of the API only]

The first architectural change was to create a write-through Vector Cache containing arrays of tweet IDs, which are serialized 64-bit integers. This cache has a 99% hit rate.
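A minimal Ruby sketch of the write-through idea (the names are hypothetical, and a plain Hash stands in for memcached): the same code path that persists a tweet also updates the cached ID vector, so reads never have to rebuild it from the database.

    # Illustrative write-through vector cache; an Array stands in for the tweets table.
    class VectorCache
      def initialize
        @cache = {}   # memcached stand-in: user_id => [tweet IDs, newest first]
        @db    = []   # database stand-in: rows of {id:, user_id:, text:}
      end

      # Write path: persist the tweet AND update the cached vector in one step,
      # so the cache never has to be rebuilt from the database on a read.
      def add_tweet(user_id, tweet_id, text)
        @db << { id: tweet_id, user_id: user_id, text: text }
        (@cache[user_id] ||= []).unshift(tweet_id)
      end

      # Read path: the vector of IDs comes straight from the cache.
      def timeline_ids(user_id)
        @cache.fetch(user_id, [])
      end
    end

    vc = VectorCache.new
    vc.add_tweet(1, 1001, "first")
    vc.add_tweet(1, 1002, "second")
    p vc.timeline_ids(1)   # => [1002, 1001]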

The second change was adding another write-through Row Cache containing database records: users and tweets. This one has a 95% hit rate and it is using Nick Kallen’s Rails plug-in called Cache Money. Nick is a Systems Architect at Twitter.

The third change was introducing a read-through Fragment Cache containing serialized versions of the tweets accessed by API clients, packaged as JSON, XML or Atom, with the same 95% hit rate. The fragment cache “consumes the vectors directly, and if a serialized fragment is currently cached it doesn’t load the actual row for the tweet you are trying to see so it short-circuits the database the vast majority of times”, said Evan.
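The short-circuit Evan describes can be sketched as a read-through lookup, again with hypothetical names and a Hash standing in for memcached: the fragment key is derived from a tweet ID taken from the vector cache, and only a miss falls back to loading and serializing the row.

    # Illustrative read-through fragment cache.
    FRAGMENTS = {}                 # memcached stand-in: "json:<id>" => serialized tweet

    def load_row(tweet_id)         # stand-in for the database/row-cache lookup
      { id: tweet_id, text: "tweet #{tweet_id}" }
    end

    # Read-through: serve the serialized fragment if present; otherwise load the
    # row, serialize it once, cache it, and serve it from the cache from then on.
    def fragment_for(tweet_id, format: :json)
      key = "#{format}:#{tweet_id}"
      FRAGMENTS[key] ||= load_row(tweet_id).to_s   # to_s stands in for JSON/XML/Atom rendering
    end

    ids = [1002, 1001]                   # IDs taken from the vector cache
    p ids.map { |id| fragment_for(id) }  # a second call for the same IDs never touches load_row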

Yet another change was creating a separate cache pool for the page cache. According to Evan, the page cache pool uses a generational key scheme rather than direct invalidation because clients can

send HTTP if-modified-since requests and put any timestamp they want in the request path, and we need to slice the array and present them with only the tweets they want to see, but we don’t want to track all the possible keys that the clients have used. There was a big problem with this generational scheme because it didn’t delete all the invalid keys. Each page that was added, corresponding to the number of tweets people were receiving, would push out valid data in the cache, and it turned out that our cache only had a 5 hour effective lifetime because of all these page caches flowing through.

When the page cache was moved into its own pool, the cache misses dropped about 50%.
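One way to picture such a generational scheme, as a rough sketch with hypothetical names: every page-cache key embeds a per-user generation number, so “invalidating” a timeline just bumps the generation; the old keys are never referenced again and simply age out of the separate pool.

    # Illustrative generational page-cache keys.
    PAGE_POOL   = {}              # the separate pool dedicated to page caches
    GENERATIONS = Hash.new(0)     # user_id => current generation number

    def page_key(user_id, request_path)
      "page:#{user_id}:gen#{GENERATIONS[user_id]}:#{request_path}"
    end

    def cached_page(user_id, request_path)
      PAGE_POOL[page_key(user_id, request_path)] ||= yield
    end

    # On a new tweet there is no need to enumerate every key a client may have used
    # (if-modified-since timestamps, arbitrary request paths, ...): bumping the
    # generation orphans all of them at once.
    def invalidate_pages(user_id)
      GENERATIONS[user_id] += 1
    end

    cached_page(1, "/statuses/friends_timeline.json") { "rendered page v1" }
    invalidate_pages(1)
    p cached_page(1, "/statuses/friends_timeline.json") { "rendered page v2" }  # => "rendered page v2"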

This is the current cache scheme employed by Twitter:

[Image: Twitter’s current cache scheme]

Since 80% of Twitter’s traffic comes through the API, there are two additional levels of cache, each serving up to 95% of the requests coming from the preceding layer. The overall cache changes, between 20 and 30 optimizations in total, brought a

10x capacity improvement, and it would have been more but we hit another bottleneck at that point … Our strategy was to add the read-through cache first, make sure it invalidates OK, and then move to a write-through cache and repair it online rather than destroying it every time a new tweet ID comes in.
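That migration order can be illustrated as two modes of the same cache, in the hypothetical sketch below: in read-through mode a new tweet simply deletes the cached vector, while in write-through mode it repairs the vector online instead of destroying it.

    # Illustrative contrast between the two steps of the migration.
    VECTORS = {}                        # memcached stand-in: user_id => cached ID vector

    def rebuild_from_db(user_id)        # stand-in for the expensive MySQL query
      [900, 899, 898]                   # pretend these IDs came from the database
    end

    def on_new_tweet(user_id, tweet_id, mode:)
      case mode
      when :read_through    # step one: easy to get right, but every new tweet causes a miss
        VECTORS.delete(user_id)
      when :write_through   # step two: repair the cached vector online instead of dropping it
        (VECTORS[user_id] ||= rebuild_from_db(user_id)).unshift(tweet_id)
      end
    end

    def timeline_ids(user_id)
      VECTORS[user_id] ||= rebuild_from_db(user_id)
    end

    on_new_tweet(1, 901, mode: :write_through)
    p timeline_ids(1)   # => [901, 900, 899, 898], served without another rebuild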

Message Queue

Since each user has, on average, 126 followers, there are 126 messages placed in the queue for each tweet. Besides that, there are times when traffic peaks, as during Obama’s inauguration, when it reached several hundred tweets per second, or tens of thousands of messages going into the queue, three times the normal traffic at that time. The MQ is meant to absorb the peak and disperse it over time so that lots of extra hardware does not have to be added. Twitter’s MQ is simple: it is based on the Memcached protocol, there is no ordering of jobs and no shared state between servers, everything is kept in RAM, and it is transactional.
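A toy illustration of that fan-out, with hypothetical names and a plain Ruby Array standing in for Kestrel: one incoming tweet becomes one queued job per follower, and consumers drain the queue at their own pace, so a burst of tweets is absorbed by the queue instead of by extra front-end hardware.

    # Illustrative fan-out: one tweet, ~126 follower deliveries.
    delivery_queue = []

    def publish_tweet(queue, tweet_id, follower_ids)
      follower_ids.each { |fid| queue << { tweet_id: tweet_id, follower_id: fid } }
    end

    publish_tweet(delivery_queue, 1001, (1..126).to_a)
    puts delivery_queue.size            # => 126 jobs queued for a single tweet

    # A consumer drains the queue at its own pace, smoothing out traffic spikes.
    until delivery_queue.empty?
      job = delivery_queue.shift
      # deliver job[:tweet_id] to the timeline of job[:follower_id] here
    end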

The first implementation of the MQ used Starling, written in Ruby, and did not scale well, especially because Ruby’s GC is not generational. That led to MQ crashes, because at some point the entire queue processing stopped while the GC finished its job. The decision was made to port the MQ to Scala, which uses the more mature JVM GC. The current MQ is only 1,200 lines of code and runs on 3 servers.

Memcached Client

The Memcached client optimization was intended to optimize cluster load. The client currently used is libmemcached, with Twitter being its most important user and a contributor to the code base. Based on it, one year of Fragment Cache optimization led to a 50x increase in page requests served per second.

[Image: page requests served per second over the year of Fragment Cache optimization]

Because of poor request locality, the fastest way to deal with requests is to precompute data and store it in network RAM, rather than recompute it on each server when needed. This approach is used by the majority of Web 2.0 sites, which run almost entirely out of memory. The next step is “scaling writes, after scaling reads for one year. Then comes the multi co-location issue”, according to Evan.

The slides of the QCon presentation have been published on Evan’s site.

///

Twitter is by far the largest Ruby on Rails application; page views grew from zero to several million within a few months, and Twitter is now 10,000% faster than it was earlier this year.

Platform
Ruby on Rails 
Erlang 
MySQL 
Mongrel 
Munin 
Nagios 
Google Analytics 
AWStats 
Memcached

Stats
Thousands upon thousands of users; the actual number is kept secret
600 requests per second
An average of 200-300 connections per second, with a peak of 800 connections
MySQL handles 2,400 requests per second
180 Rails instances, using Mongrel as the web server
1 MySQL server (one big 8-core box) and 1 slave for read-only statistics and reporting
30+ processes for handling the remaining work
8 Sun X4100s
Rails processes a request in 200 milliseconds
The average time spent in the database is 50-100 milliseconds
Over 16 GB of memcached

Architecture
1. They ran into very common scalability problems.
2. At first Twitter had no monitoring, no graphs, and no statistics, which made it very hard to troubleshoot problems. Munin and Nagios were added later. Using the tools on Solaris was a bit difficult. They had Google Analytics, but the pages weren’t loading, so it wasn’t much use.
3. Memcached is used heavily for caching:
- For example, if computing a count is very slow, you can throw the count into memcached and get it back in about a millisecond (see the sketch after this list).
- Getting the status of your friends is complicated, with security and other issues, so a friend’s status is pushed into the cache when it is updated instead of being fetched with a query. It never touches the database.
- ActiveRecord objects are huge, so they are not cached. Twitter stores the critical attributes in a hash and lazy-loads the rest when they are accessed.
- 90% of requests are API requests, so no page or fragment caching is done on the front end. The pages are so time-sensitive that it wouldn’t be effective, but Twitter does cache API requests.
4. Messaging
- Messaging is used heavily. Producers produce messages, which are put on a queue and then distributed to consumers. Twitter’s main function is to be a messaging bridge between different formats (SMS, web, IM, and so on).
- They used DRb, which stands for Distributed Ruby, a library that lets you send and receive messages from remote Ruby objects over TCP/IP, but it was a bit fragile.
- They moved to Rinda, a shared queue using the tuplespace model, but the queues are persistent and messages are lost on failure.
- They tried Erlang.
- They moved to Starling, a distributed queue written in Ruby.
- Distributed queues are made to survive system crashes by writing them to disk. Other large websites use this simple approach as well.
5. SMS is handled through a third-party gateway’s API, which is very expensive.
6. Deployment
- Twitter does a review and then rolls out the new Mongrel servers; there is still no graceful way to do it.
- An internal error is thrown at the user if a Mongrel server is being replaced.
- All the servers are killed at once. A rolling blackout isn’t used, because message-queue state is kept in the Mongrels, which would leave the remaining Mongrels clogged up.
7. Abuse
- The system would often go down because people madly added anyone and everyone as a friend, 9,000 friends in 24 hours, which would bring the site down.
- Build tools to detect these problems, so you can find out when and where they are happening.
- Ruthlessly delete these users.
8. Partitioning
- Partitioning is planned for the future; it isn’t done yet. The changes made so far have been enough.
- The partitioning plan is based on time, not on users, because most requests are very local in time.
- Partitioning will be hard because of memoization. Twitter can’t guarantee that read-only operations are really read-only; it is possible to end up writing to a read-only slave, which is bad.
9. Twitter’s API traffic is 10 times that of the Twitter site.
- The API is the most important thing Twitter has done.
- Keeping the service simple lets developers build, on top of Twitter’s infrastructure, ideas that are better than anything Twitter could come up with on its own. For example, Twitterrific is a beautiful way to use Twitter.
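As a rough illustration of the count-caching point in item 3 above (all names are hypothetical, with a Hash standing in for memcached): the expensive count is computed only on a cache miss and is then served from memory for a short period.

    # Illustrative count caching.
    COUNTS = {}                           # memcached stand-in: key => [value, expires_at]

    def expensive_follower_count(user_id)
      sleep 0.05                          # stands in for a slow COUNT(*) query
      126
    end

    # Serve the count from the cache for 60 seconds; recompute only on a miss or expiry.
    def follower_count(user_id)
      key = "follower_count:#{user_id}"
      value, expires_at = COUNTS[key]
      return value if value && expires_at > Time.now
      value = expensive_follower_count(user_id)
      COUNTS[key] = [value, Time.now + 60]
      value
    end

    puts follower_count(42)   # slow: falls through to the "database"
    puts follower_count(42)   # fast: served straight from the cache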

Lessons learned
1. Talk to the community. Don’t hide and try to solve every problem yourself. Plenty of smart people are willing to help if you ask.
2. Treat your scaling plan like a business plan: assemble a group of advisors to help you.
3. Build it yourself. Twitter spent a lot of time trying other people’s solutions that seemed like they should work, but failed. It is better to build some things yourself, so that you at least control them and can build in the features you need.
4. Build in per-user limits. People will try to bring your system down. Put in reasonable limits and detection mechanisms to protect the system from being killed.
5. Don’t make the database the central bottleneck. Not everything needs a huge join; cache data and consider other creative ways to get the result. A good example is discussed in Twitter, Rails, Hammers, and 11,000 Nails per Second.
6. Make your application easy to partition from the start, so you always have a way to scale the system.
7. Recognize when your system is slow and add reporting right away to track the problems.
8. Optimize the database:
- Index everything; Rails won’t do this for you.
- Use EXPLAIN to see how your queries actually run; the indexes may not be used the way you expect.
- Denormalize heavily. For example, Twitter stores the user ID and the friend IDs together, which avoids a lot of expensive joins (see the sketch after this list).
9. Cache everything. Individual ActiveRecord objects are not currently cached, as the lookups are already fast enough.
10. Test everything:
- You want to know that everything works before you deploy.
- Twitter now has a complete test suite, so when the caches fail Twitter can find the problems before going live.
11. Use exception notification and exception logging to be alerted to errors immediately, so you can address them the right way.
12. Don’t do stupid things:
- Scaling changes what is stupid.
- Trying to load 3,000 friends into memory at once can bring a server down, while it worked fine when there were only 4 friends.
13. Most of the performance comes not from the language but from the application design.
14. Make your site an open service by creating an API. Twitter’s API is a big reason for its success. It allows users to build an ever-expanding ecosystem. You can never do all the work your users can do, and you probably won’t be as creative, so open up your system and make it easy for others to integrate their applications with yours.
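A small sketch of the denormalization point in lesson 8 (hypothetical names, with in-memory stand-ins for the tables): keeping the friend IDs on the user record means assembling a timeline needs only key lookups rather than a join across a friendships table.

    # Illustrative denormalization: friend IDs are stored with the user.
    USERS = {
      1 => { name: "alice", friend_ids: [2, 3] }
    }
    TWEETS_BY_USER = {
      2 => [{ id: 10, text: "hello" }],
      3 => [{ id: 11, text: "world" }]
    }

    # Timeline assembly becomes two simple lookups instead of an expensive join.
    def timeline(user_id)
      USERS.fetch(user_id)[:friend_ids].flat_map { |fid| TWEETS_BY_USER.fetch(fid, []) }
    end

    p timeline(1)   # => tweets from users 2 and 3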

 
