Scalability.Rules.notes
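The book lists 50 key practices for designing, developing, deploying, and operating easily scalable web applications. Many of them are already well known and show up in many web applications, for example database and table sharding, caching, load balancing (LB), and asynchronous messaging.
The authors have all worked at large internet companies, and in the final chapter they classify all 50 rules by importance and implementation cost. Below I excerpt some of the more common rules and group them by type, for example design or operations.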
Design principles
- Rule 1—Don’t Over-engineer the Solution — what fits is what is best
- What: Guard against complex solutions during design.
- How to use: Resist the urge to over-engineer solutions by testing ease of understanding with fellow engineers.
- Why: Complex solutions are costly to implement and have excessive long-term costs.
- Key takeaways: Systems that are overly complex limit your ability to scale. Simple systems are more easily and cost effectively maintained and scaled.
- Rule 2—Design Scale into the Solution (D-I-D Process) — considering scale at the design stage is the cheapest and most efficient
- How to use:
+ Design for 20x capacity.
+ Implement for 3x capacity.
+ Deploy for ~1.5x capacity.
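The three multipliers above are simple arithmetic on current load. A minimal sketch, assuming load is measured as peak requests per second; the helper name `did_targets` and the sample figure of 1000 are illustrative, not from the book:

```python
# Multipliers taken from the rule; "deploy" is approximate (~1.5x) in the source.
DID_MULTIPLIERS = {"design": 20.0, "implement": 3.0, "deploy": 1.5}


def did_targets(current_peak_rps: float) -> dict[str, float]:
    """Return the capacity each D-I-D phase should cover for a given current peak."""
    return {phase: current_peak_rps * factor for phase, factor in DID_MULTIPLIERS.items()}


# Hypothetical example: a service that peaks at 1000 requests per second today.
targets = did_targets(1000)
print(targets)
```

The gap between the design figure and the deploy figure is where the saving comes from: the expensive steps (building and buying) are sized for much less than the design allows.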
Architecture design
- Rule 43—Communicate Asynchronously As Much As Possible — use a reliable message queue
- What: Use asynchronous instead of synchronous communication as often as possible.
- When to use: Consider for all calls between services and tiers.
- Why: Synchronous calls stop the entire program’s execution waiting for a response, which ties all the services and tiers together resulting in cascading failures.
- Key takeaways: Use asynchronous communication techniques to ensure that each service and tier is as independent as possible. This allows the system to scale much farther than if all components are closely coupled together.
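As a rough illustration of the decoupling described in Rule 43, the sketch below has the caller hand work to a queue and return at once, while a separate worker processes it. The in-process `queue.Queue` is only a stand-in for a real message queue, and `place_order` and `start_worker` are hypothetical names:

```python
import queue
import threading


def start_worker(jobs: queue.Queue, results: list) -> threading.Thread:
    """Start a background consumer that processes jobs until it sees None."""

    def run() -> None:
        while True:
            job = jobs.get()
            if job is None:  # sentinel: stop the worker
                jobs.task_done()
                break
            results.append(f"processed {job}")
            jobs.task_done()

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


def place_order(jobs: queue.Queue, order_id: str) -> str:
    """The caller only enqueues; it does not wait for the downstream tier."""
    jobs.put(order_id)
    return "accepted"


jobs: queue.Queue = queue.Queue()
results: list = []
worker = start_worker(jobs, results)
status = place_order(jobs, "order-1")
jobs.put(None)  # ask the worker to stop
jobs.join()     # wait here only so the example finishes deterministically
print(status, results)
```

If the worker is slow or down, `place_order` still returns, which is the property that keeps a failure in one tier from cascading into the callers.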
- Rule 29—Failing to Design for Rollback Is Designing for Failure — reducing failures improves availability
- What: Always have the ability to roll back code.
- Key takeaways: Don’t accept that the application is too complex or that you release code too often as excuses that you can’t roll back. No sane pilot would take off in an airplane without the ability to land, and no sane engineer would roll code that they could not pull back off in an emergency.
- Rule 37—Never Trust Single Points of Failure — increases availability and can also improve the user's network experience
- What: Never implement and always eliminate single points of failure.
- When to use: During architecture reviews and new designs.
- How to use: Identify single instances on architectural diagrams. Strive for active/active configurations.
- Why: Maximize availability through multiple instances.
- Key takeaways: Strive for active/active rather than active/passive solutions. Use load balancers to balance traffic across instances of a service. Use control services with active/passive instances for patterns that require singletons.
Web development and design
- Rule 21—Use Expires Headers — cache in client
- What: Use Expires headers to reduce requests and improve the scalability and performance of your system.
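For Rule 21, a minimal sketch of computing an Expires header value in the HTTP date format. The function name `expires_header` and the one-day lifetime are assumptions for illustration; how the header is attached to a response depends on the web server or framework in use:

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def expires_header(now: datetime, lifetime: timedelta) -> dict[str, str]:
    """Build an Expires header that lets clients cache the response for `lifetime`."""
    # format_datetime with usegmt=True requires a timezone-aware UTC datetime.
    expires_at = now.astimezone(timezone.utc) + lifetime
    return {"Expires": format_datetime(expires_at, usegmt=True)}


# Hypothetical example: a static asset served on 1 January 2015, cacheable for one day.
headers = expires_header(datetime(2015, 1, 1, tzinfo=timezone.utc), timedelta(days=1))
print(headers)
```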
- Rule 22—Cache Ajax Calls — cache is king; cache in multiple tiers
- Rule 23—Leverage Page Caches
- Rule 24—Utilize Application Caches
- Rule 25—Make Use of Object Caches
- Rule 26—Put Object Caches on Their Own “Tier” — don’t abuse cache
- Rule 4—Reduce DNS Lookups — the Velocity conference has presented many similar techniques for improving web page response time
- What: Reduce the number of DNS lookups from a user perspective.
- When to use: On all Web pages where performance matters.
- How to use: Minimize the number of DNS lookups required to download pages, but balance this with the browser’s limitation for simultaneous connections.
- Why: DNS lookups take a great deal of time, and large numbers of them can amount to a large portion of your user experience.
- Rule 39—Ensure You Can Wire On and Off Functions — through file or database configuration, a service can be turned on or off for users, similar to invisible indexes in Oracle 11g
- What: Create a framework to disable and enable features of your product.
- When to use: Risky, very high use, or shared services that might otherwise cause site failures when slow to respond or unavailable.
- How to use: Develop shared libraries to allow automatic or on-demand enabling and disabling of services.
- Why: Graceful failure (or handling failures) of transactions can keep you in business while you recover from the incident and problem that caused it.
- Key takeaways: Implement Wire On/Wire Off frameworks whenever the cost of implementation is less than the risk and associated cost of failure. Work to develop shared libraries that can be reused to lower the cost of future implementation.
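A wire on/wire off framework for Rule 39 can be very small. The sketch below is one possible shape, assuming an in-memory switch table (a real one might be loaded from a file or database, as the note above suggests); `FeatureSwitches`, `render_product_page`, and the "recommendations" feature are hypothetical names:

```python
class FeatureSwitches:
    """In-memory switch table; a stand-in for file- or database-backed configuration."""

    def __init__(self, initial: dict[str, bool]) -> None:
        self._enabled = dict(initial)

    def is_on(self, name: str) -> bool:
        # Unknown features default to off so that a missing entry fails safe.
        return self._enabled.get(name, False)

    def wire_on(self, name: str) -> None:
        self._enabled[name] = True

    def wire_off(self, name: str) -> None:
        self._enabled[name] = False


def render_product_page(switches: FeatureSwitches, fetch_recommendations) -> dict:
    """Build the page, skipping the risky shared service when it is wired off."""
    page = {"product": "widget"}
    if switches.is_on("recommendations"):
        page["recommendations"] = fetch_recommendations()
    return page


switches = FeatureSwitches({"recommendations": True})
with_feature = render_product_page(switches, lambda: ["gadget"])
switches.wire_off("recommendations")  # e.g. during an incident in that service
without_feature = render_product_page(switches, lambda: ["gadget"])
print(with_feature, without_feature)
```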
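- Rule 40—Strive for Statelessness — stateless web applications; I don't know web development well; perhaps the cost of storing session information on the application server side is too high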
- What: Design and implement stateless systems.
- When to use: During design of new systems and redesign of existing systems.
- How to use: Choose stateless implementations whenever possible. If stateful implementations are warranted for business reasons, refer to Rules 41 and 42.
- Why: The implementation of state limits scalability and increases cost.
- Rule 41—Maintain Sessions in the Browser When Possible
- What: Try to avoid session data completely, but when needed, consider putting the data in users’ browsers.
- When to use: Anytime that you need session data for the best user experience.
- How to use: Use cookies to store session data on the users’ browsers.
- Why: Keeping session data on the users’ browsers allows the user request to be served by any Web server in the pool and takes the storage requirement away from your system.
- Key takeaways: Using cookies to store session data is a common approach and has advantages in terms of ease of scale but also has some drawbacks. One of the most critical cons is that unsecured cookies
- Rule 42—Make Use of a Distributed Cache for States — session information can be persisted to the database, with a highly available cache layer between the application servers and the database
- What: Use a distributed cache when storing session data in your system.
- When to use: Anytime you need to store session data and cannot do so in users’ browsers.
- How to use: Watch for some common mistakes such as a session management system that requires affinity of a user to a Web server.
- Why: Careful consideration of how to store session data can help ensure your system will continue to scale.
- Key takeaways: Many Web servers or languages offer simple server-based session management, but these are often fraught with problems such as user affiliation with specific servers. Implementing a distributed cache will allow you to store session data in your system and continue to scale.
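The affinity problem in Rule 42 goes away once every web server reads and writes sessions through the same store. A minimal sketch, assuming a plain in-memory class as a stand-in for a real distributed cache; `SessionStore` and `WebServer` are hypothetical names:

```python
import copy


class SessionStore:
    """Stand-in for a distributed cache shared by all web servers."""

    def __init__(self) -> None:
        self._data: dict[str, dict] = {}

    def get(self, session_id: str) -> dict:
        return copy.deepcopy(self._data.get(session_id, {}))

    def put(self, session_id: str, session: dict) -> None:
        self._data[session_id] = copy.deepcopy(session)


class WebServer:
    """Keeps no session state of its own, so any server can handle any user."""

    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self.store = store

    def add_to_cart(self, session_id: str, item: str) -> list:
        session = self.store.get(session_id)
        session.setdefault("cart", []).append(item)
        self.store.put(session_id, session)
        return session["cart"]


store = SessionStore()
server_a = WebServer("a", store)
server_b = WebServer("b", store)
server_a.add_to_cart("session-1", "book")
cart = server_b.add_to_cart("session-1", "pen")  # a different server, same session
print(cart)
```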
Network implementation
- Rule 20—Leverage CDNs — balance benefit against cost
- What: Use CDNs to offload traffic from your site.
- When to use: Ensure it is cost justified and then choose which content is most suitable.
- How to use: Most CDNs leverage DNS to serve content on your site’s behalf.
- Why: CDNs help offload traffic spikes and are often economical ways to scale parts of a site’s traffic.
- Key takeaways: CDNs are a fast and simple way to offset spikiness of traffic as well as traffic growth in general. Ensure you perform a cost-benefit analysis and monitor the CDN usage.
- Rule 6—Use Homogenous Networks
- What: Don’t mix the vendor networking gear.
- When to use: When designing or expanding your network.
- How to use:
+ Do not mix different vendors’ networking gear (switches and routers).
+ Buy best of breed for other networking gear (firewalls, load balancers, and so on).
- Why: Intermittent interoperability and availability issues simply aren’t worth the potential cost savings.
- Key takeaways: Heterogeneous networking gear tends to cause availability and scalability problems. Choose a single provider.
Database
- Rule 31—Be Aware of Costly Relationships — foreign key (FK) constraints place some limits on operations such as splitting databases and tables
- What: Be aware of relationships in the data model.
- When to use: When designing the data model, adding tables/columns, or writing queries consider how the relationships between entities will affect performance and scalability in the long run.
- How to use: Think about database splits and possible future data needs as you design the data model.
- Why: The cost of fixing a broken data model after it has been implemented is likely 100x as much as fixing it during the design phase.
- Key takeaways: Think ahead and plan the data model carefully. Consider normalized forms, how you will likely split the database in the future, and possible data needs of the application.
- Rule 14—Use Databases Appropriately — choose a traditional relational database?
- What: Use relational databases when you need ACID properties to maintain relationships between your data. For other data storage needs consider more appropriate tools.
- How to use: Consider the data volume, amount of storage, response time requirements, relationships, and other factors to choose the most appropriate storage tool.
- Why: RDBMSs provide great transactional integrity but are more difficult to scale, cost more, and have lower availability than many other storage options.
- Key takeaways: Use the right storage tool for your data. Don’t get lured into sticking everything in a relational database just because you are comfortable accessing data in a database.
- Rule 7—Design to Clone Things (X Axis) — (replication: one writer, N readers)
- What: Typically called horizontal scale, this is the duplication of services or databases to spread transaction load.
- How to use:
+ Simply clone services and implement a load balancer.
+ For databases, ensure the accessing code understands the difference between a read and a write.
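For the database half of Rule 7, "understanding the difference between a read and a write" can be as simple as a router in the data access layer. This is a rough sketch under strong assumptions: statements are classified only by their first keyword, replication lag is ignored, and `ReplicatedDatabase` and the node names are hypothetical:

```python
import itertools


class ReplicatedDatabase:
    """Send writes to the single primary and spread reads across the replicas."""

    def __init__(self, primary: str, replicas: list[str]) -> None:
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # simple round-robin

    def route(self, sql: str) -> str:
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb == "SELECT":
            return next(self._replicas)
        return self.primary  # INSERT, UPDATE, DELETE, DDL, and anything unknown


db = ReplicatedDatabase("primary", ["replica-1", "replica-2"])
routes = [
    db.route("SELECT name FROM users WHERE id = 1"),
    db.route("UPDATE users SET name = 'bob' WHERE id = 1"),
    db.route("select id FROM users"),
    db.route("SELECT id FROM users"),
]
print(routes)
```

Treating every unknown statement as a write is the conservative choice: a read sent to the primary only costs capacity, while a write sent to a replica would be lost or rejected.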
- Rule 8—Design to Split Different Things (Y Axis) — split databases by function or service
- Rule 9—Design to Split Similar Things (Z Axis) — split tables
- What: This is often a split by some unique aspect of the customer such as customer ID, name, geography, and so on.
- When to use: Very large, similar data sets such as large and rapidly growing customer bases.
- How to use: Identify something you know about the customer, such as customer ID, last name, geography, or device and split or partition both data and services based on that attribute.
- Why: Rapid customer growth exceeds other forms of data growth or you have the need to perform “fault isolation” between certain customer groups as you scale.
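A Z-axis split keyed on customer ID can be sketched with a modulo over the shard count. This is only one of several possible partitioning schemes (ranges or lookup tables are others), and `shard_for`, `group_by_shard`, and the sample IDs are illustrative assumptions:

```python
def shard_for(customer_id: int, shard_count: int) -> int:
    """Map a numeric customer ID to one of `shard_count` shards."""
    return customer_id % shard_count


def group_by_shard(customer_ids: list[int], shard_count: int) -> dict[int, list[int]]:
    """Show which customers would land together on each shard."""
    shards: dict[int, list[int]] = {index: [] for index in range(shard_count)}
    for customer_id in customer_ids:
        shards[shard_for(customer_id, shard_count)].append(customer_id)
    return shards


# Hypothetical customer IDs spread over four shards.
placement = group_by_shard([1000, 1001, 1002, 1003, 1004], 4)
print(placement)
```

A plain modulo moves most customers when the shard count changes, so this form suits a fixed number of shards; the fault isolation benefit mentioned above comes from each customer touching only one shard.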
- Rule 35—Don’t Select Everything
- What: Don’t use Select * in queries.
- When to use: Never select everything (unless of course you are going to use everything).
- How to use: Always declare what columns of data you are selecting or inserting in a query.
- Why: Selecting everything in a query is prone to break things when the table structure changes and it transfers unneeded data.
- Key takeaways: Don’t use wildcards when selecting or inserting data.
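The breakage described in Rule 35 is easy to reproduce. The sketch below uses Python's built-in sqlite3 module with a made-up `users` table: after a column is added, the explicit query keeps its shape, while the wildcard query and the insert without a column list both change behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "alice"))

# The table structure changes later.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

# Explicit columns: same shape as before the change.
explicit_row = conn.execute("SELECT id, name FROM users").fetchone()

# Wildcard: the row silently grows, which breaks code that unpacks two values.
wildcard_row = conn.execute("SELECT * FROM users").fetchone()

# Insert without a column list: now supplies too few values and is rejected.
try:
    conn.execute("INSERT INTO users VALUES (2, 'bob')")
    positional_insert_ok = True
except sqlite3.Error:
    positional_insert_ok = False

print(explicit_row, wildcard_row, positional_insert_ok)
```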
Data center implementation
- Rule 12—Scale Out Your Data Centers
- What: Design your systems to have three or more live data centers to reduce overall cost, increase availability, and implement disaster recovery.
- When to use: Any rapidly growing business that is considering adding a disaster recovery (cold site) data center.
- How to use: Split up your data to spread across data centers and spread transaction load across those data centers in a “multiple live” configuration. Use spare capacity for peak periods of the year.
Operations
- Rule 30—Discuss and Learn from Failures
- What: Leverage every failure to learn and teach important lessons.
- How to use: Employ a postmortem process and hypothesize failures in low failure environments.
- Why: We learn best from our mistakes—not our successes.
- Key takeaways: Never let a good failure go to waste. Learn from every one and identify the technology, people, and process issues that need to be corrected.
- Rule 27—Learn Aggressively — Rules 27 and 30 both aim to keep an incident from becoming a mistake
- When to use: Be constantly learning from your mistakes as well as successes.
- How to use: Watch your customers or use A/B testing to determine what works. Use postmortems to learn from incidents and problems in production.
- Why: Doing something without measuring the results or having an incident without learning from it are wasted opportunities that your competitors are taking advantage of.
- Rule 16—Actively Use Log Files — for DW or troubleshooting
- What: Use your application’s log files to diagnose and prevent problems.
- When to use: Put a process in place that monitors log files and forces people to take action on issues identified.
- How to use: Use any number of monitoring tools from custom scripts to Splunk to watch your application logs for errors. Export these and assign resources for identifying and solving the issue.
- Why: The log files are excellent sources of information about how your application is performing for your users; don’t throw this resource away without using it.
- Key takeaways: Make good use of your log files and you will have fewer production issues with your system.
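At the "custom script" end of the range mentioned in Rule 16, a log monitor can start as a few lines that count errors per component. The log line layout below (timestamp, level, component, message) and the name `count_errors` are assumptions for illustration; a real log format would need its own pattern:

```python
import re
from collections import Counter

# Assumed layout: "<date> <time> <LEVEL> <component> <message>"
ERROR_LINE = re.compile(r"\bERROR\s+(?P<component>\S+)")


def count_errors(lines) -> Counter:
    """Count ERROR lines per component so the noisiest one can be assigned first."""
    counts: Counter = Counter()
    for line in lines:
        match = ERROR_LINE.search(line)
        if match:
            counts[match.group("component")] += 1
    return counts


sample_log = [
    "2015-01-01 12:00:00 INFO checkout order placed",
    "2015-01-01 12:00:01 ERROR checkout payment timeout",
    "2015-01-01 12:00:02 ERROR search index unavailable",
    "2015-01-01 12:00:03 ERROR checkout payment timeout",
]
error_counts = count_errors(sample_log)
print(error_counts.most_common())
```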
- Rule 29—Failing to Design for Rollback Is Designing for Failure — reducing failures improves availability
- Rule 49—Design Your Application to Be Monitored
- What: Think about how you will need to monitor your application as you are designing it.
- When to use: Anytime you are adding or changing modules of your code base.
- How to use: Build hooks into your system to record transaction times.
- Why: Having insight into how your application is performing will help answer many questions when there is a problem.
- Key takeaways: Adopt as an architectural principle that your application must be monitored. Additionally, look at your overall monitoring strategy to make sure you are first answering the question of “Is there a problem?” and then the “Where” and “What.”
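One way to "build hooks into your system to record transaction times" for Rule 49 is a decorator around each transaction entry point. A minimal sketch, assuming timings are kept in an in-memory dict (a real system would ship them to its monitoring tool); `timed`, `TIMINGS`, and `checkout` are hypothetical names:

```python
import functools
import time

# Transaction name -> list of durations in seconds.
TIMINGS: dict[str, list[float]] = {}


def timed(name: str):
    """Decorator that records how long each call of the wrapped function takes."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the duration even when the transaction raises.
                TIMINGS.setdefault(name, []).append(time.perf_counter() - start)

        return wrapper

    return decorator


@timed("checkout")
def checkout(cart: list[int]) -> int:
    return sum(cart)


total = checkout([5, 7])
print(total, TIMINGS)
```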
- Rule 47—Purge, Archive, and Cost-Justify Storage — for OLTP databases, archiving data helps keep the cache hit rate up
- What: Match storage cost to data value, including removing data of value lower than the costs to store it.
- Why: Not all data is created equal (that is, of the same value) and in fact it often changes in value over time. Why then should we have a single storage solution with equivalent cost for that data?
- Key takeaways: It is important to understand and calculate the value of your data and to match storage costs to that value. Don’t pay for data that doesn’t have a stakeholder return.
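Stateless web applications; I don't know web development well; perhaps the cost of storing session information on the application server side is too high
Anything stateful cannot be freely scaled out or migrated on the fly, because the associated state would be lost. For a web application, session information is state, the data stored in the database is state, and the data stored in a distributed cache is also state. Once the application's processing logic is separated from its state, the application becomes much easier to migrate, move, and scale out. The precondition, of course, is that wherever the state is stored has enough capacity and efficiency to support this.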
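Reducing failures improves availability
The point of this rule is not to reduce failures. The key is that the application should be able to fail freely; that is, the architecture as a whole should have good fault tolerance, so that the failure of a single point does not grow into the failure of the whole system. Then the failure of a single server, a single rack, or even a single data center will not noticeably affect the users of the service, and user requests can be switched automatically at the LB layer to other hosts that provide the same service.
The implicit assumption here is that state (the keyword in the previous reply) can be handled the same way. In practice, when the state itself requires very high consistency, this cannot be satisfied; or rather, we have to go back to CAP as the starting point and discuss whether to sacrifice A or P to meet the need for high C.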