技术分享——NOSQL手段篇 -

dongbao100

浏览: 34464 次
性别:
来自: 沈阳

最近访客更多访客>>

lanyan_lan

luyongfugx

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

技术分享——NOSQL手段篇

博客分类：

nosql

nosql

二. 手段篇

1. 一致性哈希

要从分布式架构的发展说起。

(1). 第一阶段

考虑到单服务器不能承载，因此使用了分布式架构，最初的算法为hash() mod n, hash() 通常取用户ID ，n 为节点数。此方法容易实现且能够满足运营要求。

缺点

当单点发生故障时，系统无法自动恢复。

(2). 第二阶段

为了解决单点故障，使用 hash() mod (n/2) ，这样任意一个用户都有2 个服务器备选，可由client 随机选取。由于不同服务器之间的用户需要彼此交互，所以所有的服务器需要确切的知道用户所在的位置。因此用户位置被保存到memcached 中。当一台发生故障，client 可以自动切换到对应backup ，由于切换前另外1 台没有用户的session ，因此需要client 自行重新登录。

缺点

· 负载不均衡，尤其是单台发生故障后剩下一台会压力过大。

· 不能动态增删节点

· 节点发生故障时需要client 重新登录

(3). 第三阶段

打算去掉硬编码的hash() mod n 算法，改用一致性哈希(consistent hashing) 分布。假如采用Dynamo 中的strategy 1 ，我们把每台server 分成v 个虚拟节点，再把所有虚拟节点(n*v) 随机分配到一致性哈希的圆环上，这样所有的用户

从自己圆环上的位置顺时针往下取到第一个vnode 就是自己所属节点。当此节点存在故障时，再顺时针取下一个作为替代节点。

优点

发生单点故障时负载会均衡分散到其他所有节点，程序实现也比较优雅。

(4). 亚马逊的现状 - 内部网络延迟

aw2.0 公司的Alan Williamson 撰写了一篇报道，主要是关于他在Amazon EC2 上的体验的，他抱怨说，Amazon 是公司唯一使用的云提供商，看起来它在开始时能够适应得很好，但是有一个临界点:

在开始的日子里Amazon 的表现非常棒。实例在几分钟内启动，几乎没有遇到任何问题，即便是他们的小实例(SMALL INSTANCE) 也很健壮，足以支持适当使用的MySQL 数据库。在20 个月内，Amazon 云系统一切运转良好，不需要任何的关心和抱怨。然而，在最后的八个月左右，他们" 盔甲" 内的漏洞开始呈现出来了。第一个弱点前兆是，新加入的Amazon SMALL 实例的性能出现了问题。根据我们的监控，在服务器场中新添加的机器，与原先的那些相比性能有所下降。开始我们认为这是自然出现的怪现象，只是碰巧发生在" 吵闹的邻居"(Noisy Neighbors) 旁边。根据随机法则，一次快速的停机和重新启动经常就会让我们回到" 安静的邻居" 旁边，那样我们可以达到目的。然而，在最后的一两个月中，我们发现，甚至是这些" 使用高级CPU 的中等实例" 也遭受了与小实例相同的命运，其中，新的实例不管处于什么位置，看起来似乎都表现得一样。经过调查，我们还发现了一个新问题，它已经悄悄渗透到到Amazon 的世界中，那就是内部网络延迟。

(5). 算法的选择

不同的哈希算法可以导致数据分布的不同位置，如果十分均匀，那么一次MapReduce 就涉及节点较多，但热点均匀，方便管理。反之，热点不均，会大致机器效率发挥不完全。

2.Quorum NRW

(1). 结构图

· N: 复制的节点数量

· R: 成功读操作的最小节点数

· W: 成功写操作的最小节点数

只需W + R > N ，就可以保证强一致性。

(2).N 、W 、R 参数与CAP

· 第一个关键参数是N ，这个N 指的是数据对象将被复制到N 台主机上。N 在实例级别配置，协调器将负责把数据复制到N-1 个节点上。N 的典型值设置为3 。

· 复制中的一致性，采用类似于Quorum 系统的一致性协议实现。这个协议有两个关键值:R 与W 。R 代表一次成功的读取操作中最小参与节点数量，W 代表一次成功的写操作中最小参与节点数量。R+W>N ，则会产生类似quorum 的效果。该模型中的读( 写) 延迟由最慢的 R(W) 复制决定，为得到比较小的延迟，R 和W 有的时候的和又设置比N 小。

· 如果N 中的1 台发生故障，Dynamo 立即写入到preference list 中下一台，确保永远可写入。

· 如果W+R>N ，那么分布式系统就会提供强一致性的保证，因为读取数据的节点和被同步写入的节点是有重叠的。在一个RDBMS 的复制模型中(Master/slave) ，假如N=2 ，那么W=2 、R=1 此时是一种强一致性，但是这样造成的问题就是可用性的减低，因为要想写操作成功，必须要等 2 个节点都完成以后才可以。

· 在分布式系统中，一般都要有容错性，因此一般N 都是大于3 的。此时根据CAP 理论，一致性、可用性和分区容错性最多只能满足两个，那么我们就需要在一致性和分区容错性之间做一平衡，如果要高的一致性，那么就配置N=W 、R=1 ，这个时候可用性就会大大降低。如果想要高的可用性，那么此时就需要放松一致性的要求，此时可以配置W=1 ，这样使得写操作延迟最低，同时通过异步的机制更新剩余的N-W 个节点。

· 当存储系统保证最终一致性时，存储系统的配置一般是W+R<=N ，此时读取和写入操作是不重叠的，不一致性的窗口就依赖于存储系统的异步实现方式，不一致性的窗口大小也就等于从更新开始到所有的节点都异步更新完成之间的时间。

· (N,R,W) 的值典型设置为(3, 2 ,2) ，兼顾性能与可用性。R 和W 直接影响性能、扩展性、一致性。如果W 设置为1 ，则一个实例中只要有一个节点可用，也不会影响写操作。如果R 设置为1 ，只要有一个节点可用，也不会影响读请求。R 和W 值过小则影响一致性，过大也不好，这两个值要平衡。对于这套系统的典型的SLA 要求99.9% 的读写操作在300ms 内完成。

· 无论是Read-your-writes-consistency 、Session consistency 、Monotonic read consistency ，它们都通过黏贴(stickiness) 客户端到执行分布式请求的服务器端来实现的，这种方式简单是简单，但是它使得负载均衡以及分区容错变的更加难于管理，有时候也可以通过客户端来实现Read-your-writes-consistency 和Monotonicread consistency 。此时需要对写的操作的数据加版本号，这样客户端就可以遗弃版本号小于最近看到的版本号的数据。

· 在系统开发过程中，根据CAP 理论，可用性和一致性在一个大型分区容错的系统中只能满足一个，因此为了高可用性，我们必须放低一致性的要求，但是不同的系统保证的一致性还是有差别的，这就要求开发者要清楚自己用的系统提供什么样子的最终一致性的保证，一个非常流行的例子就是web 应用系统，在大多数的web 应用系统中都有" 用户可感知一致性" 的概念，这也就是说最终一致性中的" 一致性窗口" 大小要小于用户下一次的请求，在下次读取操作来之前，数据可以在存储的各个节点之间复制。还比如假如存储系统提供了read-your-write-consistency 一致性，那么当一个用户写操作完成以后可以立马看到自己的更新，但是其它的用户要过一会才可以看到更新。

(3). 几种特殊情况

· W = 1, R = N

对写操作要求高性能高可用。

· R = 1, W = N

对读操作要求高性能高可用，比如类似cache 之类业务。

· W = Q, R = Q where Q = N / 2 + 1

一般应用适用，读写性能之间取得平衡。如N=3,W=2,R=2 。

3.Vector clock

(1). 结构图

(2). 算法说明

vector clock 算法。可以把这个vector clock 想象成每个节点都记录自己的版本信息，而一个数据，包含所有这些版本信息。

例子

· 假设一个写请求，第一次被节点A 处理了。节点A 会增加一个版本信息(A ，1) 。我们把这个时候的数据记做D1(A ，1) 。然后另外一个对同样key( 这一段讨论都是针对同样的key 的) 的请求还是被A 处理了于是有D2(A ，2) 。这个时候，D2 是可以覆盖D1 的，不会有冲突产生。现在我们假设D2 传播到了所有节点(B 和C) ，B 和C 收到的数据不是从客户产生的，而是别人复制给他们的，所以他们不产生新的版本信息，所以现在B 和C 都持有数据D2(A ，2) 。好，继续，又一个请求，被B 处理了，生成数据D3(A ，2;B ，1) ，因为这是一个新版本的数据，被B 处理，所以要增加B 的版本信息。

· 假设D3 没有传播到C 的时候又一个请求被C 处理记做D4(A ，2;C ，1) 。假设在这些版本没有传播开来以前，有一个读取操作，我们要记得，我们的W=1 那么R=N=3 ，所以R 会从所有三个节点上读，在这个例子中将读到三个版本。A 上的D2(A ，2);B 上的D3(A ，2;B ，1);C 上的D4(A ，2;C ，1) 这个时候可以判断出，D2 已经是旧版本，可以舍弃，但是D3 和D4 都是新版本，需要应用自己去合并。

· 如果需要高可写性，就要处理这种合并问题。好假设应用完成了冲入解决，这里就是合并D3 和D4 版本，然后重新做了写入，假设是B 处理这个请求，于是有D5(A ，2;B ，2;C ，1); 这个版本将可以覆盖掉D1-D4 那四个版本。这个例子只举了一个客户的请求在被不同节点处理时候的情况，而且每次写更新都是可接受的，大家可以自己更深入的演算一下几个并发客户的情况，以及用一个旧版本做更新的情况。

· 上面问题看似好像可以通过在三个节点里选择一个主节点来解决，所有的读取和写入都从主节点来进行。但是这样就违背了W=1 这个约定，实际上还是退化到W=N 的情况了。所以如果系统不需要很大的弹性，W=N 为所有应用都接受，那么系统的设计上可以得到很大的简化。Dynamo 为了给出充分的弹性而被设计成完全的对等集群(peer to peer) ，网络中的任何一个节点都不是特殊的。

4.Virtual node

(1). 结构图

虚拟节点，未完成

5.Gossip

Gossip 协议是一个Gossip 思想的P2P 实现。现代的分布式系统经常使用这个协议，他往往是唯一的手段。因为底层的结构非常复杂，而且Gossip 也很有效。Gossip 协议也被戏称为病毒式传播，因为他的行为生物界的病毒很相似。

(1).Gossip (State Transfer Model)

在状态转移到模式下，每个重复节点都保持的一个Vector clock 和一个state version tree 。每个节点的状态都是相同的(based on vector clock comparison), 换句话说，state version tree 包含有全部的冲突updates.

At query time, the client will attach its vector clock and the replica will send back a subset of the state tree which precedes the client's vector clock (this will provide monotonic read consistency). The client will then advance its vector clock by merging all the versions. This means the client is responsible to resolve the conflict of all these versions because when the client sends the update later, its vector clock will precede all these versions.

At update, the client will send its vector clock and the replica will check whether the client state precedes any of its existing version, if so, it will throw away the client's update.

Replicas also gossip among each other in the background and try to merge their version tree together.

(2).Gossip (Operation Transfer Model)

In an operation transfer approach, the sequence of applying the operations is very important. At the minimum causal order need to be maintained. Because of the ordering issue, each replica has to defer executing the operation until all the preceding operations has been executed. Therefore replicas save the operation request to a log file and exchange the log among each other and consolidate these operation logs to figure out the right sequence to apply the operations to their local store in an appropriate order.

"Causal order" means every replica will apply changes to the "causes" before apply changes to the "effect". "Total order" requires that every replica applies the operation in the same sequence.

In this model, each replica keeps a list of vector clock, Vi is the vector clock the replica itself and Vj is the vector clock when replica i receive replica j's gossip message. There is also a Vstate

that represent the vector clock of the last updated state.

When a query is submitted by the client, it will also send along its vector clock which reflect the client's view of the world. The replica will check if it has a view of the state that is later than

the client's view.

When an update operation is received, the replica will buffer the update operation until it can be applied to the local state. Every submitted operation will be tag with 2 timestamp, V-client

indicates the client's view when he is making the update request. V-@receive is the replica's view when it receives the submission.This update operation request will be sitting in the queue until the replica has received all the other updates that this one depends on. This condition is reflected in the vector clock Vi when it is larger than V-client

On the background, different replicas exchange their log for the queued updates and update each other's vector clock. After the log exchange, each replica will check whether certain operation can be applied (when all the dependent operation has been received) and apply them accordingly. Notice that it is possible that multiple operations are ready for applying at the same time, the replica will sort these operation in causal order (by using the Vector clock comparison) and apply them in the right order.

The concurrent update problem at different replica can also happen. Which means there can be multiple valid sequences of operation. In order for different replica to apply concurrent update in

the same order, we need a total ordering mechanism.

One approach is whoever do the update first acquire a monotonic sequence number and late comers follow the sequence. On the other hand, if the operation itself is commutative, then the order to apply the operations doesn't matter

After applying the update, the update operation cannot be immediately removed from the queue because the update may not be fully exchange to every replica yet. We continuously check the Vector clock of each replicas after log exchange and after we confirm than everyone has receive this update, then we'll remove it from the queue.

5.Merkle tree

有数据存储成树状结构，每个节点的Hash 是其所有子节点的Hash 的Hash ，叶子节点的Hash 是其内容的Hash 。这样一旦某个节点发生变化，其Hash 的变化会迅速传播到根节点。需要同步的系统只需要不断查询跟节点的hash ，一旦有变化，顺着树状结构就能够在logN 级别的时间找到发生变化的内容，马上同步。

6.Paxos

paxos 是一种处理一致性的手段，可以理解为事务吧。其他的手段不要Google GFS 使用的Chubby 的Lock service 。我不大喜欢那种重型的设计就不费笔墨了。当规模越来越大的时候...

(1).Master/slave

这个是多机房数据访问最常用的方案，一般的需求用此方案即可。因此大家也经常提到"premature optimization is the root of all evil" 。

优点

利用mysql replication 即可实现，成熟稳定。

缺点

写操作存在单点故障，master 坏掉之后slave 不能写。另外slave 的延迟也是个困扰人的小问题。

(2).Multi-master

Multi-master 指一个系统存在多个master, 每个master 都具有read-write 能力，需根据时间戳或业务逻辑合并版本。比如分布式版本管理系统git 可以理解成multi-master 模式。具备最终一致性。多版本数据修改可以借鉴Dynamo 的vector clock 等方法。

优点

解决了单点故障。

缺点

不易实现一致性，合并版本的逻辑复杂。

(3).Two-phase commit(2PC)

Two-phase commit 是一个比较简单的一致性算法。由于一致性算法通常用神话( 如Paxos 的The Part-Time Parliament 论文) 来比喻容易理解。

示例

某班要组织一个同学聚会，前提条件是所有参与者同意则活动举行，任意一人拒绝则活动取消。用2PC 算法来执行过程如下:

Prepare: 组织者(coordinator) 打电话给所有参与者(participant) ，同时告知参与者列表。

Proposal: 提出周六2pm-5pm 举办活动。

Vote: participant 需vote 结果给coordinator:accept or reject 。

Block: 如果accept ，participant 锁住周六2pm-5pm 的时间，不再接受其他请求。

Commit: 如果所有参与者都同意，组织者coodinator 通知所有参与者commit ，否则通知abort ，participant 解除锁定。

Failure 典型失败情况分析

· Participant failure

任一参与者无响应，coordinator 直接执行abort

· Coordinator failure

Takeover: 如果participant 一段时间没收到cooridnator 确认(commit/abort) ，则认为coordinator 不在了。

这时候可自动成为Coordinator 备份(watchdog)

Query: watchdog 根据phase 1 接收的participant 列表发起query

Vote: 所有participant 回复vote 结果给watchdog, accept or reject

Commit: 如果所有都同意，则commit ，否则abort 。

优点

实现简单。

缺点

所有参与者需要阻塞(block) ，throughput 低；无容错机制，一节点失败则整个事务失败。

(4).Three-phase commit (3PC)

Three-phase commit 是一个2PC 的改进版。2PC 有一些很明显的缺点，比如在coordinator 做出commit 决策并开始发送commit 之后，某个participant 突然crash ，这时候没法abort transaction ，这时候集群内实际上就存在不一致的情况，crash 恢复后的节点跟其他节点数据是不同的。因此3PC 将2PC 的commit 的过程1 分为2 ，分成preCommit 及commit 。

cohorts(participant) 收到preCommit 之后，如果没收到commit ，默认也执行commit 。即图上的timeout cause commit 。如果coodinator 发送了一半preCommit crash ，watchdog 接管之后通过query ，如果有任一节点收到commit ，或者全部节点收到preCommit ，则可继续commit ，否则abort 。

优点

允许发生单点故障后继续达成一致。

缺点

网络分离问题，比如preCommit 消息发送后突然两个机房断开，这时候coodinator 所在机房会abort ，另外剩余replicas 机房会commit 。

(5). 对比

Google Chubby 的作者Mike Burrows 说过，"there is only one consensus protocol, and that ’ s Paxos" – all other approaches are just broken versions of Paxos. 意即" 世上只有一种一致性算法，那就是Paxos" ，所有其他一致性算法都是Paxos 算法的不完整版。相比2PC/3PC ，Paxos 算法的改进:

· P1. 每次Paxos 实例执行都分配一个编号，编号需要递增，每个replica 不接受比当前最大编号小的提案

· P2. 一旦一个value v 被replica 通过，那么之后任何再批准的value 必须是v ，即没有拜占庭将军(Byzantine) 问题。拿上面请客的比喻来说，就是一个参与者一旦accept 周六2pm-5pm 的proposal ，就不能改变主意。以后不管谁来问都是accept 这个value 。

一个proposal 只需要多数派同意即可通过。因此比2PC/3PC 更灵活，在一个2f+1 个节点的集群中，允许有f 个节点不可用。另外Paxos 还有很多约束的细节，特别是Google 的chubby 从工程实现的角度将Paxos 的细节补充得非常完整。比如如何避免Byzantine 问题，由于节点的持久存储可能会发生故障，Byzantine 问题会导致Paxos 算法P2 约束失效。

7.DHT

http://en.wikipedia.org/wiki/Distributed_hash_table

8.Map Reduce Execution

http://zh.wikipedia.org/wiki/MapReduce

9.Handling Deletes

但我们执行删除操作的时候必须非常谨慎，以防丢失掉相应的版本信息。通常我们给一个Object 标注上" 已删除" 的标签。在足够的时间之后，我们在确保版本一致的情况下可以将它彻底删除。回收他的空间。

10. 存储实现

One strategy is to use make the storage implementation pluggable. e.g. A local MySQL DB,Berkeley DB, Filesystem or even a in memory Hashtable can be used as a storage mechanism.

Another strategy is to implement the storage in a highly scalable way. Here are some techniques that I learn from CouchDB and Google BigTable.

CouchDB has a MVCC model that uses a copy-on-modified approach. Any update will cause a private copy being made which in turn cause the index also need to be modified and causing the a private copy of the index as well, all the way up to the root pointer.

Notice that the update happens in an append-only mode where the modified data is appended to the file and the old data becomes garbage. Periodic garbage collection is done to compact the data. Here is how the model is implemented in memory and disks

In Google BigTable model, the data is broken down into multiple generations and the memory is use to hold the newest generation. Any query will search the mem data as well as all the data sets on disks and merge all the return results. Fast detection of whether a generation contains a key can be done by checking a bloom filter.

When update happens, both the mem data and the commit log will be written so that if the

11. 节点变化

Notice that virtual nodes can join and leave the network at any time without impacting the operation of the ring.

When a new node joins the network

· 新加入的节点宣告自己的存在( 广播或者其他手段)

· 他的邻居节点要调整Key 的分配和复制关系。这个操作通常是同步的

· 这个新加入的节点异步的拷贝数据

· 这个节点变化的操作被发布到其他节点

Notice that other nodes may not have their membership view updated yet so they may still forward the request to the old nodes. But since these old nodes (which is the neighbor of the new joined node) has been updated (in step 2), so they will forward the request to the new joined node.

On the other hand, the new joined node may still in the process of downloading the data and not ready to serve yet. We use the vector clock (described below) to determine whether the new joined node is ready to serve the request and if not, the client can contact another replica.

When an existing node leaves the network (e.g. crash)

· The crashed node no longer respond to gossip message so its neighbors knows about it. 崩溃的节点不再发送Gossip Message 的回应，所以他的邻居都知道他是了

· The neighbor will update the membership changes and copy data asynchronously ，他的邻居处理后事，将他的活分给别人干，同时调整节点关系。

We haven't talked about how the virtual nodes is mapped into the physical nodes. Many schemes are possible with the main goal that Virtual Node replicas should not be sitting on the same physical node. One simple scheme is to assigned Virtual node to Physical node in a random manner but check to make sure that a physical node doesn't contain replicas of the same key ranges.

Notice that since machine crashes happen at the physical node level, which has many virtual nodes runs on it. So when a single Physical node crashes, the workload (of its multiple virtual node) is scattered across many physical machines. Therefore the increased workload due to physical node crashes is evenly balanced.

12. 列式存储

(1). 描述

数据库以行、列的二维表的形式存储数据，但是却以一维字符串的方式存储，例如以下的一个表: