Redis Cluster - Alternative 1

28 Apr 2010: Ver 1.0 - initial version

Overview
========

The motivations and design goals of Redis Cluster are already outlined in the
first design document of Redis Cluster. This document is just an attempt to
provide a completely alternative approach in order to explore more ideas.

In this document the alternative explored is a cluster where communication is
performed directly from the client to the target node, without an
intermediate layer.

The intermediate layer can be used, in the form of a proxy, in order to provide
the same functionality to clients not able to directly use the cluster protocol.
So in a first stage clients can use a proxy to implement the hash ring, but
later these clients can switch to a native implementation, following a
specification that the Redis project will provide.

In this new design fault tolerance is achieved by replicating every data node
M-1 times instead of storing the same key M times across nodes.

From the point of view of CAP our biggest sacrifice is about "P", that is
resistance to partitioning. At most M-1 nodes can go down with the cluster
still being functional. Also, when possible, "A" is somewhat sacrificed for
"L", that is, Latency. Latency is not really in the CAP equation but it is a
very important parameter.

Network layout
==============

In this alternative design the network layout is simple, as there are only
clients talking directly to N data nodes. So we can imagine having:

- K Redis clients, directly talking to the data nodes.
- N Redis data nodes, that is, normal Redis instances.

Data nodes are replicated M-1 times (so there are a total of M copies of
every node). If M is 1, the system is not fault tolerant. If M is 2, one
data node can go offline without affecting operations. And so forth.

Hash slots
==========

The key space is divided into 1024 slots.

Given a key, the SHA1 function is applied to it.
The first 10 bits of the SHA1 digest are interpreted as an unsigned integer
from 0 to 1023. This is the hash slot of the key.
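
The slot computation can be sketched in a few lines of Python. This is only an illustration: it assumes the slot is taken from the leading 10 bits of the digest (which is what a 0..1023 range implies) read in big-endian order, and the name hash_slot is hypothetical.

```python
import hashlib

NUM_SLOTS = 1024  # 2**10 slots, as stated in the text


def hash_slot(key: bytes) -> int:
    """Map a key to a hash slot in the range 0..1023."""
    digest = hashlib.sha1(key).digest()  # 20 bytes
    # Assumption: the "first 10 bits" are the top 10 bits of the first
    # two digest bytes, read as a big-endian integer.
    first_two = int.from_bytes(digest[:2], "big")  # 16 bits
    return first_two >> 6  # drop the low 6 bits, keep the top 10


if __name__ == "__main__":
    for name in (b"foo", b"bar", b"user:1000"):
        print(name, hash_slot(name))
```

Any other fixed choice of 10 bits would work equally well, as long as every client and every data node agrees on it.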

Data nodes
==========

Data nodes are normal Redis instances, but a few additional commands are
provided.

    HASHRING ADD ... list of hash slots ...
    HASHRING DEL ... list of hash slots ...
    HASHRING REHASHING slot
    HASHRING SLOTS => returns the list of configured slots
    HASHRING KEYS ... list of hash slots ...

By default Redis instances are configured to accept operations on all
the hash slots. With these commands it's possible to configure a Redis
instance to accept only a subset of the key space.

If an operation is performed against a key hashing to a slot that is not
configured to be accepted, the Redis instance will reply with:

    "-ERR wrong hash slot"

More details on the HASHRING command and its subcommands will be shown later
in this document.

Additionally, three other commands are added:

    DUMP key
    RESTORE key <dump data>
    MIGRATE key host port

DUMP is used to output a very compact binary representation of the data
stored at key.

RESTORE re-creates a value (storing it at key) starting from the output
produced by DUMP.

MIGRATE is like a server-side DUMP+RESTORE command. This atomic command moves
one key from the connected instance to another instance, returning the status
code of the operation (+OK or an error).

The protocol described in this draft only uses the MIGRATE command, but this
in turn will use RESTORE internally when connecting to another server, and
DUMP is provided for symmetry.

Querying the cluster
====================

1) Reading the cluster config
-----------------------------

Clients of the cluster are required to have the cluster configuration loaded
into memory. The cluster configuration is the sum of the following info:

- Number of data nodes in the cluster, for instance, 10
- A map between hash slots and nodes, so for instance:
    hash slot 1 -> node 0
    hash slot 2 -> node 5
    hash slot 3 -> node 3
    ... and so forth ...
- Physical address of nodes, and their replicas.
    node 0 addr -> 192.168.1.100
    node 0 replicas -> 192.168.1.101, 192.168.1.105
- Configuration version: the SHA1 of the whole configuration

The configuration is stored in every single data node of the cluster.

A client without the configuration in memory is required, as a first step, to
read the config. In order to do so the client needs a list of IPs that are,
with good probability, data nodes of the cluster.

The client will try to get the config from all these nodes. If no node is
found responding, an error is reported to the user.
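
As an illustration, the configuration above could be held in memory as a plain dictionary, with the version derived from it. The field names, the JSON serialization, and the function config_version are assumptions made for this sketch; the draft only says the version is the SHA1 of the whole configuration.

```python
import hashlib
import json


def config_version(config: dict) -> str:
    """SHA1 of a canonical serialization of the configuration.

    The serialization format (sorted, compact JSON) is an assumption;
    the text does not specify how the configuration is encoded.
    """
    payload = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# Hypothetical in-memory layout mirroring the example in the text.
config = {
    "nodes": {
        "0": {
            "addr": "192.168.1.100",
            "replicas": ["192.168.1.101", "192.168.1.105"],
        },
    },
    "slots": {"1": 0, "2": 5, "3": 3},  # hash slot -> node id
}

if __name__ == "__main__":
    print(config_version(config))
```

Sorting the keys before hashing makes the version independent of the order in which the fields happen to be stored.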

2) Caching and refreshing the configuration
-------------------------------------------

A client is allowed to cache the configuration in memory or in a different way
(for instance storing the configuration into a file), but every client is
required to check whether the configuration changed at least once every 10
seconds, asking for the configuration version key with a single GET call, and
checking whether the configuration version matches the one loaded in memory.

A client is also required to refresh the configuration every time a node
replies with:

    "-ERR wrong hash slot"

This reply means that hash slots were reassigned in some way.

Checking the configuration every 10 seconds is not required in theory, but it
is a good protection against errors and failures that may happen in real world
environments. It is also very cheap to perform, as an occasional GET operation
has no measurable impact on overall performance.
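
The refresh rules can be sketched as a small client-side cache. Everything here is hypothetical scaffolding: fetch_version stands for the single GET of the configuration version key, fetch_config for a full configuration read, and the clock is injectable only so the logic can be exercised without waiting.

```python
import time

REFRESH_INTERVAL = 10.0  # seconds, from the text


class ConfigCache:
    def __init__(self, fetch_version, fetch_config, clock=time.monotonic):
        self._fetch_version = fetch_version  # hypothetical: GET of the version key
        self._fetch_config = fetch_config    # hypothetical: full config read
        self._clock = clock
        self.config = fetch_config()
        self.version = fetch_version()
        self._last_check = clock()

    def maybe_refresh(self, wrong_slot_error=False):
        """Return True if the configuration was reloaded.

        A reload is forced after a "-ERR wrong hash slot" reply; otherwise
        the version is compared at most once every REFRESH_INTERVAL seconds.
        """
        now = self._clock()
        if not wrong_slot_error and now - self._last_check < REFRESH_INTERVAL:
            return False
        self._last_check = now
        remote = self._fetch_version()
        if not wrong_slot_error and remote == self.version:
            return False
        self.config = self._fetch_config()
        self.version = remote
        return True
```

A client would call maybe_refresh() before each query, and maybe_refresh(wrong_slot_error=True) whenever a node rejects a key.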

3) Read query
-------------

To perform a read query the client hashes the key argument from the command
(in the initial version of Redis Cluster only single-key commands are
allowed). Using the in-memory configuration it maps the hash slot to the
node ID.

If the client is configured to support read-after-write consistency, then
the "master" node for this hash slot is queried.

Otherwise the client picks a random node from the master and the replicas
available.

4) Write query
--------------

A write query is exactly like a read query, with the difference that the
write always targets the master node, never one of the replicas.
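
The node selection rules for reads and writes fit in one small function. This is a sketch under assumed data structures: slot_map maps a hash slot to the master address, and replicas maps a master address to its list of replica addresses; neither layout is prescribed by the draft.

```python
import random


def pick_node(slot_map, replicas, slot, write=False,
              read_after_write=False, rng=random):
    """Choose the node to contact for a key hashing to `slot`."""
    master = slot_map[slot]
    # Writes, and reads needing read-after-write consistency, go to the master.
    if write or read_after_write:
        return master
    # Other reads may be served by the master or by any of its replicas.
    return rng.choice([master] + replicas.get(master, []))


if __name__ == "__main__":
    slot_map = {1: "192.168.1.100"}
    replicas = {"192.168.1.100": ["192.168.1.101", "192.168.1.105"]}
    print(pick_node(slot_map, replicas, 1, write=True))
    print(pick_node(slot_map, replicas, 1))
```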

Creating a cluster
==================

In order to create a new cluster, the redis-cluster command line utility is
used. It gets a list of available nodes and replicas, in order to write the
initial configuration in all the nodes.

At this point the cluster is usable by clients.

Adding nodes to the cluster
===========================

The command line utility redis-cluster is used in order to add a node to the
cluster:

1) The cluster configuration is loaded.
2) A fair number of hash slots are assigned to the new data node.
3) Hash slots moved to the new node are marked as "REHASHING" in the old
   nodes, using the HASHRING command:

    HASHRING SETREHASHING 1 192.168.1.103 6380

   The above command sets hash slot "1" in rehashing state, with the
   "forwarding address" set to 192.168.1.103:6380. As a result, if this node
   receives a query about a key hashing to hash slot 1 that *is not present*
   in the current data set, it replies with:

    "-MIGRATED 192.168.1.103:6380"

   The client can then reissue the query against the new node.

   If instead the hash slot is marked as rehashing but the requested key
   is still there, the query is processed. This allows for non-blocking
   rehashing.

   Note that no additional memory is used by Redis in order to provide such a
   feature.

4) While the hash slot is marked as "REHASHING", redis-cluster asks this node
   for the list of all the keys matching the specified hash slot. Then all
   the keys are moved to the new node using the MIGRATE command.
5) Once all the keys are migrated, the hash slot is deleted from the old
   node configuration with "HASHRING DEL 1", and the configuration is updated.

Using this algorithm all the hash slots are migrated one after the other to
the new node. In a practical implementation, before starting the migration the
redis-cluster utility should write a log into the configuration so that
in case of a crash or any other problem the utility is able to recover from
where it left off.
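
The replies a data node gives while a slot is being rehashed can be modeled as follows. This is a toy model, not Redis code: the node is a hypothetical dictionary holding its configured slots, its rehashing slots with their forwarding addresses, and its data set, and the caller passes in the key's hash slot.

```python
def handle_get(node, key, slot):
    """Reply a data node would give to a read of `key`, which hashes to `slot`."""
    if slot not in node["slots"]:
        return "-ERR wrong hash slot"
    if slot in node["rehashing"] and key not in node["data"]:
        # Key already moved (or never existed here): point at the new node.
        host, port = node["rehashing"][slot]
        return f"-MIGRATED {host}:{port}"
    # Slot is served here and, if rehashing, the key has not moved yet.
    return node["data"].get(key)


if __name__ == "__main__":
    old_node = {
        "slots": {1, 2},
        "rehashing": {1: ("192.168.1.103", 6380)},
        "data": {"still-here": "value"},
    }
    print(handle_get(old_node, "still-here", 1))  # served locally
    print(handle_get(old_node, "moved", 1))       # forwarded
    print(handle_get(old_node, "other", 7))       # slot not configured
```

Because the check is only "is the key present or not", the node needs no per-key bookkeeping during the migration, which matches the remark that no additional memory is used.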

Fault tolerance
===============

Fault tolerance is reached by replicating every data node M-1 times, so that
we have one master and M-1 replicas for a total of M nodes holding the same
hash slots. Up to M-1 nodes can go down without affecting the cluster.

The tricky part about fault tolerance is detecting when a node is failing and
signaling it to all the other clients.

When a master node is failing in a permanent way, promoting the first slave
is easy:

1) At some point a client will notice there are problems accessing a given
   node. It will try to refresh the config, but will notice that the config
   is already up to date.
2) In order to make sure the problem is not about the client connectivity
   itself, it will try to reach other nodes as well. If more than M-1 nodes
   appear to be down, it's either a client networking problem or the cluster
   can't be fixed as too many nodes are down anyway. So no action is taken,
   but an error is reported.
3) If instead only 1 or at most M-1 nodes appear to be down, the client
   promotes a slave to master and writes the new configuration to all the
   data nodes.

All the other clients will see the data node not working, and as a first step
will try to refresh the configuration. They will successfully refresh the
configuration and the cluster will work again.

Every time a slave is promoted, the information is written in a log that is
actually a Redis list, in all the data nodes, so that system administration
tools can detect what happened in order to send notifications to the admin.
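
The decision a client takes in steps 2 and 3 can be sketched as below. The data layout (a group with one master and a list of replicas) and the function name try_failover are assumptions for illustration; writing the new configuration to the data nodes and logging the promotion are left out.

```python
def try_failover(group, down, m):
    """Return the group after a possible promotion.

    group: {"master": addr, "replicas": [addr, ...]} (hypothetical layout)
    down:  set of addresses that appear unreachable from this client
    m:     number of copies of every data node (one master plus M-1 replicas)
    """
    if len(down) > m - 1:
        # Either a client networking problem or too many nodes are gone.
        raise RuntimeError("more than M-1 nodes unreachable: no action taken")
    if group["master"] not in down:
        return group  # master is reachable, nothing to do
    candidates = [r for r in group["replicas"] if r not in down]
    if not candidates:
        raise RuntimeError("no reachable replica to promote")
    new_master = candidates[0]  # promote the first reachable slave
    remaining = [r for r in group["replicas"] if r != new_master]
    return {"master": new_master, "replicas": remaining}


if __name__ == "__main__":
    group = {"master": "A", "replicas": ["B", "C"]}
    print(try_failover(group, {"A"}, m=3))
```

In this sketch the failed master is simply dropped from the group; whether it should later rejoin as a replica is not covered by the text.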

Intermittent problems
---------------------

In the above scenario a master was failing in a permanent way. Now instead
consider a case where a network cable is not working well, so a node
appears to be up for a few seconds and down for a few seconds.

When this happens recovering can be much harder, as a client may notice the
problem and promote a slave to master as a result, but then the host
will be up again and the other clients will not see the problem, writing to
the old master for at most 10 seconds (after 10 seconds all the clients are
required to perform a few GETs to check the configuration version of the
cluster and update if needed).

One way to fix this problem is to delegate the failover mechanism to a
failover agent. When clients notice problems they will not take any active
action but will just log the problem into a Redis list in all the reachable
nodes, wait, check for a configuration change, and retry.

The failover agent constantly monitors these logs: if some client is reporting
a failing node, it can take appropriate actions, checking whether the failure
is permanent or not. If it is not, it can send a SHUTDOWN command to the
failing master, if possible. The failover agent can also assess the problem
better by checking whether the failing node is reported by all the clients or
just a single one, and can check for itself whether there is a real problem
before proceeding with the failover.

Redis proxy
===========

In order to make the switch to the clustered version of Redis simpler, and
because the client-side protocol is non-trivial to implement compared to the
usual Redis client lib protocol (where a minimal lib can be as small as
100 lines of code), a proxy implementing the cluster protocol will be
provided.

Every client will talk to a redis-proxy node that is responsible for using
the new protocol and forwarding back the replies.

In the long run the aim is to switch all the major client libraries to the
new protocol in a native way.

Supported commands
==================

Because with this design we talk directly to data nodes and there is a single
"master" version of every value (that's the big gain of dropping "P" from
CAP!) almost all the Redis commands can be supported by the clustered version,
including MULTI/EXEC and multi-key commands, as long as all the keys hash
to the same hash slot. In order to guarantee this, key tags can be used:
when a specific pattern is present in the key name, only that part is
hashed in order to obtain the hash slot.
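
A key tag extractor might look like the sketch below. The draft does not say which pattern marks a tag, so the curly-brace syntax used here is purely an assumption, as is the function name hashed_part.

```python
def hashed_part(key: str) -> str:
    """Return the part of `key` that should be fed to the hash function.

    Assumption: a tag is the text between the first "{" and the next "}".
    Keys without a non-empty tag are hashed as a whole.
    """
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:  # require at least one character inside the braces
            return key[start + 1:end]
    return key


if __name__ == "__main__":
    # Both keys share the tag "42", so they would land in the same hash slot
    # and could be used together in MULTI/EXEC or a multi-key command.
    print(hashed_part("user:{42}:name"), hashed_part("user:{42}:email"))
```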

Random remarks
==============

- It's still not clear how to perform an atomic election of a slave to master.
- In normal conditions (all the nodes working) this new design is just
  K clients talking to N nodes without intermediate layers, no routes:
  this means it is horizontally scalable with O(1) lookups.
- The cluster should optionally be able to work with manual failover
  for environments where it's desirable to do so. For instance it's possible
  to set up periodic checks on all the nodes, and switch IPs when needed,
  or use other advanced configurations that cannot be the default as they
  are too environment dependent.

A few ideas about client-side slave election
============================================

Detecting failures in a collaborative way
-----------------------------------------

In order to make node failure detection and slave election a distributed
effort, without any "control program" that is in some way a single point
of failure (the cluster will not stop when it stops, but errors are not
corrected without it running), it's possible to use a few consensus-like
algorithms.

For instance all the nodes may keep a list of errors detected by clients.

If Client-1 detects some failure accessing Node-3, for instance a connection
refused error or a timeout, it logs what happened with LPUSH commands against
all the other nodes. These "error messages" will have a timestamp and the node
ID. Something like:

    LPUSH __cluster__:errors 3:1272545939

So if the error is reported many times in a small amount of time, at some
point a client can have enough hints about the need to perform a
slave election.
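
Deciding when there are "enough hints" could be done by counting recent reports per node. The sketch below works on the entries as they would be read back from the error list; the 30 second window and the threshold of 3 reports are arbitrary values chosen for illustration, not values from the draft.

```python
from collections import Counter


def suspected_nodes(entries, now, window=30, threshold=3):
    """Node IDs reported at least `threshold` times in the last `window` seconds.

    entries: strings in the "<node id>:<unix timestamp>" form shown above
    now:     current unix timestamp
    """
    counts = Counter()
    for entry in entries:
        node_id, ts = entry.split(":")
        if now - int(ts) <= window:
            counts[int(node_id)] += 1
    return sorted(n for n, c in counts.items() if c >= threshold)


if __name__ == "__main__":
    log = ["3:1272545939", "3:1272545941", "3:1272545945", "5:1272545940"]
    print(suspected_nodes(log, now=1272545950))
```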

Atomic slave election
---------------------

In order to avoid races when electing a slave to master (that is, in order to
prevent some client from still contacting the old master for that node during
the 10 second time frame), the client performing the election may write
some hint in the configuration, change the configuration SHA1 accordingly and
wait for more than 10 seconds, in order to be sure all the clients will
refresh the configuration before a new access.

The config hint may be something like:

    "we are switching to a new master, that is x.y.z.k:port, in a few seconds"

When a client updates the config and finds such a flag set, it starts to
continuously refresh the config until a change is noticed (this will take
at most 10-15 seconds).

The client performing the election will wait for that famous 10 second time
frame and finally will update the config in a definitive way, setting the
slave as the new master. All the clients at this point are guaranteed to have
the new config, either because they refreshed or because on the next query
their config has already expired and they'll update the configuration.

EOF