Working in a global retail environment poses some interesting availability challenges when you have physical bricks-and-mortar stores. I have been thinking about high availability in this environment for a while now, because of a project I am involved with to harmonise the retail systems used across global groups. It is common for an organisation that has grown through acquisition to end up with different systems in different business units, and after a while it makes sense to move to a common platform.
This article talks about how this architecture could look and how you can support the staggered roll-out of new Point of Sale features to the store whilst still maintaining high availability.
Before I talk about the architecture, I want to describe the end state with a scenario. Imagine a global retail company operating in both North America and Europe. Each territory has around 1000 physical bricks-and-mortar stores. Each store has a number of tills (cash registers for my American friends), between 2 and 5 depending on the size of the store. Each till communicates with systems hosted at a centralised location. These systems consist of web services, caching servers and databases. This is illustrated in the diagram below.
This diagram shows 2 geographic regions. Each Region contains a head office network infrastructure and a store network infrastructure. Both of the global regions are completely separate from each other. There are no shared resources between the two.
A requirement for this architecture is that it be highly available. The stores should always have access to the services. The easiest thing to do is to put a load balancer in front of the web services and have multiple app servers. Then if any app server goes down or exhibits problems, the load balancer stops sending traffic to it and the remaining app servers pick up the slack.
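To make the idea concrete, here is a minimal sketch of a round-robin balancer that skips servers marked as unhealthy. It is only an illustration: the class name, the server names and the mark_down / mark_up methods are my own assumptions, and a real load balancer would run its own health checks rather than being told which servers are down.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out app servers in turn, skipping any marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        # Take a server out of rotation (for example after a failed health check).
        self.healthy.discard(server)

    def mark_up(self, server):
        # Put a known server back into rotation.
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        # Look at each server at most once per call so we never loop forever.
        for _ in range(len(self.servers)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy app servers available")


# Hypothetical server names, purely for illustration.
balancer = RoundRobinBalancer(["app1", "app2", "app3"])
balancer.mark_down("app2")
picks = [balancer.next_server() for _ in range(4)]
print(picks)  # app2 is skipped; traffic alternates between app1 and app3
```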
Although the App Servers described are load balanced and provide a degree of availability, they all sit in one data centre. If that data centre has an outage, or connectivity to it is lost, your retail stores are cut off no matter how many load balanced App Servers you have. The way around this is to have co-location facilities in each region. This is shown in the diagram below.
Each global region has 2 locations. There are 2 ways you can run this configuration.
Active – Active: This is where you run both locations at the same time and load balance between the 2 of them. If one location goes down the other location picks up the traffic.
Active – Passive : In this configuration, you run one location as active and have all your traffic point to this location. The 2nd location is running cold and is only there if location 1 has problems.
My preferred option is to run Active – Active and always have both data centres running. If you run Active – Passive it is very easy for the 2nd location to get out of date if you are not disciplined. If this happens and you need to switch over to the 2nd location, you could cause a big outage as the centralised services may not match up with the client point of sale system in the store.
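As a rough sketch of the Active – Active idea, the function below spreads stores across the two locations by store number and falls back to the other location when the preferred one is unreachable. The location names, the integer store ids and the is_up callback are assumptions made for the example; in practice this decision usually lives in DNS or a global load balancer rather than in application code.

```python
def choose_location(store_id, locations, is_up):
    """Pick a location for a store: its preferred one if reachable, else the next."""
    # Spread stores across locations using a stable function of the store id.
    start = store_id % len(locations)
    ordered = locations[start:] + locations[:start]
    for location in ordered:
        if is_up(location):
            return location
    raise RuntimeError("no location reachable")


# Hypothetical location names for one region.
locations = ["eu-dc1", "eu-dc2"]
down = set()


def is_up(location):
    return location not in down


normal = [choose_location(store, locations, is_up) for store in (7, 8)]
down.add("eu-dc2")  # simulate losing the second data centre
failover = [choose_location(store, locations, is_up) for store in (7, 8)]
print(normal, failover)
```

With both locations up the two stores land on different data centres; once one location is marked down, both are served by the survivor.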
Regardless of whether you’re Active-Active or Active-Passive, you will need to keep the databases synced between the 2 locations. You do this with whatever replication technology is available to your database system. I am currently looking into SQL Server 2012 AlwaysOn Availability Groups (code-named HADRON).
Typical Head Office Deployment
So far I have covered the logical and physical architecture with regard to high availability. Now I want to take a closer look at a typical deployment scenario within one of the locations. I work with Microsoft technologies so this is all swayed towards .NET. There is nothing particularly special about this set-up and it should look very familiar to most developers, but I want to cover it before talking about pre-release and beta testing in a retail environment.
The diagram below shows a typical services deployment scenario. I will start at the top and work my way down. In the top layer you have the databases used by your applications and systems. On the left I have the application databases and on the right the App Fabric monitoring database. The monitoring database is used to monitor the performance of, and errors from, the app servers.
App Fabric is a set of middleware technologies for Windows Server, released by Microsoft. It consists of two main features, Hosting and Caching. The hosting features provide an easy way to deploy and manage Workflow and WCF services, and include an extension to the IIS management tools that enables IIS administrators to monitor the performance of services and workflows. App Fabric caching is an in-memory, distributed cache that runs on one or more on-premises servers to provide a performance and scalability boost for .NET Framework applications. App Fabric caches store data as key-value pairs using the physical memory across multiple servers.
Below the database layer there is a caching layer. The example in the diagram references the App Fabric Cache, but you can use any kind of distributed in-memory cache. The idea with this layer is to store data that is frequently accessed but rarely changed, so that the client can get at it more quickly. The caching servers need a lot of RAM, as the data is held in memory for speed of access.
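One common way to use a layer like this is the cache-aside pattern: look in the cache first, and only go to the database on a miss. The sketch below uses a plain dictionary as a stand-in for the distributed cache and a hypothetical load_product function as a stand-in for the database call; neither is the App Fabric API, and a real cache would also need expiry.

```python
class CacheAside:
    """Cache-aside lookup: read from the cache, fall back to the database on a miss."""

    def __init__(self, load_from_database):
        self._load = load_from_database
        self._cache = {}   # stand-in for a distributed in-memory cache
        self.misses = 0

    def get(self, key):
        if key not in self._cache:
            # Cache miss: fetch from the database and remember the result.
            self.misses += 1
            self._cache[key] = self._load(key)
        return self._cache[key]

    def invalidate(self, key):
        # Call this when the underlying data changes so the next read is fresh.
        self._cache.pop(key, None)


# Hypothetical product data standing in for an application database.
database = {"sku-1": {"name": "Kettle", "price": 25}}


def load_product(sku):
    return dict(database[sku])


products = CacheAside(load_product)
first = products.get("sku-1")    # miss: goes to the database
second = products.get("sku-1")   # hit: served from memory
print(products.misses)
```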
Below the caching servers are the Application Servers. These are the servers that contain the Web Services accessed by the client point of sale application running on the store tills. In this example the servers are Windows Servers running the IIS web server and Windows Server App Fabric. Each App Server is added to a Load Balancer group to distribute the load across the multiple servers. Because of this you need to ensure that each App Server contains identical web services. Whenever I set an environment up like this I always make sure I have an even number of App Servers and plan for double the capacity that I need. This is to facilitate doing a Zero Downtime Deployment, where during a deployment you temporarily drop down to half the server capacity. I explain how the Zero Downtime Deployment model works in my article about Composing Web Services into Layers.
Zero Downtime Deployment is a deployment model where you can deploy software without causing an outage. The method described in that article also gives an easy way to roll back should you encounter a problem with a deployment.
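To illustrate why I plan for double the capacity, here is a small sketch of upgrading the servers one half at a time. This is my own simplification rather than the exact procedure from the other article: the function name, the server names and the deploy callback are all assumptions, and a real deployment would also drain connections and verify each server before returning it to service.

```python
def deploy_in_halves(servers, in_rotation, deploy):
    """Upgrade servers half at a time; return the fewest servers left in rotation."""
    half = len(servers) // 2
    lowest = len(in_rotation)
    for batch in (servers[:half], servers[half:]):
        in_rotation.difference_update(batch)   # take this half out of the balancer
        lowest = min(lowest, len(in_rotation))
        for server in batch:
            deploy(server)                     # install the new version
        in_rotation.update(batch)              # put the upgraded half back
    return lowest


# Hypothetical server names; an even number, sized for double the needed capacity.
servers = ["app1", "app2", "app3", "app4"]
in_rotation = set(servers)
versions = {server: "v1" for server in servers}


def deploy(server):
    versions[server] = "v2"


lowest = deploy_in_halves(servers, in_rotation, deploy)
print(lowest, versions)
```

With four servers the environment never drops below two in rotation, which is why half the servers must be able to carry the full load.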
Piloting and Beta Testing
An important requirement in a retail environment is to be able to pilot changes to your point of sale system. By this I mean when a new version of the POS is ready you want to gradually roll it out to all the stores in a safe and controlled manner. As an example you may want to pilot new features with 10% of the stores first. Then pilot with 25%, followed by 50%, and finally 100% full roll-out.
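One way to choose which stores fall into each stage is to hash the store identifier into a stable bucket from 0 to 99 and compare it with the current pilot percentage. This is a sketch under my own assumptions (the function names and the store id format are made up), but it has a useful property: a store that is in the 10% pilot stays in the pilot at 25%, 50% and 100%.

```python
import hashlib


def rollout_bucket(store_id):
    """Map a store id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(store_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def in_pilot(store_id, percent):
    """True if the store belongs to the pilot at the given roll-out percentage."""
    return rollout_bucket(store_id) < percent


# Hypothetical store ids for one region.
stores = [f"store-{number:04d}" for number in range(1000)]
for percent in (10, 25, 50, 100):
    pilot = [store for store in stores if in_pilot(store, percent)]
    print(percent, len(pilot))
```

The counts will be close to, but not exactly, the target percentages, because the hash only spreads stores approximately evenly.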
In this next section of the article I want to discuss some different deployment patterns that you can use for piloting. Each pattern has its pros and cons. Your choice of piloting pattern will depend on your circumstances, and budgets.
The examples demonstrated below are per location per region. So, for example, the European Region has 2 locations in separate data centres and the ideas below would apply to each region.
In the examples in this article you will notice that even though we use separate sets of app servers and cache servers for managing piloting and high availability, we only have a single set of database instances. It is important that you use your particular database technology's high availability features to mitigate any single points of failure. This may include things like clustering or sharding. Also, when you design your database tables you should ideally design the schema for extension and not modification. This is a little like the Open/Closed principle in object oriented design.
Single Instance to the Side
This is the simplest of the patterns. Here if you want to pilot versions of the Point of Sale system, you deploy the POS to the tills and then point them at a separate instance of the App Server at head office. You also want a separate instance of the memory cache server.
PRO : Easy to Set Up – Out of all of the patterns described here, this is the easiest to set up.
CON : Creates a Single Point of Failure – By having only 1 app server for piloting, with several stores connected to it, you are creating a single point of failure. If you have, say, 20 stores connected to that server and it goes down, then you have 20 stores unable to work.
Small Load Balanced Setup to the Side
This version builds upon the single server to the side model and creates a smaller load balanced environment, at a reduced capacity compared to the main environment. This gives you a level of redundancy in case one of the App Servers goes down.
PRO : Level of Redundancy – By using a couple of servers in a load balanced group you get a degree of redundancy in case one App Server goes down.
CON : Additional Complexity – This is more complicated to set up and run compared to the single server version.
Master / Slave Environment
The Master / Slave environment is a much more complex environment to set up, but it gives you a much more flexible environment to work in with regards to high availability. This version maintains separate versions of the deployed App Server and Caching Servers. The original set of servers is called the Master and the 2nd set is called the Slave. All your retail stores are pointing towards the load balancer for the Master. As you roll out a pre-release version of the Point of Sale system, you deploy it to the Slave environment. Then you start gradually moving stores over to the Slave. Once all the stores have been migrated to the Slave environment, it becomes the new Master, and the previous Master becomes the Slave.
PRO : Level of Redundancy – By having a duplicate deployment environment in each location you are ensuring that your pre-release services are in an identical load balanced environment.
PRO : Slave becomes Master – Once you have completed the migration of tills over to the Slave environment, it becomes the new Master. This makes the deployment process much smoother to manage.
CON : Additional Complexity – This is a much more complex set-up to maintain as you need a duplicate of each App Server and Cache Cluster Group Per Location and Region. This doubles the size of the environment.
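The Master / Slave swap can be sketched as a small routing table that records which environment each store points at, plus a check that only swaps the roles once every store has moved. The class, the environment names env-a and env-b and the store ids are assumptions for illustration only.

```python
class PilotEnvironments:
    """Tracks which environment each store uses and swaps roles when migration ends."""

    def __init__(self, stores):
        # Hypothetical environment names; the Master serves every store at the start.
        self.master, self.slave = "env-a", "env-b"
        self.assignment = {store: self.master for store in stores}

    def migrate(self, store):
        # Point one store at the Slave, which hosts the pre-release version.
        self.assignment[store] = self.slave

    def try_promote(self):
        # Only swap roles once every store has been moved across.
        if any(env != self.slave for env in self.assignment.values()):
            return False
        self.master, self.slave = self.slave, self.master
        return True


envs = PilotEnvironments(["store-001", "store-002"])
envs.migrate("store-001")
too_early = envs.try_promote()   # one store is still on the old Master
envs.migrate("store-002")
promoted = envs.try_promote()    # every store has moved, so the roles swap
print(too_early, promoted, envs.master)
```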
Canary Testing (Champion / Challenger)
The final model I want to present builds upon the Master / Slave scenario. In the previous version you would deploy your pre-release / piloting version to the Slave set-up and then migrate stores over to the Slave. When this is completed, the Slave would become the new Master. In the Canary Testing scenario you deploy the new version to the Slave as before, and then have a routing service that performs a champion / challenger redirect to the desired version. Champion / challenger lets you define a split in traffic so that, for example, 90% of the requests go to the Master and 10% go to the Slave. Over time you adjust the split so that eventually 100% points to the Slave. This works well provided the version of the client software on the tills can work with the services on both the Master and the Slave. If the Slave introduces breaking changes then this model will not work that well. The Master / Slave scenario above works better when there is a completely different version of the client software.
PRO : Level of Redundancy – By having a duplicate deployment environment in each location you are ensuring that your pre-release services are in an identical, load balanced environment.
PRO : Champion / Challenger – Allows you to define the split in traffic going to both the Master and the Slave environment.
CON : Complexity – This is a much more complex set-up to maintain as you need a duplicate of each App Server and Cache Cluster Group Per Location and Region. This doubles the size of the environment. There is also additional complexity in using an A/B Router to split the traffic. You need to ensure this mechanism doesn’t introduce a single point of failure.
CON : Single Client – This only really works if the client software running on your tills can work with the services on the Master and the Slave, i.e. you haven’t introduced any breaking changes.
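The traffic split itself can be as simple as a weighted random choice per request. The sketch below is a minimal illustration, not a production router: make_router and its parameters are hypothetical names, and it routes each request independently, whereas you may prefer to keep a given till pinned to one environment.

```python
import random


def make_router(challenger_percent, rng=None):
    """Return a function that routes each request to "master" or "slave"."""
    if not 0 <= challenger_percent <= 100:
        raise ValueError("challenger_percent must be between 0 and 100")
    rng = rng or random.Random()

    def route():
        # rng.random() is in [0, 1), so 0 never picks the Slave and 100 always does.
        return "slave" if rng.random() * 100 < challenger_percent else "master"

    return route


# Send roughly 10% of requests to the challenger (the Slave environment).
route = make_router(10, random.Random(42))
sample = [route() for _ in range(10_000)]
slave_share = sample.count("slave") / len(sample)
print(round(slave_share, 2))
```

Raising challenger_percent in steps until it reaches 100 gives the gradual shift described above.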
In this article I have talked about potential ways of introducing High Availability into a global retail environment where you have many physical bricks-and-mortar stores. As with any solution, there are multiple ways to solve the same problem, and I am sure there are many ways to solve this one. These are just the thoughts I have been having on it recently. If you have had to solve this kind of retail scaling and availability scenario, I would love to hear in the comments how you tackled it.