Part 2 – What’s going wrong with Rancher 1
In my previous post I talked about how we came to use Rancher as the main orchestration tool for our services.
So what is going wrong, and why did we decide to move to Rancher 2?
The answer is very simple: at Uala we are growing fast and our services keep getting bigger.
A few examples:
– We started with about 20 containers; now we are running about 650
– At the beginning our backend handled about 1,000 requests/minute; now we are at 7,000+ rpm with an average response time of 100 ms
– Our databases grow by about 10% every month
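To put the last number in perspective, here is a minimal Python sketch of what 10% monthly growth compounds to. It assumes the rate stays constant, which is a simplification; the function names are made up for this illustration.

```python
import math


def growth_factor(monthly_rate: float, months: int) -> float:
    """Total size multiplier after `months` of compounding at `monthly_rate`."""
    return (1 + monthly_rate) ** months


def months_to_double(monthly_rate: float) -> float:
    """Number of months needed for the size to double at `monthly_rate`."""
    return math.log(2) / math.log(1 + monthly_rate)


if __name__ == "__main__":
    # Assumption: a steady 10% per month, as in the list above.
    print(f"After 12 months: x{growth_factor(0.10, 12):.2f}")
    print(f"Doubling time: {months_to_double(0.10):.1f} months")
```

At a steady 10% per month a database roughly triples in a year and doubles in a bit more than seven months, so storage planning quickly becomes part of the picture.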
With the help of a great tool (Sematext) we can monitor every container in our infrastructure and the health of every host.
OK, but what is the main issue with Rancher 1? Well, the agents.
Before Kubernetes, Rancher needed to build its own system to manage containers and the network between hosts.
It created an orchestration engine called “Cattle” that works over an IPsec network between agents installed on every host.
Containers, services, and the links between services and hosts all work thanks to these agents, which resolve DNS names to IPs and allow containers to communicate across hosts.
All great, right? Well… if it all works.
The more we grew and added servers and services, the more agents started to fail: traffic became intermittent, and some hosts could not communicate with others until the network agent was restarted.
So right now every one of our systems is monitored not so much to catch a failure in a host or a service, but to catch the moment a network agent fails.
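As a rough illustration of that kind of monitoring, here is a minimal Python sketch that takes the results of host-to-host connectivity probes and points at the hosts whose network agent probably needs a restart. It is not how our monitoring is actually implemented: the probe format, the `suspect_hosts` name, and the 0.75 threshold are all assumptions made up for this example.

```python
from __future__ import annotations

from collections import Counter


def suspect_hosts(
    probes: dict[tuple[str, str], bool], threshold: float = 0.75
) -> list[str]:
    """Return hosts that take part in a high share of failed probes.

    `probes` maps a (source_host, destination_host) pair to True when the
    probe succeeded. A host is a suspect when at least `threshold` of the
    probes it is involved in (as source or destination) failed.
    """
    total: Counter[str] = Counter()
    failed: Counter[str] = Counter()
    for (src, dst), ok in probes.items():
        for host in (src, dst):
            total[host] += 1
            if not ok:
                failed[host] += 1
    return sorted(h for h in total if failed[h] / total[h] >= threshold)


if __name__ == "__main__":
    # Hypothetical probe results: host "c" cannot talk to anyone.
    probes = {
        ("a", "b"): True, ("b", "a"): True,
        ("a", "c"): False, ("c", "a"): False,
        ("b", "c"): False, ("c", "b"): False,
    }
    print(suspect_hosts(probes))
```

Counting failures per host rather than per pair is the point of the sketch: a single broken agent shows up as many failing pairs, but only one host is involved in all of them.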
After some time, Rancher released version 1.6 with some changes to the agents, and at the same time it started supporting Kubernetes.
Why Kubernetes if they already had their own orchestration engine? Because in the meantime Kubernetes had become the de facto standard for container orchestration, and Rancher could not ignore the market.
Did we move to Rancher 1.6 and Kubernetes? No. We were in a period where we were releasing so many features that the infrastructure was not the top priority, and our monitoring systems still worked well enough to guarantee 99.99% uptime.
The most important reason we decided to wait for Rancher 2 is that it is not only “compatible with Kubernetes”: it is a complete tool that sits on top of Kubernetes and uses ALL the k8s functionality without interfering with anything else (as Rancher 1.6 does instead).
Now, after about a year of development, Rancher 2 has become stable and usable, and we are ready to jump to k8s.