It was an important milestone for the company. As a Technology Group, we had to prepare the platform and the product for a hyper-growth phase.
We decided to focus our technology roadmap around 4 streams:
The ability to operate the platform smoothly is critical.
In order to improve operability, we started building monitoring and alerting infrastructure. The outcome was a combination of CloudWatch (logs and metrics), Prometheus and Grafana. Critical alerts are sent to a Slack channel, making it easier for everyone to notice when incidents happen.
Monitoring not only helped us diagnose issues but also gave us a better understanding of the runtime behaviour of our systems. Thanks to the increased visibility of internals, we could start tuning the JVM and V8 virtual machines, with significant performance gains. An improved garbage collector configuration for the Elasticsearch cluster enabled consistent performance even with a small overall number of data nodes. Thanks to a better awareness of memory consumption across the board, we reduced our infrastructure footprint and significantly lowered our running costs.
We also started making better use of the AWS ecosystem, moving from Beanstalk to ECS and increasing utilisation efficiency.
Throughput and latency
Our content pipeline ingests 3 million documents a day. We want to surface content in near real time, with no more than a few seconds of delay. To achieve that consistently, we started measuring document latency along the ingestion pipeline. The metric (measured as the age of the oldest message in a queue at sampling time) helped us identify bottlenecks and areas for architectural improvement.
We also started treating throughput as a first-class citizen when designing our system architecture. Horizontal scalability and adequate use of parallelism helped us design a performant system.
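The oldest-message metric can be sketched in a few lines. This is a minimal illustration rather than the production code: the function names, the stage names (`fetch`, `enrich`, `index`) and the timestamps are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, Optional


def oldest_message_age(
    enqueued_at: Iterable[datetime], sampled_at: datetime
) -> Optional[timedelta]:
    """Return the age of the oldest message waiting in a queue at sampling time.

    Returns None when the queue is empty.
    """
    oldest = min(enqueued_at, default=None)
    if oldest is None:
        return None
    # Clamp at zero in case of small clock skew between producer and sampler.
    return max(sampled_at - oldest, timedelta(0))


def find_bottleneck(stage_ages: Dict[str, Optional[timedelta]]) -> Optional[str]:
    """Name the stage whose queue holds the oldest message (None if all are empty)."""
    non_empty = {stage: age for stage, age in stage_ages.items() if age is not None}
    if not non_empty:
        return None
    return max(non_empty, key=lambda stage: non_empty[stage])


if __name__ == "__main__":
    # Hypothetical sample: three pipeline stages observed at the same instant.
    now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    ages = {
        "fetch": oldest_message_age([now - timedelta(seconds=2)], now),
        "enrich": oldest_message_age(
            [now - timedelta(seconds=40), now - timedelta(seconds=5)], now
        ),
        "index": oldest_message_age([], now),
    }
    print(find_bottleneck(ages))  # enrich
```

Sampling this value per queue at a fixed interval gives one time series per stage; the stage whose series climbs first is the likely bottleneck.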
Stability is important both to guarantee adequate service to our users and to avoid daily firefighting.
To increase stability, we swapped the system backbone from HTTP to a mix of queues (SQS) and streams (Kinesis). The integration pattern guaranteed that we wouldn’t lose any documents in the event of an incident.
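The reason a queue-based backbone does not lose documents can be shown with a toy consumer: a message is removed only after it has been processed successfully, so a failure leaves it in place for a retry. The sketch below uses a hypothetical in-memory queue as a stand-in for a durable one; `InMemoryQueue` and `process_once` are names invented for the example, not SQS or Kinesis APIs.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Optional


class InMemoryQueue:
    """Toy stand-in for a durable queue: messages stay until explicitly deleted."""

    def __init__(self, messages: Iterable[str]) -> None:
        self._messages: Deque[str] = deque(messages)

    def receive(self) -> Optional[str]:
        # Return the head without removing it; removal requires delete().
        return self._messages[0] if self._messages else None

    def delete(self, message: str) -> None:
        self._messages.remove(message)

    def __len__(self) -> int:
        return len(self._messages)


def process_once(queue: InMemoryQueue, handler: Callable[[str], None]) -> bool:
    """Try to process the head message. Return True only if it was acknowledged."""
    message = queue.receive()
    if message is None:
        return False
    try:
        handler(message)
    except Exception:
        # Do not delete: the message stays in the queue and is retried later.
        return False
    queue.delete(message)
    return True
```

With a direct HTTP call, a failure in the downstream service drops the document unless the caller implements its own retries; here the retry falls out of the acknowledge-after-success ordering.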
Improved resilience also made it easier to deal with sudden spikes; we introduced elastic behaviours to core services, enabling them to scale in and out based on queue depth and load balancer response time.
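A scaling decision driven by those two signals might look like the sketch below. It is an illustration under stated assumptions: the thresholds, the task limits and the rule itself (size the service by backlog, add one task when responses are slow) are hypothetical and not taken from the actual configuration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    # All values are hypothetical defaults chosen for the example.
    target_backlog_per_task: int = 100  # messages each task should have queued
    max_response_time_s: float = 0.5    # load balancer response time threshold
    min_tasks: int = 2
    max_tasks: int = 20


def desired_task_count(
    policy: ScalingPolicy,
    current_tasks: int,
    queue_depth: int,
    response_time_s: float,
) -> int:
    """Return how many tasks the service should run given both signals."""
    # Tasks needed to bring the backlog per task down to the target.
    by_backlog = math.ceil(queue_depth / policy.target_backlog_per_task)
    # If responses are too slow, ask for one task more than we run today.
    by_latency = current_tasks + 1 if response_time_s > policy.max_response_time_s else 0
    desired = max(by_backlog, by_latency)
    # Never go below the floor or above the ceiling.
    return min(max(desired, policy.min_tasks), policy.max_tasks)
```

Taking the maximum of the two signals means either one can trigger a scale out, while a scale in happens only when both the backlog and the response time allow it.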
As often happens at startups, we had a snowflake infrastructure in which several components had been created directly from the AWS console. That approach was not only error-prone but also no longer scalable. We introduced Terraform to our stack and embraced infrastructure as code. Our infrastructure is now version controlled, and people can collaborate and benefit from the higher-level abstractions provided by Terraform modules. The process is now also repeatable; if we lose an infrastructure component, we can quickly recover it by running the relevant terraform apply.
In summary, while working through the above-mentioned streams, two common threads emerged as critical for a company at the round A stage:
- Operational repeatability
- The right metrics are key to success
As you grow, repeatable software operations will help you keep the team small and lean. If you want to move fast, smooth operability is key to maintaining speed to market and agility. Metrics are also critical; without the right metrics it’s hard to know what’s going on in the system, and therefore impossible to decide where to focus and how much improvement is required.
Thanks to Miguel Martinez-Alvarez for the feedback on the draft of this article.