A permanent priority: operational resilience in 2023

Michael Davies, Global Head of Operations

We are all working in a world that feels increasingly volatile, surrounded by both known and unforeseeable risk. From cybersecurity to geopolitics and the economic trends affecting financial markets, fund managers are more aware than ever of the external factors that could cause stress to their systems. That recognition is being reinforced by regulators, who now insist that organisations take responsibility not just for the stability of their own operations, but also for ensuring that those of their suppliers and counterparts meet the same standard.

It’s no surprise, therefore, that operational resilience is a topic we find ourselves discussing more frequently with customers. Quite rightly, customers want to be reassured that the systems on which they process trades are robust and can deal with the volatility arising from major events.

Compliance in this area is also becoming more stringent. The ISO 22301 standard under which we are certified specifies not just that a business continuity plan should exist, but that it be regularly updated, effectively communicated, and consistently evaluated.

The industry’s increasing focus on operational resilience is something we welcome, and it feeds into the internal dialogue at Calastone about how we can best protect our systems and guard against risks to our service. There are two overarching dimensions to this: the way we run, manage and monitor our technical infrastructure; and the steps we take to ensure our teams are ready to respond effectively if an incident arises.

Systems: well-diversified, continuously monitored

On the technical side, our resilience is based on the principle of diversification. Calastone’s platform runs an active/active environment across a hybrid of cloud-based infrastructure and physical data centres located some distance apart. As a result, a single major disruption or power outage is unlikely to affect every site at once. Each data centre is capable of running the system independently and managing a full day’s traffic should one of the others fail. We also run regular failover tests in which at least one data centre is taken offline to check that the system will run smoothly without it. All this helps to ensure uptime for our fully digitalised funds network, whose own automated systems are one of our strongest safeguards against disruption.

A second cornerstone of operational resilience is monitoring. We have real-time feedback in place across all facets of our production environment, not just to alert our engineers in the event of an incident or outage, but to sound the alarm about unusual activity that may indicate a problem is imminent. If, for example, a server CPU is running at an excessively high level for longer than we would normally expect, that will trigger an engineer to troubleshoot the issue and intervene appropriately. This ensures that the majority of action taken is pre-emptive, with nascent gremlins in the system identified and resolved before they ever threaten to impact a customer.
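As an illustration only, the CPU example above can be sketched as a "sustained breach" check: a brief spike is ignored, but a reading that stays above a threshold for longer than expected raises an alert. The names `Sample` and `first_sustained_breach`, and the threshold and duration values, are assumptions made for this sketch and do not describe Calastone's actual monitoring stack.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Sample:
    timestamp: float    # seconds since some reference point (hypothetical unit)
    cpu_percent: float  # CPU utilisation, 0-100


def first_sustained_breach(
    samples: Iterable[Sample],
    threshold: float = 90.0,      # assumed "excessively high" level
    max_duration: float = 300.0,  # assumed tolerated duration, in seconds
) -> Optional[float]:
    """Return the timestamp at which CPU has stayed above `threshold`
    continuously for at least `max_duration` seconds, or None."""
    breach_start: Optional[float] = None
    for sample in samples:
        if sample.cpu_percent > threshold:
            if breach_start is None:
                breach_start = sample.timestamp  # start of a new high run
            if sample.timestamp - breach_start >= max_duration:
                return sample.timestamp          # sustained long enough to alert
        else:
            breach_start = None                  # reading recovered, reset the run
    return None


# One reading per minute, pinned at 95% for six minutes.
sustained = [Sample(float(t), 95.0) for t in range(0, 361, 60)]
# A spike that recovers before the tolerated duration is reached.
spiky = [Sample(0, 95), Sample(60, 95), Sample(120, 50), Sample(180, 95), Sample(240, 95)]

alert_at = first_sustained_breach(sustained)
no_alert = first_sustained_breach(spiky)
```

Resetting the run whenever a reading recovers is what separates a pre-emptive warning from noise: only behaviour that persists beyond the expected window reaches an engineer.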

Our monitoring also extends to the behaviour of client systems, which allows us to identify when they might be experiencing a problem or abnormality of any kind. We have benchmarked the expected performance and responsiveness of our highest volume clients and will reach out when we see this performance drop below the expected threshold. It is not uncommon for our automated checks to identify unusual patterns of behaviour in a customer system with which we interact, sometimes before the customer has noticed it.
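A minimal sketch of this kind of benchmark comparison follows, assuming response time is the measured quantity and that a client is flagged once its recent average is markedly slower than its benchmark. The function name `performance_degraded`, the choice of metric and the tolerance factor are all hypothetical; the article does not specify how the benchmarks are defined.

```python
from statistics import mean
from typing import Sequence


def performance_degraded(
    recent_response_ms: Sequence[float],
    benchmark_ms: float,
    tolerance: float = 1.5,  # assumed: flag when 50% slower than benchmark
) -> bool:
    """Return True when the mean of recent response times exceeds the
    client's benchmark by more than the tolerance factor."""
    if not recent_response_ms:
        return False  # no data: nothing to compare in this sketch
    return mean(recent_response_ms) > benchmark_ms * tolerance


# Hypothetical client benchmarked at 100 ms.
healthy = performance_degraded([100, 110, 90], benchmark_ms=100)
degraded = performance_degraded([200, 180, 220], benchmark_ms=100)
```

In practice a missing data feed may itself be worth an alert; the sketch treats it as "no signal" purely to keep the comparison simple.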

This underlines how a spirit of co-operation is essential to ensuring operational resilience right across the value chain. The emphasis is always on ensuring that the end investor is unaffected, regardless of where and when a problem may arise. Bilateral processes and safeguards should be there to pick up the slack and overcome points of failure. For example, our use of timestamps ensures trades can be booked accurately for an ordering party even if there is an issue in a transfer agent’s system or at the point where it connects to the Calastone platform. Resilience is achieved as much through emphasising how systems work together, and stress testing the intersections, as it is by any individual party.

People: whenever, wherever

Of course, IT systems are only as robust as the people who write their code and administer them. That has always been the case, but as working patterns change, plans to ensure people are available to respond to incidents have had to evolve in turn.

Every organisation experienced this during the Covid lockdowns, which brought business continuity and operational resilience to the forefront. At Calastone that meant implementing a readiness plan for fully remote operation, which we triggered before government guidelines, allowing our people to work from home with appropriate technology and secure access to systems.

Alongside fully remote working is a widely distributed team. With our increasing global footprint, our organisation is structured with employees across multiple time zones, ensuring that engineers are available around the clock to respond to issues as they arise.

Training and scenario planning form another key component. We regularly run incident rooms that simulate how a team would respond to a systemic disruption or attack, giving our people the chance to practise their own role and the way they communicate and work together. The intention is not just to rehearse the key technical tasks, but to remind teams that an incident must be run in a holistic way, with overall management of resources and client communication as important as technical troubleshooting and fixes. We also hold regular forums with a wider group to discuss additional situations we might encounter and risks on the horizon that may not yet have been fully planned for.

The work of operational resilience is continuous and our guiding philosophy is to make improvements little and often, whether testing a new technology solution or updating an existing incident response plan. The nature of a constantly evolving threat landscape means that business continuity can never be a series of plans gathering dust, but must be an active playbook that is familiar to all, regularly revisited and constantly revised. Operational resilience must be a permanent priority to avoid cracks emerging that can put systems at risk of exploitation or failure. 

The greater emphasis on business continuity in recent years is of benefit to everyone. The more that organisations question each other about operational resilience, the higher the regulatory bar, and the more rigour that goes into developing and testing disaster responses, the better prepared each company and the industry as a whole will be. Not every incident can be pre-empted and not every problem that arises will have been planned for: by its nature, some risk is impossible to predict. But by investing in robust systems and prepared people, we collectively stand the best chance of dealing with whatever the future may bring. 
