Problems and pressures

Why control software becomes harder to change over time

Control software tends to become harder to change over time because of how it is traditionally built. Code is written line by line in vendor-specific environments, often under commissioning pressure that prioritizes the fair-weather scenarios over the rare ones. Design intent lives in separate documents that drift out of sync. Edge cases surface in production after the commissioning team has left. The engineers who understood the original logic change projects or retire. What starts as a system that mostly works gradually becomes a system nobody fully understands. Each change carries unpredictable risk, debugging takes longer than writing new code, and teams begin to manage symptoms rather than fix root causes. This is not a people problem. It is an architectural one.

The fragile state begins on day one

The standard mental model of legacy software is that systems start clean and degrade over time. The reality is more uncomfortable. Even on the day a system goes live, parts of it are already fragile. Documentation is rarely complete at commissioning. Testing covers the scenarios the team thought of, not the ones the system will actually encounter. The bad-weather cases (rare combinations, edge conditions, operator mistakes that nobody simulated) show up only after the system has been running long enough to find them.

When those cases surface, they often surface during operations. Production is running. Downtime costs money by the minute. Engineers do not have the luxury of designing the cleanest fix; they have the time to make the fix that works fastest. Multiply that pattern across twenty years and the codebase is full of fixes that solved one problem and created the conditions for the next.

Each fix introduces a new dependency, often invisible. The patch added in year three to handle a sensor anomaly becomes a load-bearing assumption that the patch in year seven inadvertently violates, which produces the symptom that the patch in year eleven is trying to address. None of the engineers involved did anything wrong. The architecture simply made it impossible to see the connections.

This is why fragility cannot be fixed by hiring better engineers or writing more documentation. The architecture treats every fix as a code change, and code changes accumulate dependencies that nobody can fully trace.

The lifecycle pattern: clean, working, fragile, frozen

Most control software passes through four states in its lifecycle, and the transitions between them are predictable.

Clean. The system is new. The original engineers wrote it, they understand the design, and the documentation is current. Changes are made with confidence. Bug rates are low. This phase typically lasts one to three years after commissioning.

Working. Field-discovered exceptions and operator requests have introduced changes that were not in the original spec. Each change made sense at the time and was made by engineers who understood the local situation. The cumulative effect is harder to see. The documentation is updated when there is time, which is rarely. Most of the original engineers are still around, but a few have moved on. Changes are still mostly made with confidence, but each one starts with a slightly longer "let me check what depends on this." This phase lasts another three to five years.

Fragile. Documentation has fallen behind the running code. Original engineers have rotated off the project. The codebase contains layers of patches that nobody can fully trace. Engineers making changes start with "let's understand what's there," and that step now takes longer than writing the new logic. Bug rates rise. Changes are made carefully, then more carefully, then deferred. This phase can last a decade.

Frozen. The team manages around the system rather than changing it. Operators have workarounds. Maintainers have tribal rules: "do not touch state machine X," "always restart in this order." The system still runs, but it cannot be improved. Eventually it has to be replaced, often at much higher cost than the changes that were avoided.

The progression is not a moral failure of the engineering team. It is the predictable consequence of code-centric architecture interacting with the real-world conditions of long lifecycles, staff turnover, and accumulated change.

The root causes

Three structural causes drive the progression, and they reinforce each other.

Code-centric architecture. The code is the system. Every other artifact (requirements, design, tests) is a description of the code, maintained separately, kept in sync by hand. As soon as one of those artifacts falls behind, the code is the only thing that can be trusted, and the code is also the artifact whose intent is hardest to recover.

Document drift. Documentation is updated under operational pressure, and operational pressure is usually pointed at the running system, not the documents. Two years in, the documents describe a system that no longer exists. Five years in, the documents are referenced sceptically. Ten years in, they are treated as historical context, not authoritative reference.

Knowledge in heads. The deep understanding of why a system behaves the way it does lives with three or four engineers. They know the workarounds, the field-discovered exceptions, the reasons for the apparent inconsistencies. When they leave, that knowledge leaves with them. Their successors inherit a system whose intent has been compressed into the syntax of an outdated programming language.

Why patches and refactoring rarely fix it

The instinct, when a code-centric system gets fragile, is to patch it: cleaner abstractions, better naming, structured comments, refactoring sprints. These help on the margins. They do not change the architectural cause, which is that the design intent and the running system are kept in different artifacts.

A refactoring sprint that improves the readability of the code without recovering the design intent prolongs the fragile state rather than resolving it. The next engineer still has to start with "let's understand what's there." The code is just slightly less hostile to that effort. Five years later, the patches have accumulated and the system is fragile again.

This is why so many automation modernization decisions look binary in practice: live with the existing system or replace it entirely. The middle path (continuous improvement of a code-centric stack) does not change the fundamental architecture, so it does not change the fundamental trajectory.

Why root cause analysis fails in code-centric systems

When something goes wrong in a code-centric control system, root cause analysis is supposed to follow a specific sequence: observe the symptom, gather data about what the system was doing, identify the failing component, fix the underlying issue. In practice, that sequence rarely completes.

The data is incomplete. Most legacy systems were not instrumented to capture every PLC cycle. They captured selected signals, at selected intervals, into a historian that was sized for normal operation. When something rare happens, the data needed to understand it was not stored. The team has snapshots, not the full record.
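The gap between interval sampling and per-cycle capture can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the cycle count, the fault position and length, and the sampling interval are invented numbers, and the helper names `simulate_signal` and `sample_every` are hypothetical rather than part of any historian product.

```python
def simulate_signal(total_cycles, fault_start, fault_length):
    """Return one boolean per PLC cycle: True while the (hypothetical) fault is active."""
    return [fault_start <= cycle < fault_start + fault_length
            for cycle in range(total_cycles)]


def sample_every(signal, interval):
    """Mimic a historian that stores one value every `interval` cycles."""
    return signal[::interval]


# Assumed numbers: a 5-cycle transient inside a 10,000-cycle run,
# recorded by a historian that keeps every 100th cycle.
signal = simulate_signal(total_cycles=10_000, fault_start=4_321, fault_length=5)
sampled = sample_every(signal, interval=100)

fault_in_full_record = any(signal)   # the per-cycle record contains the transient
fault_in_samples = any(sampled)      # the sampled record does not

print("full record saw fault:", fault_in_full_record)
print("sampled record saw fault:", fault_in_samples)
```

With these assumed numbers the transient falls between two stored samples, so the sampled record shows nothing unusual even though the fault occurred.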

The intent is invisible. The code shows what the system did. It does not show why it was supposed to do that. The engineer trying to diagnose the symptom has to reconstruct intent from code, often code written by someone who is no longer at the company.

The pressure is wrong. Most diagnoses happen during downtime, on the factory floor, with operations waiting. The fix that works fastest gets deployed. The deeper fix, the one that addresses the underlying cause, requires time and tooling that nobody has at that moment.

Without complete behavioral data, without visible intent, and without time, root cause analysis turns into pattern matching: "we saw something like this in 2019, the fix was X." Sometimes the fix works. Sometimes it solves a different problem with the same symptom and creates the conditions for the next incident. The team gets better at managing the symptoms and worse at fixing the system, because the architecture does not give them what they need to do anything else.

The architectural alternative: model as source of truth

The alternative is not better code. It is a different architecture in which the intent and the implementation are not separate artifacts. Model-centric control software places the design intent (requirements, structure, behavior, validation) in one connected model and treats the deployable code as a generated by-product. The model is the artifact engineers maintain. The code is regenerated when the model changes.

The architectural shift changes the trajectory. Document drift cannot occur, because there are no separate documents to drift; documentation is a view of the model. Knowledge that would otherwise live in heads can be captured as model elements, with rationale and traceability attached. New engineers inherit the model, not just the code; the design intent is right there.
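A minimal sketch of the idea, under stated assumptions: the conveyor model, its field names, and the two generator functions below are hypothetical and are not the format or generator of any particular platform. The point is only that code and documentation are both derived from the same model, so neither can fall behind the other.

```python
# Hypothetical model: one small state machine with the rationale attached
# to each transition.
MODEL = {
    "name": "Conveyor",
    "initial": "Idle",
    "transitions": [
        {"source": "Idle", "event": "start", "target": "Running",
         "rationale": "Operator starts the line."},
        {"source": "Running", "event": "jam_detected", "target": "Faulted",
         "rationale": "Stop on jam to protect the belt."},
        {"source": "Faulted", "event": "reset", "target": "Idle",
         "rationale": "Manual reset after the jam is cleared."},
    ],
}


def generate_code(model):
    """Render the model as a Python transition table (the generated by-product)."""
    lines = [
        f"# Generated from model '{model['name']}'. Do not edit by hand.",
        f"INITIAL = {model['initial']!r}",
        "TRANSITIONS = {",
    ]
    for t in model["transitions"]:
        lines.append(f"    ({t['source']!r}, {t['event']!r}): {t['target']!r},")
    lines.append("}")
    return "\n".join(lines)


def generate_docs(model):
    """Render the same model as human-readable documentation, rationale included."""
    lines = [f"State machine: {model['name']} (initial state: {model['initial']})"]
    for t in model["transitions"]:
        lines.append(
            f"  {t['source']} --{t['event']}--> {t['target']}: {t['rationale']}"
        )
    return "\n".join(lines)


code = generate_code(MODEL)
docs = generate_docs(MODEL)
print(code)
print(docs)
```

Changing a transition in `MODEL` and regenerating updates both outputs in one step, which is the property the paragraph above describes.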

What "safe to change" actually requires

Safe-to-change is not a function of better processes. It is a function of three properties of the system:

  • Visible intent. Engineers can see why the system is the way it is, not just what the code does. The model carries the intent.
  • Predictable impact. A change to the model surfaces the dependent artifacts (other model elements, generated code, test cases, runtime monitors). Impact analysis is a query, not a guess.
  • Verifiable behavior. The model can be simulated before deployment, and runtime behavior can be projected back onto the model after deployment. The link between intent and observed behavior is direct.

A system with these three properties stays changeable for decades. A system without them drifts toward the frozen state regardless of how disciplined the team is.
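"Impact analysis is a query" can be made concrete with a small sketch. It assumes the traceability links are available as a simple mapping from each element to the artifacts that depend on it. The element names and the `impacted_by` function are hypothetical, chosen only for illustration.

```python
from collections import deque

# Hypothetical traceability links: each key is depended on by the
# artifacts listed as its value.
DEPENDENTS = {
    "REQ-12 jam handling": ["SM Conveyor.Faulted"],
    "SM Conveyor.Faulted": [
        "code conveyor_generated",
        "test jam_stops_belt",
        "monitor jam_rate",
    ],
    "test jam_stops_belt": ["report acceptance"],
}


def impacted_by(element, dependents):
    """Return every artifact that directly or indirectly depends on `element`.

    Breadth-first traversal, so nearer dependents are listed first.
    The `seen` set keeps the traversal finite even if the links contain a cycle.
    """
    seen = {element}
    order = []
    queue = deque([element])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                order.append(dependent)
                queue.append(dependent)
    return order


for artifact in impacted_by("REQ-12 jam handling", DEPENDENTS):
    print("needs review or regeneration:", artifact)
```

In a code-centric system the equivalent of `DEPENDENTS` exists only in the heads of the engineers, which is why the same question there is answered by guessing.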

How Cordis SUITE makes change safe

Cordis SUITE has been built on the model-centric architecture since 2000, but the architecture alone is not what makes change safe. What makes change safe is the combination of the model and the runtime evidence that links back to it.

Every PLC cycle is captured. Not selected signals at selected intervals. The full behavioral record. When something happens in production, the engineer does not have to reconstruct what the system was doing from incomplete logs. The complete record is there.

Snapshots can be replayed offline. The state of a running system at any moment can be captured and brought back to the engineering environment. The engineer opens the snapshot in the same dashboard they use for design, plays back the live state machines as they actually executed, and watches the path through the model that produced the observed behavior. Root cause analysis happens in the office, not on the factory floor: without operational pressure, with the full behavioral record visible, and with the model right there to compare against.

Intent is captured in the model, not buried in code. When the engineer finds the root cause, they are looking at a model element with rationale, traceability, and links to the requirement that originated it. The "why" is visible alongside the "what." The fix becomes a model edit, with the platform showing every dependent artifact that needs to regenerate.

Live state machines can be played back. The behavior of any subsystem can be replayed at any speed, against the same model that defined it. Discrepancies between specified and actual behavior become directly visible.
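The comparison between specified and actual behavior can be sketched as a replay loop. This is an illustration under assumptions, not the Cordis SUITE implementation or API: the transition table, the trace format of (cycle, event, observed state), and the function `first_discrepancy` are all hypothetical.

```python
# Hypothetical specification: (current state, event) -> next state.
SPEC = {
    ("Idle", "start"): "Running",
    ("Running", "jam_detected"): "Faulted",
    ("Faulted", "reset"): "Idle",
}

# Hypothetical recorded trace: (PLC cycle number, event, state observed afterwards).
TRACE = [
    (10, "start", "Running"),
    (57, "jam_detected", "Running"),  # the running system ignored the jam
    (90, "reset", "Idle"),
]


def first_discrepancy(spec, initial, trace):
    """Replay `trace` against `spec` and return the first mismatch, or None.

    Events with no transition defined for the current state are assumed
    to leave the state unchanged.
    """
    state = initial
    for cycle, event, observed in trace:
        expected = spec.get((state, event), state)
        if observed != expected:
            return {
                "cycle": cycle,
                "event": event,
                "expected": expected,
                "observed": observed,
            }
        state = observed
    return None


print(first_discrepancy(SPEC, "Idle", TRACE))
```

Because the replay stops at the first cycle where the recorded state departs from the model, the engineer starts the investigation at the point of divergence rather than at the visible symptom.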

This is the practical difference between a code-centric stack and a model-centric platform. Both can run the same plant. Only one of them lets the engineering team understand what happened, find the root cause, and fix the underlying issue before the next incident. The architecture does not just slow the drift toward fragility. It changes what "making a fix" means.

The cost of doing nothing

The cost of staying in the code-centric pattern compounds across an asset portfolio. Each system is one engineering team's problem in the short term and a portfolio-level liability in the long term. The cost shows up as:

  • Modernization budgets that are larger than they should be, because behavior has to be rediscovered from scratch.
  • Operations budgets that grow because the team manages around the system instead of fixing it.
  • Risk exposure during incidents, because the system that nobody fully understands is also the system that has to be diagnosed under time pressure.
  • Compliance cost, because audit trails depend on documents that no longer match the running code.

The shift to a model-centric architecture is not free either. It requires investment in the platform and in capturing existing behavior. The honest comparison is the lifecycle cost of one approach against the lifecycle cost of the other, summed across an asset portfolio and across decades.

The underlying choice is architectural. A code-centric system is built to run. A model-centric system is built to run and to be understood. The first will eventually become a system nobody can safely change. The second remains changeable for as long as the asset itself runs. The cost of staying with the first is not visible in any single year. It is visible across decades, in the systems that get replaced not because they failed, but because nobody trusted them anymore.

About Cordis SUITE

Model-centric since 2000.

Cordis SUITE has been a model-centric platform for industrial control since 2000. The model is the maintained artifact, code is generated as a by-product, and design intent stays linked to the running system. Family-owned and used across national infrastructure, water, energy, and high-tech production, where systems must stay changeable across decades.
