The Hard Part Was Deciding What to Automate First

The ideal expense workflow is one you barely notice.

An expense is captured automatically. The right policy is applied. The supporting evidence is checked. Obvious cases move through without anyone needing to think about them. A person only gets involved when something genuinely needs judgement.

I hate spending time submitting expenses. Nobody wants to spend time approving obviously valid ones either.

That’s the end goal.

But you do not get there by giving an LLM the entire workflow on day one and hoping for the best. That usually gives you mediocre results. Bad for the product and bad for the customer.

In the previous post, I wrote about how our first prompt had too many jobs. It tried to reason about too many different things at once, which made the output harder to trust and the system harder to debug. We split that prompt into smaller checks with clearer responsibilities.

That solved one problem and exposed another:

Which decisions is the system actually ready to own?

That required more product thinking than prompt engineering.

The current version is deliberately an assistance layer. It brings the right information to reviewers and helps them focus on the expenses that may need attention. Over time, the goal is to automate more of the workflow as the system earns enough trust.

The destination is touchless. The path is incremental.

Start with the workflow, not the model

An expense workflow looks simple until you sit down with the people who actually review expenses.

Someone submits an expense. A reviewer checks whether it appears to comply with policy. The supporting evidence needs to line up with what was submitted. The expense moves through an approval flow. Eventually, it is reimbursed or paid.

On paper, that’s straightforward.

In practice, reviewers make a lot of small judgements along the way.

Does this fall within the relevant policy? Is the timing reasonable? Is the explanation sufficient? Does the supporting evidence match the claim? Is there a legitimate reason for an exception? Does anything look unusual enough to deserve a closer look?

Some of those questions are mechanical. The reviewer is applying a straightforward rule or comparison rather than making a judgement call.

Some require interpretation.

Some are reasonable for a system to own today. Others need to remain with a human until the system is reliable enough to take on more responsibility.

The mistake would have been to treat all of them as equivalent.

Rather than starting with the model, we started with the workflow. For each judgement, we asked a few basic questions:

  • Can this be answered deterministically?
  • Do we have the data needed to make the decision?
  • What happens if the system gets it wrong?
  • How often does this case occur?
  • Is the system making a decision, or surfacing something for review?
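One way to make those questions concrete is to record the answers per judgement and derive a suggested treatment from them. The sketch below is purely illustrative: the `Judgement` fields, the treatment labels, and the ordering of the rules are assumptions, not a description of the production system, and it leaves out the frequency question for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgement:
    """Answers to the triage questions for one reviewer judgement (hypothetical shape)."""
    name: str
    deterministic: bool     # can a rule, lookup, or comparison answer it?
    data_available: bool    # do we have the data needed to decide?
    costly_if_wrong: bool   # is a wrong answer expensive for the customer?


def suggested_treatment(judgement: Judgement) -> str:
    """Map the answers to a treatment. The rule order is an assumption."""
    if not judgement.data_available:
        return "leave with reviewer"
    if judgement.deterministic:
        return "ordinary code"
    if judgement.costly_if_wrong:
        return "model, surface for review"
    return "model, candidate for automation"


if __name__ == "__main__":
    print(suggested_treatment(Judgement("amount within limit", True, True, True)))
```

Even a table this crude forces the conversation away from "can the model do it?" and towards "should the system own it yet?".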

That last question matters because there is a big difference between:

This expense should be rejected.

and:

This expense may need a closer look.

For the first version, we were building a system to help reviewers focus their attention. We were not trying to remove the reviewer before the system had earned that trust.

The longer-term goal is slightly different, though. Routine cases should eventually move through without anyone needing to look at them. Human attention should be reserved for the cases where judgement is genuinely useful.

Does this actually need an LLM?

Once we mapped the workflow, one question kept coming up:

Does this actually need an LLM?

Sometimes the answer was yes.

Existing company policies are written for humans, not machines. They may contain exceptions, vague wording, and rules that only make sense in context. The general shape is often similar, but the details can vary significantly between companies.

Supporting evidence can also be messy. The system may need to reason about incomplete information or work out whether the available evidence plausibly supports what was submitted. That’s where a model can add value.

But sometimes the answer was no.

Some questions are better handled with ordinary code. If a rule, lookup, or comparison can answer the question reliably, that is normally the better tool. It is cheaper, faster, easier to test, and easier to explain. Most importantly, it is more predictable.
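As a minimal sketch of what "ordinary code" can look like here, consider a spending-limit check. The `Expense` shape, the `CATEGORY_LIMITS` table, and the amounts are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class Expense:
    category: str
    amount: Decimal
    currency: str


# Hypothetical limits keyed by (category, currency); real policies vary by company.
CATEGORY_LIMITS = {
    ("meals", "USD"): Decimal("75.00"),
    ("taxi", "USD"): Decimal("60.00"),
}


def exceeds_limit(expense: Expense) -> Optional[bool]:
    """Return True or False when a limit is configured, None when no rule applies."""
    limit = CATEGORY_LIMITS.get((expense.category, expense.currency))
    if limit is None:
        return None  # no deterministic answer; do not guess
    return expense.amount > limit


if __name__ == "__main__":
    print(exceeds_limit(Expense("meals", Decimal("80.00"), "USD")))
```

The same inputs always produce the same answer, and a unit test can cover every branch. Returning `None` when no rule applies keeps "we do not know" separate from "this is fine".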

The goal is not to maximise the number of AI calls. The goal is to use model judgement where it adds value and keep ordinary logic where ordinary logic is enough.

That sounds obvious when written down.

It is less obvious when you are building a product with LLMs and every problem starts to look like an opportunity to call the model again.

Different judgements need different treatment

One useful distinction was between policy interpretation and verification.

A policy question asks:

Does this expense appear to follow the company’s rules?

A verification question asks:

Does the available evidence support what was submitted?

Those questions sound related, but they are not interchangeable.

A policy concern might mean an expense appears to fall outside a company rule. A verification concern might simply mean that the available evidence was not sufficient to confirm the submission.

That difference matters to the person reviewing the expense. It also matters to the product. The explanation should be accurate, the next step should make sense, and the system should not imply more certainty than it actually has.
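One way to keep the two kinds of concern from blurring together is to carry the distinction in the data rather than in free text. The names below (`ConcernKind`, `Concern`, `reviewer_message`) and the message wording are assumptions for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class ConcernKind(Enum):
    POLICY = "policy"              # the expense may fall outside a company rule
    VERIFICATION = "verification"  # the evidence was not sufficient to confirm it


@dataclass(frozen=True)
class Concern:
    kind: ConcernKind
    detail: str


def reviewer_message(concern: Concern) -> str:
    """Word the message so it claims no more certainty than the concern carries."""
    if concern.kind is ConcernKind.POLICY:
        return f"May fall outside policy: {concern.detail}"
    return f"Could not verify from the available evidence: {concern.detail}"


if __name__ == "__main__":
    print(reviewer_message(Concern(ConcernKind.VERIFICATION, "receipt total is unreadable")))
```

Because the kind is explicit, the interface can suggest a different next step for each: check the rule for a policy concern, ask for better evidence for a verification concern.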

This is a small example of a broader point: once you stop treating the workflow as one large AI problem, you can choose the right treatment for each judgement.

  • Some need (non-deterministic) model interpretation.
  • Some need ordinary deterministic code.
  • Some should surface information without making the final call.
  • Some can eventually become fully automated.

Automation has to earn trust

The end goal for us is a touchless experience.

If an expense is captured correctly, supported by the available evidence, and clearly within policy, there is little value in asking a person to click approve.

But the system needs to earn the right to make that decision.

The first version focuses on helping reviewers. It surfaces relevant information, identifies potential issues, and reduces the amount of manual investigation required. A human still makes the final call.

That gives us a safer way to learn.

We can observe where the system is reliable, where it is too cautious, where it produces noise, and where the underlying data is not good enough yet.
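A simple way to observe this is to compare each check's flags with what the reviewer eventually decided. The sketch below assumes a hypothetical record shape of `(check_name, flagged, reviewer_agreed)` and measures, per check, how often a raised flag was confirmed.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (check_name, flagged, reviewer_agreed): a hypothetical record shape.
Record = Tuple[str, bool, bool]


def flag_precision(records: Iterable[Record]) -> Dict[str, float]:
    """For each check, the share of raised flags that the reviewer confirmed."""
    flagged: Dict[str, int] = defaultdict(int)
    confirmed: Dict[str, int] = defaultdict(int)
    for name, was_flagged, reviewer_agreed in records:
        if not was_flagged:
            continue  # only raised flags count towards precision
        flagged[name] += 1
        if reviewer_agreed:
            confirmed[name] += 1
    return {name: confirmed[name] / flagged[name] for name in flagged}


if __name__ == "__main__":
    sample = [
        ("duplicate", True, True),
        ("duplicate", True, False),
        ("limit", True, True),
    ]
    print(flag_precision(sample))
```

A check with low precision is producing noise. This measure says nothing about problems the check missed, which need a separate evaluation set.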

The next step is not to replace the reviewer as a blanket rule. It is to progressively remove routine work from the reviewer’s queue.

Start with the obvious cases. Escalate the ambiguous ones. Expand the touchless path as confidence improves.
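That progression can be sketched as a routing function with an explicit allow-list of checks trusted to run without a reviewer. Everything here is an assumption for illustration: the `CheckResult` shape, the status strings, and the idea of a `trusted_checks` set.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Set


class Route(Enum):
    TOUCHLESS = "touchless"
    REVIEW = "review"


@dataclass(frozen=True)
class CheckResult:
    name: str
    status: str  # "pass", "concern", or "unknown"


def route(results: Iterable[CheckResult], trusted_checks: Set[str]) -> Route:
    """Touchless only when every check passed and every check is trusted."""
    results = list(results)
    if not results:
        return Route.REVIEW  # nothing was checked, so a person decides
    for result in results:
        if result.status != "pass" or result.name not in trusted_checks:
            return Route.REVIEW  # ambiguous or untrusted: escalate
    return Route.TOUCHLESS


if __name__ == "__main__":
    checks = [CheckResult("limit", "pass"), CheckResult("receipt_match", "pass")]
    print(route(checks, {"limit"}))
```

Expanding the touchless path then becomes a deliberate, reviewable change: adding a check name to `trusted_checks` once the evidence supports it.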

That’s a more credible route to automation than pretending every judgement is equally ready on day one.

False positives spend trust

One of the easiest ways to make an AI review system feel broken is to flag too much.

A model can be technically defensible and still be annoying.

If every ambiguous expense becomes a warning, reviewers learn to ignore the system. Employees feel like they are being accused of something every time the evidence is slightly imperfect. Then the product becomes a very expensive way to generate noise.

So we adopted a simple principle:

When the available information is ambiguous, do not overstate the conclusion.

That does not mean ignoring obvious problems. It means distinguishing between clear evidence and uncertainty.

Sometimes the right output is not:

This breaks policy.

It is:

We could not verify this from the information available.

Sometimes the right output is no flag at all.

This is not just a prompt-engineering decision. It is a product decision. You are choosing how much friction users should experience and how much trust you are willing to spend.

That trust matters even more if the long-term goal is greater automation. A system that constantly cries wolf will struggle to earn the right to make more decisions on its own.

Shipping less can be the correct decision

Not every idea worked well enough on the first attempt.

Some checks sounded straightforward when described in a planning document but became much harder once they met real data.

Real-world inputs are messy. Names vary. Evidence is incomplete. Two expenses can look similar for entirely legitimate reasons. Something that feels obvious to a human can be surprisingly difficult to encode reliably.

Dates can be especially messy in payment workflows. The transaction date, authorisation date, settlement date, and receipt date do not always line up neatly.
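A strict equality check on dates would flag many legitimate expenses for exactly this reason. A minimal sketch of a more forgiving comparison follows; the three-day default is an arbitrary assumption, not a recommended value.

```python
from datetime import date


def dates_plausibly_match(
    receipt_date: date,
    transaction_date: date,
    tolerance_days: int = 3,  # assumed tolerance; tune against real data
) -> bool:
    """True when the two dates fall within the tolerance of each other."""
    gap = abs((transaction_date - receipt_date).days)
    return gap <= tolerance_days


if __name__ == "__main__":
    print(dates_plausibly_match(date(2024, 3, 1), date(2024, 3, 3)))
```

The right tolerance depends on which two dates are being compared, so it is better set from observed data than guessed.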

When a check generates too much noise, the answer is not always to add another paragraph to the prompt. Sometimes the answer is to remove it from the workflow, improve the evaluation set, and come back later.

That’s a normal engineering decision.

We disable features when they are not reliable enough. We reduce scope when the edge cases are not understood. We do not keep shipping something merely because it contains AI.

The AI part does not change that.

Closing

The hardest part of building the agent was not getting an LLM to return an answer. Models are very good at returning answers. The harder question was deciding which decisions the system was ready to own.

The current version helps reviewers work faster. It surfaces the relevant information, narrows the queue, and reduces the amount of manual investigation required.

But that’s not the final destination.

Nobody wants to spend time submitting expenses. Nobody wants to spend time approving obviously valid ones. The long-term goal is a workflow where routine cases move through without human effort and people only get involved when their judgement is genuinely useful.

The destination is touchless expense management.

The engineering work is earning the trust to get there.

About the author

Alex Aitken

Alex is a software engineering leader focused on AI, data, and product engineering at Airwallex. His opinions are his own.
