Ostorlab: Mobile App Security Testing for Android and iOS

Author

Abir Jelti

Thu 25 June 2026

Why agentic harnesses are becoming critical for real mobile security testing

Give a language model a mobile app and ask it to test security.

The answer may look familiar: insecure storage, weak cryptography, exposed secrets, excessive permissions, vulnerable SDKs, backend issues, authentication flaws, privacy risks.

It may be structured like a report. It may reference the right categories. It may read like something a security team could send around.

The app has still not been opened.

No package was inspected. No permission was checked against runtime behavior. No SDK traffic was observed. No storage location was reviewed. No backend call was followed. No crash state was captured. No evidence changed hands.

The report exists. The target has not changed state.

A Reasoning Engine With No Hands

Most conversations about AI in security still start with the model: which one is smarter, which one reasons better, which one has the longer context window, which one performs best on benchmarks.

In a real security workflow, the model is only one part of the system.

A raw model can understand instructions, generate hypotheses, describe attack paths, and decide what should happen next. But testing a target requires contact with the target. The app has to be unpacked. Permissions have to be reviewed. SDKs have to be identified. Storage has to be inspected. Traffic has to be observed. Authentication flows have to be tested. Backend calls have to be followed. Noisy leads have to be discarded. The findings that survive have to be backed by something more durable than a paragraph.

Around the model sit the tools, context, prompts, skills, memory, execution loops, and feedback that let those steps happen.

When the Model Can Touch the App

A model sees “sensitive data storage” and can describe what should be checked: shared preferences, local databases, cache files, keychain use, encryption, logs. The words are correct. The app is still a black box.

In a harnessed workflow, the app is unpacked. Storage locations are inspected. Configuration files are opened. Runtime behavior is observed. A value either appears on disk or it does not. A concern either picks up evidence or fades out.
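One such storage check can be sketched in a few lines. This is a minimal sketch, assuming the app's private data directory has already been pulled to a local folder from a test device or emulator. `shared_prefs/*.xml` is Android's standard SharedPreferences location; the key patterns and the `scan_shared_prefs` helper are illustrative assumptions, not a description of any particular product.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical patterns; a real harness would tune these per app.
SENSITIVE_KEY = re.compile(r"(token|password|secret|session|api[_-]?key)", re.I)


def scan_shared_prefs(data_dir):
    """Return (file, key, value) tuples for sensitive-looking stored strings."""
    hits = []
    for xml_file in sorted(Path(data_dir).glob("shared_prefs/*.xml")):
        try:
            root = ET.parse(xml_file).getroot()
        except ET.ParseError:
            continue  # unreadable file: record nothing rather than guess
        for node in root.iter("string"):
            key = node.get("name", "")
            value = (node.text or "").strip()
            if SENSITIVE_KEY.search(key) and value:
                hits.append((xml_file.name, key, value))
    return hits
```

An empty result is also evidence: the concern about that storage location can fade out with a record of what was checked.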

The same thing happens with SDKs.

A model can warn that third-party SDKs may introduce privacy exposure. The warning is familiar enough to sound useful. But the SDK has not been identified. Its network destinations have not been observed. Its permissions have not been compared with actual traffic. No payload has been inspected.

Then the question changes. It is no longer “could this be risky?” It becomes: what is present, what runs, what leaves the app, and what evidence is attached to it?
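The difference between "present" and "runs" can be made explicit. The sketch below assumes two inputs a harness would already have: the package names found in the unpacked app and the hosts seen in captured traffic. The catalog entries, host names, and the `classify_sdks` helper are hypothetical placeholders.

```python
from dataclasses import dataclass, field

# Hypothetical catalog: package prefix -> hosts that SDK is expected to contact.
SDK_CATALOG = {
    "com.example.analytics": {"collect.analytics.example"},
    "com.example.ads": {"bid.ads.example", "cdn.ads.example"},
}


@dataclass
class SdkStatus:
    package: str
    present: bool
    observed_hosts: set = field(default_factory=set)

    @property
    def active(self):
        # Active means: shipped in the app AND seen talking to its hosts.
        return self.present and bool(self.observed_hosts)


def classify_sdks(app_packages, observed_hosts):
    """Separate SDKs that are merely bundled from SDKs that produced traffic."""
    seen_hosts = set(observed_hosts)
    statuses = []
    for prefix, hosts in SDK_CATALOG.items():
        present = any(
            pkg == prefix or pkg.startswith(prefix + ".") for pkg in app_packages
        )
        matched = hosts & seen_hosts if present else set()
        statuses.append(SdkStatus(prefix, present, matched))
    return statuses
```

An SDK that is present but never active is a different statement from one that was observed sending data, and the report can say which one applies.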

Anthropic has described the same pattern in long-running agent work: the gains did not come from the model in isolation. The workflow around it changed too: prompts, tools, context management, feedback loops. OpenAI’s writing on harness engineering points to the same kind of work around agents: acceptance criteria, validation, missing tools, guardrails, documentation.

Those examples are about software engineering. In mobile security, the same pattern appears around the artifact, the runtime, the tool output, and the next test the model is able to run.

A polished answer can still leave the app untouched.

A Toolbox Is Not a Harness

There is a lazy version of agentic security: give the model a pile of tools and call it an agent.

The failure modes are predictable. Too much raw output enters the context. Too many tools are available too early. Scanner results arrive without enough interpretation. Weak signals become confident findings. A suspicious string becomes a secret. A permission becomes a privacy violation. An unusual endpoint becomes an exploitable backend issue.

A useful harness is quieter. It decides what the model sees first, which tools belong to the next step, what should be remembered, and what level of evidence is required before something becomes a finding.

In mobile security, almost every signal needs that context. A permission is not automatically a privacy issue. A third-party SDK is not automatically malicious. A stored value is not automatically sensitive. A suspicious request is not automatically exploitable.
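The evidence requirement can be written down as a rule rather than left to the model's confidence. This is a minimal sketch under stated assumptions: the three evidence levels, the `Lead` type, and the `triage` function are hypothetical names chosen for illustration, and a real harness would likely use finer-grained levels.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Evidence(IntEnum):
    STATIC_HINT = 1       # e.g. a string or a manifest entry
    RUNTIME_OBSERVED = 2  # behavior seen while the app runs
    REPRODUCED = 3        # a second run shows the same behavior


@dataclass
class Lead:
    title: str
    evidence: list = field(default_factory=list)  # (Evidence, note) pairs

    def add(self, level, note):
        self.evidence.append((level, note))

    def strongest(self):
        return max((level for level, _ in self.evidence), default=None)


def triage(lead, required=Evidence.RUNTIME_OBSERVED):
    """Promote a lead to a finding only when its evidence meets the bar."""
    top = lead.strongest()
    if top is None:
        return "discard"
    return "finding" if top >= required else "needs-more-evidence"
```

A declared permission alone lands in `needs-more-evidence`. It becomes a finding only after something was observed at runtime.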

A Permission Is Not a Finding

Take location access.

A model can explain why location access may create privacy risk. That is useful background. It is not a finding.

The permission has to be checked at runtime. The SDKs have to be identified. The traffic has to be observed. The destination domains have to be reviewed. The payloads have to be inspected. The app’s declared purpose has to be compared with what the app actually sends.

Only then does the question become useful.

When is location collected? Which component collects it? Where does it go? Is it sent with identifiers? Is it tied to an analytics SDK, an ad SDK, a backend endpoint, or a feature the user actually triggered?
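Several of these questions can be asked of captured traffic directly. The sketch below assumes requests have already been intercepted and reduced to dictionaries with a `host`, a JSON `body`, and an optional `user_triggered` flag set by the harness. Those field names, the key lists, and `location_flows` are assumptions for illustration only.

```python
import json

# Hypothetical key names; real payloads vary by SDK and backend.
LOCATION_KEYS = {"lat", "latitude", "lon", "lng", "longitude"}
IDENTIFIER_KEYS = {"advertising_id", "device_id", "android_id", "idfa"}


def walk_keys(value):
    """Yield every key in a nested JSON structure, lowercased."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield str(key).lower()
            yield from walk_keys(child)
    elif isinstance(value, list):
        for item in value:
            yield from walk_keys(item)


def location_flows(requests):
    """List requests whose JSON body carries location, noting identifiers."""
    flows = []
    for req in requests:
        try:
            keys = set(walk_keys(json.loads(req.get("body"))))
        except (ValueError, TypeError):
            continue  # body missing or not JSON: out of scope for this check
        if keys & LOCATION_KEYS:
            flows.append({
                "host": req.get("host"),
                "with_identifier": bool(keys & IDENTIFIER_KEYS),
                "user_triggered": bool(req.get("user_triggered", False)),
            })
    return flows
```

Location sent to the app's own backend after a user search reads very differently from location sent with a device identifier to an analytics host in the background.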

From the outside, the app looked like one thing. Under inspection, it becomes layers: code, permissions, SDKs, storage, traffic, backend calls, and platform behavior.

The category was visible from the beginning. The finding only appears after the evidence moves.

GEF Makes the Harness Concrete

The same pattern is easier to see in native exploitation.

A general-purpose LLM can describe the steps: reverse the native library, inspect unsafe arithmetic, build a crash trigger, look at registers, refine a primitive. The methodology can be correct while the target remains untouched.

In one Android JNI case, the model was connected to GEF (GDB Enhanced Features), an extension for GDB aimed at exploit development and reverse engineering.

The agent reversed the native library and found a 64-bit-to-32-bit integer truncation in an image-size calculation. The truncated value controlled the heap allocation size. The original 64-bit value controlled the copy length.

The allocation was smaller than the copy.
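The arithmetic behind that mismatch can be modeled without the target. This sketch assumes a width-times-height-times-bytes-per-pixel size formula; the actual calculation in the case above is not given, so the formula, the dimensions, and the `plan_copy` name are illustrative.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def plan_copy(width, height, bytes_per_pixel=4):
    """Model a size computed in 64 bits but stored in a 32-bit variable."""
    size64 = (width * height * bytes_per_pixel) & MASK64
    alloc_size = size64 & MASK32  # truncated value drives the allocation
    copy_len = size64             # untruncated value drives the copy
    return alloc_size, copy_len


def overflows(width, height, bytes_per_pixel=4):
    """True when the copy would run past the end of the allocation."""
    alloc_size, copy_len = plan_copy(width, height, bytes_per_pixel)
    return copy_len > alloc_size
```

With the illustrative dimensions `width=0x10000` and `height=0x4001`, the 64-bit size is `0x100040000` bytes while the truncated allocation is `0x40000` bytes, so the copy overruns the buffer by 4 GiB.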

Input was crafted. The crash reproduced. An indirect function pointer was overwritten with a recognizable 64-bit marker. The marker appeared in the crash state.

The primitive was refined into a controlled native function call with an attacker-controlled first argument. Execution was redirected to Android’s native library loader. A purpose-built shared library ran inside the application process.

The logs showed execution. The library’s initialization routine created proof artifacts.

The path did not stop at “possible heap overflow.” It moved through the target: calculation, allocation, crash, control, call primitive, loader, execution, proof.

The model provided the reasoning. GEF kept the reasoning attached to the running process.

The Work Has to Be Reviewable

Once an agent can act, another question appears.

What did it inspect? Which tool output changed the conclusion? What evidence was collected? What assumptions were made? Where did the investigation stop?

In offensive security and mobile testing, a finding is useful only if someone can understand how the system reached it. If an agent reports sensitive data exposure, the reviewer needs the path: the app behavior, the storage location or request, the payload, the affected data, and the reasoning that connects the evidence to the finding.

If the agent decides not to report something, that decision needs a path too. Maybe the permission was declared but never used. Maybe the SDK was present but inactive. Maybe the endpoint looked unusual but did not expose sensitive behavior.

The same applies to exploitation. If the agent claims code execution, the reviewer should not be left with a sentence that says “exploitation succeeded.” The chain of evidence should be visible, from the vulnerable calculation to the runtime proof.

Without that visibility, the output becomes difficult to defend. It may be correct, but the team has no trail. It may be wrong, but it may still arrive with enough fluency to look convincing.

A harness records what happened: permissions, tool boundaries, approval points, evidence logs, reproducible steps, and stopping conditions.
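One way to make such a record hard to rewrite after the fact is to chain the entries together. The hash chaining here is an addition for illustration, not something the text above prescribes, and the `EvidenceLog` class and its field names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64


def _digest(body):
    # Canonical JSON so the same entry always hashes to the same value.
    return hashlib.sha256(
        json.dumps(body, sort_keys=True).encode("utf-8")
    ).hexdigest()


class EvidenceLog:
    """Append-only log where each entry commits to the one before it."""

    def __init__(self):
        self.entries = []

    def record(self, step, tool, output, decision):
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        body = {"step": step, "tool": tool, "output": output,
                "decision": decision, "prev": prev}
        entry = {**body, "hash": _digest(body)}
        self.entries.append(entry)
        return entry["hash"]

    def verify(self):
        """Return True if no entry was altered, removed, or reordered."""
        prev = GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev"] != prev or entry["hash"] != _digest(body):
                return False
            prev = entry["hash"]
        return True
```

The `decision` field carries negative results too, so "permission declared but never used" leaves the same kind of trail as a reported finding.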

The agent investigates. The team still needs to see the investigation.

From Report-Shaped Text to Defensible Testing

With a harness in place, the model is no longer only listing what should be checked. It is working through app evidence, tool outputs, runtime behavior, traffic, dependency signals, exploitation traces, and reproducible steps. The result of one step changes the next step.

The output also changes shape. Instead of a cleaner checklist, there is a path through the target. Instead of a plausible paragraph, there are artifacts a reviewer can inspect. Instead of “this may be risky,” there is a chain that can be challenged.

A security team can open the artifact. Follow the trace. Replay the steps. Inspect the runtime proof.

Not a model that sounds like a pentester.

A trace that can survive one.

Tags:

Mobile Security, Ostorlab, AppSec, Security Automation, agentic harness, AI Security

