Wed 16 April 2025
1. Understanding Android App Obfuscation
In the Android ecosystem, obfuscation serves as an important defense mechanism for application developers, protecting intellectual property and sensitive code from reverse engineering attempts. By raising the cost of reverse engineering, it deters casual attackers.
Although obfuscation is a hardening measure, it also creates significant challenges for security researchers attempting to identify vulnerabilities. We looked into this as part of pushing Ostorlab's automation beyond technical vulnerability research toward the automated discovery of logic vulnerabilities.
Obfuscation transforms clear, readable code into deliberately complex and cryptic versions while preserving functionality. When developers obfuscate their code, they're essentially removing semantic meaning while preserving syntactic correctness - the code still works, but becomes nearly impossible to understand.
Consider this simple example:
// Original code
public void validateUserCredentials(String username, String password) {
    if (isValidUsername(username) && isCorrectPassword(password)) {
        grantAccess();
    } else {
        denyAccess();
    }
}
// Obfuscated code
public void a(String b, String c) {
    if (d(b) && e(c)) {
        f();
    } else {
        g();
    }
}
While functionally identical, the obfuscated version obscures the purpose and logic of the code, making it challenging to understand during security analysis.
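To make the renaming step concrete, here is a minimal Python sketch of how an obfuscator might hand out short names. It is an illustration only, not how ProGuard or R8 are implemented; `short_names` and `build_rename_map` are hypothetical helpers introduced for this example.

```python
import itertools
import string


def short_names():
    """Yield a, b, ..., z, aa, ab, ... in order."""
    for length in itertools.count(1):
        for letters in itertools.product(string.ascii_lowercase, repeat=length):
            yield "".join(letters)


def build_rename_map(identifiers):
    """Map each identifier to the next unused short name."""
    names = short_names()
    return {identifier: next(names) for identifier in identifiers}


# Identifiers from the example above, in order of first appearance.
identifiers = [
    "validateUserCredentials", "username", "password",
    "isValidUsername", "isCorrectPassword", "grantAccess", "denyAccess",
]
rename_map = build_rename_map(identifiers)
print(rename_map["validateUserCredentials"], rename_map["denyAccess"])  # a g
```

The mapping only touches names: the parameter types, the branch structure, and the call order are unchanged, which is the property the later sections rely on.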
Modern Android obfuscation involves several techniques and tools:
- ProGuard, the traditional open-source tool, has been integrated into Android builds for years, providing basic name obfuscation and code shrinking.
- R8, Google's replacement for ProGuard, offers enhanced optimization capabilities and tight integration with the Android build system; it has been the default code shrinker in the Android Gradle Plugin since version 3.4.
For applications requiring stronger protection, commercial tools might provide more advanced features including string encryption, control flow obfuscation, and anti-debugging measures.
2. Prevalence of Obfuscation in Android Ecosystem
The use of obfuscation in Android applications is widespread and growing, particularly among high-value applications and those handling sensitive data.
Research from "A Large Scale Investigation of Obfuscation Use in Google Play" (2018) found approximately 25% of all Android apps employ some form of obfuscation. The study analyzed 1.7 million free Android apps and discovered that when focusing on only the most popular applications, this figure jumps to around 50%.
A more recent paper (February 2025, "An Empirical Study of Code Obfuscation Practices in Android Apps") documented a 13% increase in obfuscation adoption from 2016 to 2023, with a particularly significant 28% rise among top developers during that period. Our own analysis points the same way: nearly 62% of the applications we tested used obfuscation.
The prevalence of obfuscation varies significantly across application categories. Financial and banking apps show nearly universal adoption, gaming apps frequently employ aggressive obfuscation to protect monetization algorithms and anti-cheat mechanisms, and enterprise apps implement moderate protection strategies to safeguard proprietary business logic. Social media platforms are increasingly adopting sophisticated obfuscation to protect user data handling mechanisms.
ProGuard/R8 remains the most widely used solution due to its integration with Android Studio and zero cost. However, commercial solutions show strong presence in high-value applications. Custom obfuscation solutions are increasingly common in apps handling particularly sensitive data, such as financial services and healthcare applications.
3. Effect of Obfuscation on Security Analysis
Obfuscation presents substantial challenges for both manual and automated security analysis, creating an interesting security paradox.
For human analysts examining code, obfuscation removes all the contextual clues that typically aid understanding. Without meaningful names, researchers must painstakingly trace through execution paths to understand what each function does. Consider a banking app's authentication system - what might normally be identified as verifyUserCredentials() becomes simply LIZIZI(), hiding its security-critical nature.
Control flow is deliberately convoluted, with extra jumps, conditional branches, and unnecessary indirection. String constants that might provide hints about functionality are encrypted, and component relationships become nearly impossible to discern.
A security researcher might spend days if not weeks deciphering an authentication flow that would take hours in unobfuscated code.
Automated analysis tools suffer even more dramatically. Taint analysis systems track how data flows from sensitive sources (like user input) to dangerous sinks (like SQL queries). When examining SQL injection, a tool normally identifies methods like executeQuery() as potential sink points. With obfuscation, these methods become unrecognizable - LIIZZ() could be anything from logging to database access. This fundamentally breaks the core mechanism of these tools.
For example, a common security analysis pattern is identifying methods that access sensitive system resources. Consider this code for accessing location:
// Original code
LocationManager locationManager = (LocationManager) getSystemService(Context.LOCATION_SERVICE);
Location location = locationManager.getLastKnownLocation(LocationManager.GPS_PROVIDER);
// Obfuscated version
Object a = b(c.d);
Object e = ((f) a).g(f.h);
Automated tools looking for location access patterns would completely miss the obfuscated version, potentially overlooking privacy issues.
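The failure mode is easy to reproduce. The sketch below is a deliberately naive, name-based sink detector written for illustration; the `KNOWN_SINKS` list and the call lists are assumptions, and real taint analyzers are far more elaborate.

```python
# Hypothetical sink catalogue; real tools ship much larger ones.
KNOWN_SINKS = {"executeQuery", "rawQuery", "execSQL"}


def find_sink_calls(called_methods):
    """Return the called method names that match a known sink by name."""
    return [name for name in called_methods if name in KNOWN_SINKS]


original_calls = ["getText", "executeQuery", "close"]
obfuscated_calls = ["LIZ", "LIIZZ", "LJ"]  # same calls after renaming

print(find_sink_calls(original_calls))    # ['executeQuery']
print(find_sink_calls(obfuscated_calls))  # []
```

Nothing about the program's behavior changed between the two lists, yet the detector reports no sinks for the second one.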
Similarly, fuzzing tools need to understand which inputs affect security-critical paths. Obfuscation conceals these relationships. Static analysis systems that match known vulnerability patterns fail when those patterns are changed by obfuscation.
This creates a paradoxical security situation. Obfuscation effectively deters casual attackers with limited resources, but also hinders legitimate security analysis. Organizations may mistakenly believe obfuscation has eliminated vulnerabilities rather than merely concealing them. Security audits become significantly more expensive and time-consuming, sometimes leading companies to reduce their frequency or scope.
4. Beating Obfuscation with DalvikFLIRT
4.1 Understanding FLIRT Technology
FLIRT (Fast Library Identification and Recognition Technology) is a signature-matching technique originally developed for native binary analysis in tools like IDA Pro. At its core, FLIRT enables the identification of standard functions in compiled code regardless of naming.
The technology creates distinctive signatures based on code patterns rather than relying on easily-changeable identifiers like function names. These signatures capture the essential structure and behavior of functions, making them resistant to basic obfuscation techniques.
Traditionally, FLIRT has been used primarily for native code analysis (C/C++), where it helps identify standard library functions in compiled binaries. FLIRT signatures encapsulate multiple distinctive characteristics of each function, creating a fingerprint that remains recognizable even when names and simple code patterns have been altered.
Original Function | Obfuscated Function
---------------- | -------------------
Function name: strcmp | Function name: fn_0x3A72
Signature: [Pattern of bytes/opcodes] | Signature: [Identical pattern]
Parameters: 2 string pointers | Parameters: 2 string pointers
Structure: Loop comparing characters | Structure: Loop comparing characters
Return: Integer (0 if equal) | Return: Integer (0 if equal)
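The core idea can be sketched as a byte pattern in which variable bytes are wildcards. This is a simplified illustration under stated assumptions, not IDA's actual signature format or matching algorithm; `Pattern`, `matches`, and the sample byte strings are hypothetical.

```python
from typing import Optional, Sequence

# None marks a variable byte (for example a relocated call target).
Pattern = Sequence[Optional[int]]


def matches(pattern: Pattern, code: bytes) -> bool:
    """Return True if `code` starts with `pattern`, ignoring wildcard positions."""
    if len(code) < len(pattern):
        return False
    return all(p is None or p == b for p, b in zip(pattern, code))


# x86 prologue: push ebp; mov ebp, esp; call <rel32 target is wildcarded>
PROLOGUE = [0x55, 0x8B, 0xEC, 0xE8, None, None, None, None]

build_a = bytes([0x55, 0x8B, 0xEC, 0xE8, 0x10, 0x00, 0x00, 0x00])
build_b = bytes([0x55, 0x8B, 0xEC, 0xE8, 0xF4, 0x21, 0x00, 0x00])
other = bytes([0x53, 0x8B, 0xEC, 0xE8, 0x10, 0x00, 0x00, 0x00])

print(matches(PROLOGUE, build_a), matches(PROLOGUE, build_b), matches(PROLOGUE, other))
# True True False
```

Both builds match even though their call targets differ, and no function name is consulted at any point.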
4.2 Dalvik Properties for Signature-Based Identification
Dalvik bytecode, the format used by Android applications, presents unique opportunities for signature-based identification that differ from traditional binary analysis. Even after aggressive name obfuscation, the structural relationships between methods and fields typically remain intact, as changing them would break functionality.
Method signatures (parameter types and return values) must be preserved to maintain compatibility with the Android framework and other components. Access modifiers (public, private, protected) generally cannot be changed without altering application behavior, and inheritance relationships often remain unchanged to preserve functionality.
In some cases, debug information like source file names and line numbers may be retained, providing additional identification points. Most importantly, calls to Android framework APIs must remain consistent, creating recognizable patterns that serve as anchors for analysis.
These persistent properties create an opportunity for FLIRT-like signature matching specifically tailored to the Dalvik environment.
###### Class android.support.v7.internal.app.AppCompatViewInflater (android.support.v7.internal.app.AppCompatViewInflater)
.class public Landroid/support/v7/internal/app/AppCompatViewInflater;
.super Ljava/lang/Object;
.source "AppCompatViewInflater.java"
# static fields
.field private static final LOG_TAG:Ljava/lang/String; = "AppCompatViewInflater"
.field private static final sConstructorMap:Ljava/util/Map;
.annotation system Ldalvik/annotation/Signature;
value = {
"Ljava/util/Map",
"<",
"Ljava/lang/String;",
"Ljava/lang/reflect/Constructor",
"<+",
"Landroid/view/View;",
">;>;"
}
.end annotation
.end field
.field static final sConstructorSignature:[Ljava/lang/Class;
.annotation system Ldalvik/annotation/Signature;
value = {
"[",
"Ljava/lang/Class",
"<*>;"
}
.end annotation
.end field
# instance fields
.field private final mConstructorArgs:[Ljava/lang/Object;
.method private createView(Landroid/content/Context;Ljava/lang/String;Ljava/lang/String;)Landroid/view/View;
.registers 9
.param p1, "context" # Landroid/content/Context;
.param p2, "name" # Ljava/lang/String;
.param p3, "prefix" # Ljava/lang/String;
.annotation system Ldalvik/annotation/Throws;
value = {
Ljava/lang/ClassNotFoundException;,
Landroid/view/InflateException;
}
.end annotation
.prologue
.line 143
sget-object v3, Landroid/support/v7/internal/app/AppCompatViewInflater;->sConstructorMap:Ljava/util/Map;
invoke-interface {v3, p2}, Ljava/util/Map;->get(Ljava/lang/Object;)Ljava/lang/Object;
move-result-object v1
check-cast v1, Ljava/lang/reflect/Constructor;
.line 146
.local v1, "constructor":Ljava/lang/reflect/Constructor;, "Ljava/lang/reflect/Constructor<+Landroid/view/View;>;"
if-nez v1, :cond_36
.line 148
:try_start_a
invoke-virtual {p1}, Landroid/content/Context;->getClassLoader()Ljava/lang/ClassLoader;
move-result-object v4
if-eqz p3, :cond_43
new-instance v3, Ljava/lang/StringBuilder;
invoke-direct {v3}, Ljava/lang/StringBuilder;-><init>()V
invoke-virtual {v3, p3}, Ljava/lang/StringBuilder;->append(Ljava/lang/String;)Ljava/lang/StringBuilder;
move-result-object v3
invoke-virtual {v3, p2}, Ljava/lang/StringBuilder;->append(Ljava/lang/String;)Ljava/lang/StringBuilder;
move-result-object v3
invoke-virtual {v3}, Ljava/lang/StringBuilder;->toString()Ljava/lang/String;
move-result-object v3
:goto_21
invoke-virtual {v4, v3}, Ljava/lang/ClassLoader;->loadClass(Ljava/lang/String;)Ljava/lang/Class;
move-result-object v3
const-class v4, Landroid/view/View;
invoke-virtual {v3, v4}, Ljava/lang/Class;->asSubclass(Ljava/lang/Class;)Ljava/lang/Class;
move-result-object v0
.line 151
.local v0, "clazz":Ljava/lang/Class;, "Ljava/lang/Class<+Landroid/view/View;>;"
sget-object v3, Landroid/support/v7/internal/app/AppCompatViewInflater;->sConstructorSignature:[Ljava/lang/Class;
invoke-virtual {v0, v3}, Ljava/lang/Class;->getConstructor([Ljava/lang/Class;)Ljava/lang/reflect/Constructor;
move-result-object v1
.line 152
sget-object v3, Landroid/support/v7/internal/app/AppCompatViewInflater;->sConstructorMap:Ljava/util/Map;
invoke-interface {v3, p2, v1}, Ljava/util/Map;->put(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;
.line 154
.end local v0 # "clazz":Ljava/lang/Class;, "Ljava/lang/Class<+Landroid/view/View;>;"
:cond_36
const/4 v3, 0x1
invoke-virtual {v1, v3}, Ljava/lang/reflect/Constructor;->setAccessible(Z)V
.line 155
iget-object v3, p0, Landroid/support/v7/internal/app/AppCompatViewInflater;->mConstructorArgs:[Ljava/lang/Object;
invoke-virtual {v1, v3}, Ljava/lang/reflect/Constructor;->newInstance([Ljava/lang/Object;)Ljava/lang/Object;
move-result-object v3
check-cast v3, Landroid/view/View;
:try_end_42
.catch Ljava/lang/Exception; {:try_start_a .. :try_end_42} :catch_45
.line 159
:goto_42
return-object v3
:cond_43
move-object v3, p2
.line 148
goto :goto_21
.line 156
:catch_45
move-exception v2
.line 159
.local v2, "e":Ljava/lang/Exception;
const/4 v3, 0x0
goto :goto_42
.end method
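In the listing above, the calls into `java.*` and `android.content.*` are the kind of anchor that survives renaming. As a rough sketch, they can be pulled out of smali text with a regular expression. The regex and the `is_framework_class` heuristic are assumptions made for this example (support-library classes are excluded because they ship inside the APK and can be renamed), and the `LX/i7X;->LIZ(I)V` line is a made-up obfuscated call.

```python
import re

# Matches e.g. "invoke-virtual {v3, p3}, Lpkg/Class;->name(args)ret"
INVOKE_RE = re.compile(r"invoke-[\w/]+ \{[^}]*\}, (\S+?)->(\S+)")


def is_framework_class(class_name: str) -> bool:
    """Heuristic: platform classes keep their names, bundled libraries may not."""
    if class_name.startswith("Landroid/support/"):
        return False
    return class_name.startswith(("Ljava/", "Landroid/"))


def framework_calls(smali_lines):
    """Return 'Class->method' strings for invocations of framework methods."""
    calls = []
    for line in smali_lines:
        match = INVOKE_RE.search(line)
        if match and is_framework_class(match.group(1)):
            calls.append(f"{match.group(1)}->{match.group(2)}")
    return calls


sample = [
    "sget-object v3, Landroid/support/v7/internal/app/AppCompatViewInflater;->sConstructorMap:Ljava/util/Map;",
    "invoke-interface {v3, p2}, Ljava/util/Map;->get(Ljava/lang/Object;)Ljava/lang/Object;",
    "invoke-virtual {p1}, Landroid/content/Context;->getClassLoader()Ljava/lang/ClassLoader;",
    "invoke-static {v0}, LX/i7X;->LIZ(I)V",
]
anchors = framework_calls(sample)
```

For the sample lines, `anchors` holds the `Map.get` and `Context.getClassLoader` calls only; the field access and the renamed application call are ignored.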
Understanding typical Android obfuscation patterns is essential for effective signature development. The most widespread technique is identifier renaming, where classes, methods, and fields receive meaningless names like single letters or random strings.
5. How DalvikFLIRT Works
DalvikFLIRT applies the signature-matching concept to the Android ecosystem with important adaptations for Dalvik bytecode specifics.
5.1 Generate a Signature of a Class
DalvikFLIRT generates signatures for classes by collecting a wide range of attributes. Each class signature includes structural information (name, superclass, access flags, interfaces), source code metadata when available, fields with their types and modifiers, string constants from both fields and methods, and detailed signatures for each method. The system also calculates a unique fingerprint hash combining the most distinctive features of the class.
For methods, the signatures are even more detailed, capturing method descriptors, access flags, bytecode hashes, instruction counts, API calls, constants, opcode distributions, and control flow graph characteristics.
This multi-faceted approach creates a rich signature that captures both structural and behavioral attributes, remaining recognizable despite obfuscation.
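As a small illustration of two of the method-level attributes, the sketch below derives an instruction count and an opcode distribution from a list of Dalvik opcode values. The function name `method_features` is hypothetical, and the way DalvikFLIRT actually extracts these values is not shown in this article. The opcode values used are the ones recorded for `<clinit>` in the sample signature in this section.

```python
from collections import Counter


def method_features(opcodes):
    """Summarize a method body given its sequence of opcode values."""
    distribution = Counter(opcodes)
    return {
        "instruction_count": len(opcodes),
        # JSON object keys are strings, so opcode values are stringified.
        "opcode_distribution": {str(op): n for op, n in distribution.items()},
    }


# new-instance (0x22), invoke-direct (0x70), sput-object (0x69), return-void (0x0e)
clinit_opcodes = [0x22, 0x70, 0x69, 0x0E]
features = method_features(clinit_opcodes)
print(features["instruction_count"])  # 4
```

Because the summary counts opcodes rather than hashing operands, renaming the class or its fields leaves it unchanged.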
{
"classes": {
"Lcom/example/myapplication/ComposableSingletons$MainActivityKt$lambda-1$1;": {
"name": "Lcom/example/myapplication/ComposableSingletons$MainActivityKt$lambda-1$1;",
"superclass": "Lkotlin/jvm/internal/Lambda;",
"access_flags": 16,
"interfaces": [
"Lkotlin/jvm/functions/Function3;"
],
"source_file": null,
"fingerprint": "2f1b337d2e6092b6775b2c9059222361",
"fields": [
{
"name": "INSTANCE",
"type": "Lcom/example/myapplication/ComposableSingletons$MainActivityKt$lambda-1$1;",
"access_flags": 25
}
],
"field_strings": [],
"string_constants": [
"innerPadding",
"com.example.myapplication.ComposableSingletons$MainActivityKt.lambda-1.<anonymous> (MainActivity.kt:22)",
"Android",
"C22@889L139:MainActivity.kt#ptgicz"
],
"methods": {
"<clinit>()V": {
"name": "<clinit>",
"descriptor": "()V",
"access_flags": 65544,
"bytecode_hash": "2552c2aefad67c2d666996fc73085a74",
"instruction_count": 4,
"api_calls": [
21
],
"constants": [],
"opcode_distribution": {
"34": 1,
"112": 1,
"105": 1,
"14": 1
},
"cfg": {
"basic_blocks": 1,
"edges": 0,
"exception_handlers": 0
},
"registers": 1
},
"<init>()V": {
"name": "<init>",
"descriptor": "()V",
"access_flags": 65536,
"bytecode_hash": "0e486b2e5a245ee4b0e601cc7ddbbce9",
"instruction_count": 3,
"api_calls": [
61
],
"constants": [
[
"numeric",
3
]
],
"opcode_distribution": {
"18": 1,
"112": 1,
"14": 1
},
"cfg": {
"basic_blocks": 1,
"edges": 0,
"exception_handlers": 0
},
5.2 Computing Resemblance Metrics
When comparing an obfuscated class against signature databases, DalvikFLIRT employs a weighted similarity calculation. Different components contribute to the overall score based on their resilience to obfuscation:
METHOD_WEIGHTS = {
    "bytecode_hash": 0.0,
    "descriptor": 0.25,
    "constants": 0.25,
    "opcode_distribution": 0.3,
    "cfg": 0.2,
}
CLASS_WEIGHTS = {
    "metrics": 0.2,
    "superclass": 0.1,
    "interfaces": 0.1,
    "fields": 0.1,
    "methods": 0.5,
}
Method descriptors are highly reliable because parameter types and return values rarely change without breaking functionality. String constants often survive basic obfuscation, providing semantic clues. The opcode distribution creates a statistical fingerprint of the method's behavior, while the control flow graph structure reflects the logic patterns that must be preserved for correct operation.
At the class level, class metrics (like method and field counts), inheritance relationships, and interface implementations all contribute to identification. The weighted approach means DalvikFLIRT can recognize matches even when some aspects have changed, provided enough distinctive characteristics remain.
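A minimal sketch of a weighted method comparison follows. The weights are the ones listed above; the per-component similarity functions (exact match, Jaccard, cosine, and count ratios) are assumptions chosen for illustration, since the article does not specify how each component is scored.

```python
import math

METHOD_WEIGHTS = {
    "bytecode_hash": 0.0, "descriptor": 0.25, "constants": 0.25,
    "opcode_distribution": 0.3, "cfg": 0.2,
}


def cosine(a, b):
    """Cosine similarity between two opcode-count dictionaries."""
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def jaccard(a, b):
    """Overlap of two constant lists; entries are lists such as ["numeric", 3]."""
    a, b = set(map(tuple, a)), set(map(tuple, b))
    return 1.0 if not a and not b else len(a & b) / len(a | b)


def ratio(x, y):
    return 1.0 if x == y else min(x, y) / max(x, y)


def cfg_similarity(a, b):
    keys = ("basic_blocks", "edges", "exception_handlers")
    return sum(ratio(a[k], b[k]) for k in keys) / len(keys)


def method_similarity(m1, m2):
    scores = {
        "bytecode_hash": float(m1["bytecode_hash"] == m2["bytecode_hash"]),
        "descriptor": float(m1["descriptor"] == m2["descriptor"]),
        "constants": jaccard(m1["constants"], m2["constants"]),
        "opcode_distribution": cosine(m1["opcode_distribution"], m2["opcode_distribution"]),
        "cfg": cfg_similarity(m1["cfg"], m2["cfg"]),
    }
    return sum(METHOD_WEIGHTS[name] * scores[name] for name in METHOD_WEIGHTS)


reference = {
    "descriptor": "()V", "bytecode_hash": "0e486b2e5a245ee4b0e601cc7ddbbce9",
    "constants": [["numeric", 3]],
    "opcode_distribution": {"18": 1, "112": 1, "14": 1},
    "cfg": {"basic_blocks": 1, "edges": 0, "exception_handlers": 0},
}
# Hypothetical obfuscated counterpart: same structure, different raw bytes.
candidate = dict(reference, bytecode_hash="ffffffffffffffffffffffffffffffff")
score = method_similarity(reference, candidate)
```

With `bytecode_hash` weighted at 0.0, a method whose raw bytes changed but whose structure did not still scores close to 1.0.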
5.3 The tool in Action
In real-world testing, we applied DalvikFLIRT to popular applications, some with over 250,000 classes. The following output shows a sample match:
DalvikFLIRT Simplified Signature Matcher
Signatures1: tests/files/app-debug_signatures.json
Signatures2: tests/files/Media_APKPure_signatures.json
Output: matches_media.json
Threshold: 0.8
Loading signatures from tests/files/app-debug_signatures.json
Loaded 12288 class signatures
Loading signatures from tests/files/media_APKPure_signatures.json
Loaded 279796 class signatures
The matcher identified that the obfuscated class LX/i7X corresponds to Landroidx/concurrent/futures/AbstractResolvableFuture with 83.9% confidence:
Class Score Breakdown (LX/i7X; vs Landroidx/concurrent/futures/AbstractResolvableFuture;):
- Metrics : Score=0.451, Contrib=0.090 (Weight=0.2)
- Superclass : Score=1.000, Contrib=0.100 (Weight=0.1)
- Interfaces : Score=1.000, Contrib=0.100 (Weight=0.1)
- Fields : Score=1.000, Contrib=0.100 (Weight=0.1)
- Methods : Score=0.898, Contrib=0.449 (Weight=0.5)
Total Class Score: 0.8390
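The total is the weighted sum of the component scores. The short sketch below recomputes it from the rounded values printed in the breakdown; `class_score` is a hypothetical helper name.

```python
CLASS_WEIGHTS = {
    "metrics": 0.2, "superclass": 0.1, "interfaces": 0.1,
    "fields": 0.1, "methods": 0.5,
}


def class_score(component_scores):
    """Weighted sum of per-component similarity scores."""
    return sum(CLASS_WEIGHTS[name] * component_scores[name] for name in CLASS_WEIGHTS)


breakdown = {
    "metrics": 0.451, "superclass": 1.0, "interfaces": 1.0,
    "fields": 1.0, "methods": 0.898,
}
total = class_score(breakdown)
print(round(total, 3))  # 0.839
```

The result lands just above the 0.8 threshold from the run configuration, so the pair is reported as a match. The small difference from the printed 0.8390 comes from using the rounded component scores.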
This identification works despite complete name obfuscation. For instance, the method named addListener in the original code was renamed to LIZJ(Ljava/lang/Runnable;Ljava/util/concurrent/Executor;)V in the obfuscated version. DalvikFLIRT correctly matched them based on their parameter types, control flow, and opcode patterns.
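Parameter types can be compared once a descriptor is split into its parts. The sketch below parses a Dalvik method descriptor and, as an additional assumption not described in the article, replaces application-defined class names with a placeholder so that renamed types still compare equal. `STABLE_PREFIXES` is a hypothetical heuristic.

```python
import re

# One type: optional array dimensions, then a primitive or an object type.
TYPE_RE = re.compile(r"\[*(?:[ZBSCIJFDV]|L[^;]+;)")
STABLE_PREFIXES = ("Ljava/", "Landroid/")  # assumed to keep their names


def parse_descriptor(descriptor):
    """Split '(params)return' into a list of parameter types and the return type."""
    params_part, return_type = descriptor[1:].split(")", 1)
    return TYPE_RE.findall(params_part), return_type


def normalize_type(type_name):
    """Replace renameable class names with 'L?;', keeping array dimensions."""
    dims = len(type_name) - len(type_name.lstrip("["))
    base = type_name[dims:]
    if base.startswith("L") and not base.startswith(STABLE_PREFIXES):
        base = "L?;"
    return "[" * dims + base


def normalize_descriptor(descriptor):
    params, return_type = parse_descriptor(descriptor)
    return "(" + "".join(map(normalize_type, params)) + ")" + normalize_type(return_type)


params, ret = parse_descriptor("(Ljava/lang/Runnable;Ljava/util/concurrent/Executor;)V")
print(params, ret)
# ['Ljava/lang/Runnable;', 'Ljava/util/concurrent/Executor;'] V
```

For the `addListener` example, both parameter types are platform classes, so the descriptor is identical before and after obfuscation without any normalization.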
6. How About the Rest of the Code: LLM-Powered Code Rewrites
While DalvikFLIRT excels at matching known components from the Android SDK and common libraries, it won’t work for custom application code that has no known reference signature. This is where we leverage the power of Large Language Models to provide a complementary approach.
6.1 LLM-Powered Rewriters
Modern large language models have remarkable capabilities to understand, generate, and transform code based on patterns they've learned from vast code repositories. We've discovered that these LLMs can effectively reverse many types of obfuscation by inferring meaning from context and structure.
The key insight in our approach is reframing the problem. Rather than asking models to "deobfuscate" code (which many refuse to do due to potential misuse concerns), we present the task as "improving code readability" - a subtle but important distinction that allows models to engage with the task:
deobfuscator_agent = Agent(
    large_model,
    system_prompt=(
        "You are an expert Java developer. Given hard to read Java source code, "
        "your task is to: \n"
        "Rewrite the code to make it more readable.\n"
        "Please make sure to rewrite the whole code and DO NOT omit any function or method.\n"
        "Improve the quality of the code. Focus on renaming variables, methods, "
        "and classes to more descriptive names based on their context and usage. "
        "Maintain the original code's functionality. If no context is provided, "
        "do your best to rewrite based on the code alone. "
        "Feel free to rewrite control flow to make the code easier to understand. "
        "Do NOT change string constants as these might be used by other apps."
    ),
)
6.2 The Multi-Agent System
Our implementation uses a coordinated multi-agent system where specialized components work together to progressively improve understanding of the code:
- The Deobfuscation Agent rewrites obfuscated code for clarity while preserving functionality, focusing on providing meaningful names and simplifying complex control structures.
- The Explainer Agent provides plain-language descriptions of what code does, identifying inputs, outputs, and key behaviors. This helps provide context for further analysis.
- The Context-Building Agent integrates information from already-identified components (like those matched by DalvikFLIRT) to inform analysis of unknown sections.
Rather than operating independently, these agents share information, creating a feedback loop that progressively enhances understanding as more code is processed.
# Initialize the AI agent for deobfuscation
deobfuscator_agent = Agent(
    # model,
    large_model,
    system_prompt=(
        "You are an expert Java developer. Given a hard to read Java source code, "
        "your task is to: \n"
        "Rewrite the code to make it more readable.\n"
        "Please make sure to rewrite the whole code and DO NOT omit any function or method.\n"
        "Improve the quality of the code. Focus on renaming variables, methods, "
        "and classes to more descriptive names based on their context and usage. "
        "DO NOT keep short or cryptic variable, class and method names. "
        "Maintain the original code's functionality. If no context is provided, "
        "do your best to rewrite based on the code alone. "
        "Feel free to rewrite control flow to make the code easier to understand. "
        "Do NOT change string constants as these might be used by other apps."
    ),
)
# Initialize the AI agent for code explanation
code_explainer = Agent(
model,
system_prompt=(
"You are an expert code explainer. Analyze the provided source code and explain it in plain English. "
"Be thorough and cover: \n"
"1. The code's purpose and functionality\n"
"2. Inputs it expects\n"
"3. Outputs it produces\n"
"4. Edge cases and error handling\n"
"Use clear, simple language suitable for developers of all levels."
),
result_type=CodeExplanation,
)
6.3 Prompt Engineering for Optimal Results
The effectiveness of our LLM-based approach depends heavily on crafting the right prompts. We've developed several techniques that consistently produce better results:
Rather than directly mentioning "obfuscation" or "deobfuscation," we frame the task as improving readability of "hard-to-read" code. This subtle shift avoids triggering safety filters in models.
We provide context from Java and Kotlin API, unobfuscated libraries and DalvikFLIRT-identified components as reference points. For instance, if we know the application calls an identified encryption method, that knowledge helps the LLM correctly interpret the surrounding code.
Our process starts with simpler methods that have fewer dependencies, then progressively tackles more complex components. This allows the LLM to build context gradually, using information from earlier analyses to inform later ones.
Clear instructions about what to preserve (functionality) and what to enhance (names, structure) help guide the model toward useful transformations rather than just reformatting.
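The context-passing technique can be sketched as plain prompt assembly: summaries of already-understood callees are placed ahead of the code to be rewritten. This is a hypothetical illustration; `build_rewrite_prompt`, the heading strings, and the callee summary are assumptions rather than the prompts used in the actual system.

```python
def build_rewrite_prompt(method_source: str, known_callees: dict) -> str:
    """Assemble a user prompt that lists known callees before the code."""
    parts = []
    if known_callees:
        parts.append("Known information about the methods this code calls:")
        for name in sorted(known_callees):
            parts.append(f"- {name}: {known_callees[name]}")
        parts.append("")
    parts.append("Hard to read Java source code:")
    parts.append(method_source)
    return "\n".join(parts)


source = "public static void a(long j, String str, int i, Object obj) { ... }"
context = {"c.e(String)": "returns true when the string is a valid event name"}
prompt = build_rewrite_prompt(source, context)
print(prompt)
```

When no callee information is available, the context section is simply omitted and the model works from the code alone, as the system prompt instructs.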
Here's an example of what our system can achieve:
// Original obfuscated code
public static void a(long j, String str, int i, Object obj) {
    if (obj != null && c.e(str) && i >= 0) {
        new Thread(new d(j, str, i, obj)).start();
    }
}
// LLM-deobfuscated code
public static void logEventAsync(long userId, String eventName, int eventType, Object eventData) {
    if (eventData != null && ValidationUtils.isValidEventName(eventName) && eventType >= 0) {
        new Thread(new EventLoggingTask(userId, eventName, eventType, eventData)).start();
    }
}
The deobfuscated version not only provides meaningful names that reveal the method's purpose, but also infers the likely meaning of the component classes based on how they're used.
6.4 The Recursive Bottom-Up Analysis
We've found that combining DalvikFLIRT's signature matching with LLM-powered rewrites works best in a recursive bottom-up approach. Here's how the process unfolds:
First, we construct a complete call graph showing how all methods in the application relate to each other. This mapping reveals which methods call which others, creating a dependency hierarchy.
We then identify "leaf methods" - those that don't call other methods. These simpler functions make ideal starting points for analysis, as they have fewer dependencies.
With our initial set of methods deobfuscated (some by DalvikFLIRT signature matching, others by LLM analysis), we move up the call tree to methods that use these now-understood functions. The previously analyzed methods provide valuable context for understanding what these higher-level functions are doing.
As we continue up the call tree, we build a progressively richer understanding of the application's behavior. Each new method benefits from the context of all previously analyzed methods it calls.
Think of it like archeology - we start by understanding the simplest artifacts, then use that knowledge to interpret more complex ones, gradually reconstructing the entire civilization from the ground up.
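The ordering described above amounts to a post-order traversal of the call graph. Here is a minimal sketch, assuming the graph is a dictionary that maps each method to the methods it calls; `bottom_up_order` and the sample graph are hypothetical. It uses an explicit stack so that deep call chains do not hit Python's recursion limit, and the visited set also keeps recursive call cycles from looping forever.

```python
def bottom_up_order(call_graph):
    """Return methods so that each one appears after the methods it calls."""
    order, visited = [], set()
    for root in call_graph:
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(call_graph.get(root, ())))]
        while stack:
            method, callees = stack[-1]
            for callee in callees:
                if callee not in visited:
                    visited.add(callee)
                    stack.append((callee, iter(call_graph.get(callee, ()))))
                    break
            else:
                # All callees handled: this method can be analyzed now.
                stack.pop()
                order.append(method)
    return order


graph = {
    "main": ["login", "log"],
    "login": ["hash", "log"],
    "hash": [],
    "log": [],
}
print(bottom_up_order(graph))  # ['hash', 'log', 'login', 'main']
```

Leaf methods (`hash`, `log`) come first, so their results are available as context by the time `login` and `main` are processed.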
The following code shows how a method is analyzed in this context-aware system:
def analyze_method(self, node: MethodNode) -> Dict:
    """
    Perform full AI analysis on a method including explanation, deobfuscation and security audit.
    """
    # Get source code
    class_source_code = self.methods_engine.get_class_source(node)
    method_source_code = self.methods_engine.get_method_source(node)
    # Get context from callees (methods this one calls)
    callee_info = self.methods_engine.get_callees(node)
    # Deobfuscate the code using both DalvikFLIRT and LLM approaches
    deobfuscated_code = self.deobfuscate_code(
        class_source_code,
        method_source_code,
        context=callee_info,
    )
    # Generate a human-readable explanation
    explanation = self.explain_code(deobfuscated_code, node)
    # Audit the readable code for security issues
    # (audit_code is an assumed helper name used here for illustration)
    security_audit = self.audit_code(deobfuscated_code, node)
    # Store results for use as context in analyzing methods that call this one
    return {
        'source_code': method_source_code,
        'deobfuscated_code': deobfuscated_code,
        'explanation': explanation,
        'security_issues': security_audit,
    }
By combining these techniques - DalvikFLIRT's signature-based identification for known components and LLM-powered code rewriting for custom application logic - we've created a system that can effectively deobfuscate even heavily protected Android applications.
7. Conclusion and Future Directions
Our dual approach combining DalvikFLIRT's signature-based recognition with LLM-powered rewrites creates a powerful system for bypassing obfuscation in Android applications. This approach gives us unprecedented capabilities to understand what would otherwise be impenetrable code.
The key strength of our methodology is automatically identifying Android SDK components with high confidence, then using them as contextual anchors to generate high-quality code for application-specific logic. The system handles multiple obfuscation techniques simultaneously, from simple name obfuscation to more complex control flow manipulation.
The bottom-up recursive approach means the system becomes progressively more effective as it analyzes more code, each new insight improving subsequent analyses.
This work isn't about undermining security but advancing automation capabilities. By making obfuscated code comprehensible, we enable more thorough security analysis and vulnerability detection.