[idb_import] Broaden IDB import: function folders, variables, value locations, and type fidelity#8245
[idb_import] Broaden IDB import: function folders, variables, value locations, and type fidelity#8245ChrisKader wants to merge 16 commits into
Conversation
…ction When an IDB only records a section-relative base address (loading_base of zero, so we fall back to min_ea as a BaseSection), the rebase delta was computed against the lowest mapped *section* in the view. That over-shifts every imported address for formats where the first section starts after the file header. IDA's min_ea is the image base and maps the file header too, whereas a Mach-O's first section (__text) begins after the header and load commands. Aligning against the lowest mapped segment (segments include the header region) yields the correct delta and stops imported addresses from being shifted by the header size.
The "IDB Import refactor" introduced translate.rs (TILTranslator) as the type translator used by the mapper, but left the previous translator in types.rs behind. The module was never re-declared in lib.rs, so it has not been compiled or referenced since the refactor. Removing it drops a large block of dead code along with its stale TODOs; TILTranslator is now the single source of truth for IDB->BN type translation.
Resolve the outstanding translation TODOs in the TIL translator: - Size the variable-width C basic types (bool/short/int/long/long long/ long double) from the TIL header's compiler sizing info when a TIL is attached, falling back to the standard C ABI defaults. Both build_basic_ty and width_of_type now share these sizes so referenced-type placeholder widths stay in step with the types they stand in for. - Translate BoolSized to a real width: a 1-byte bool stays bool, any other width becomes an unsigned int of that size (BN bool is always one byte). - Honor pointer __ptr32 / __ptr64 modifiers to override the platform address size, and document that based/shifted pointers have no BN representation. - Detect variadic functions via the ellipsis calling convention instead of hardcoding has_variable_args to false. - Add udt extra_padding to the computed structure width so fixed-size UDTs occupy their true storage size. - Document the resolved design decisions for grouped (bitmask) enums, flexible array members, struct/union placeholder widths, the function return location, and the authoritative pointer address size.
- merged_types: carry an ordinal across the dedup when the kept entry lacks one, keeping name/ordinal lookups resolvable, and document that dir_tree types are clones of the same TIL definitions so no body merge is needed. - TIL decompression: read_til already inflates Zlib/Zstd sections via its section header, so document that and drop the stale "decompress til" TODO. - Function registers/stack variables: replace the dead exploratory block with a note scoping it as a follow-up feature (needs FunctionInfo and mapper support to apply named stack variables and register names).
- Populate IDBInfo.sha256 from the input file SHA256 recorded in the IDB so it is no longer always None, and drop the stale placeholder comment. - Mapper logs the recorded SHA256 and documents a future IDB verifier that would compare it against the mapped view before applying data. - Define the fallback `size_t` only when the view lacks one, so a real platform/view definition is never clobbered. - Document that the undo bracketing requires the mapper to be the sole writer, an invariant the run-once loader activity already guarantees. - Document the name-based (not range-based) section dedup rationale: the BN loader already maps the address space, so a range check would suppress every IDA segment. - Replace the remaining design-question TODOs (used-type ordering, attached TIL lookup, per-function platform tuple, OpenFileName filter naming) with decisions/notes explaining the current behavior and future direction.
The IDB records the SHA256 of its original input file. Walk to the root of the view's parent chain (the raw view, whose bytes are that on-disk file), hash it in 1 MiB chunks, and compare against the recorded hash. On mismatch we warn that the imported data may not correspond to the binary; on match we log the verification at debug level.
- Argument locations: translate IDA stack-passed argument locations (ArgLoc::Stack) into Binary Ninja parameter stack locations so explicit stack parameter placement is preserved. Register-encoded locations carry raw IDA register indices with no portable BN mapping and are left for analysis to derive. - Register variables: parse IDA "regvars" (a register renamed by the user within a function) into FunctionInfo, carrying them through the function merge, and apply them in the mapper by resolving the register by name and creating a user variable typed to the register width.
Parse each function's stack frame (named locals, saved registers and stack arguments) from the IDB along with its geometry (frsize/frregs), carry it on FunctionInfo through the function merge, and apply it in the mapper. IDA records the frame as a structure running from the bottom of the locals upward; Binary Ninja measures stack offsets from the return address, so an IDA frame offset is shifted down by local_size + saved_regs_size. Member offsets are the running sum of preceding member widths (the frame members carry no explicit offset), and the synthetic saved-register/return-address members are skipped while still advancing the offset. Variables are created as auto stack variables typed from their translated IDB types.
Two pieces of IDB data were parsed but never applied to the view: - is_no_return: mark functions IDA flags as non-returning (abort/exit/etc.) with set_auto_can_return(false) so analysis does not fall through calls to them. - Local labels: IDA's in-function named locations were folded into the name list, where map_name_to_view skips anything inside code. Route them through the dedicated map_label_to_view so they land as local-label symbols.
IDA lets users organize functions into folders in the Functions window, stored as a dirtree. Parse that hierarchy (preserving nested folders, not just the leaf functions) into FunctionFolderEntry, and recreate it in the view as Binary Ninja components: each folder becomes a component nested under its parent, and every function leaf is added to its folder's component. Functions sitting at the dirtree root are left uncomponented, matching their "no folder" state in IDA.
Expose the processor's register names (indexed by IDA register number) from the database and hand them to the type translator along with the architecture. Argument locations encoded as registers (Reg1, the Reg2 register pair, and register-relative RRel) are now resolved through those names into Binary Ninja registers and emitted as value locations, in addition to the stack locations already handled. Forms with no equivalent (distributed, static, custom) still fall back to the calling convention.
Function folders in Binary Ninja's symbol list are backed by the component API (the docs describe creating them "automatically via the API", linking to binaryninja.component.Component), so the component approach is correct. Improve the mapping so it does not depend on analysis having indexed the functions yet: capture the Ref<Function> returned when each function is created and key it by rebased address, then place those into folders directly (falling back to a view lookup only when needed). Add a summary log line reporting how many folders were created, how many functions were placed, and how many could not be found, and align terminology to "folder" to match the UI while the underlying type stays a component.
Reuse the register/stack location resolver to honor a function's explicit return location (function retloc) when the database records one, attaching it to the BN return value at full confidence. Functions without an explicit return location, or whose location cannot be resolved, keep the calling-convention-derived return as before.
A segment that covers the exact same address range as an existing section is the same region under a possibly different name, so skip it rather than add a duplicate. We still avoid an overlap-based check, which would wrongly suppress every segment because the loader maps the whole address space.
The lowest-segment rebasing fix is format agnostic; reword its comment so it no longer reads as Mach-O specific. The first section starting after the format headers (Mach-O load commands, PE headers, ELF program headers) is a general property, and aligning to the lowest segment matches IDA's image base regardless of format.
idb-rs now parses a bare unknown type (unspecified size) instead of erroring, so handle it here: a zero-width unknown has no integer representation, so map it to void rather than constructing a zero-width int.
|
This is great thank you for the PR! |
|
Is there anything you need me to do? I am trying to transition to BN from using IDA Pro as my main RE tool so I expect more PRs to come for this plugin. |
|
Nope! We will have this reviewed and go from there, thanks for the PR! |
emesare
left a comment
There was a problem hiding this comment.
There are a few issues, if you do not have time to address them I can take over, thank you for your time.
| // over-shifts every imported address by the header size. Segments include the | ||
| // header region, so the lowest segment matches IDA's image base regardless of | ||
| // format. | ||
| let bn_segment_addr = view |
There was a problem hiding this comment.
In all of the binaries I have tested, this is not the case, for example, loading this binary you will see with the change to use segments we no longer have a correct base address to map with.
If you can provide a sample that proves otherwise we can discuss how best to go about solving this for both scenarios.
| } | ||
| } | ||
|
|
||
| // TODO: The below undo and ignore is not thread safe, this means that the mapper itself |
There was a problem hiding this comment.
There is a reason this was left as a TODO instead of a NOTE, we do not consider this solved and we would like to fix this behavior, not leave it as-is.
| if let Ok(_used_types) = til_translator.used_types.lock() { | ||
| used_types = _used_types.clone(); | ||
| } | ||
| // TODO: Adding types to view after the types have been applied to the functions is not a |
There was a problem hiding this comment.
Again, there is a reason this was left as a TODO.
| } | ||
|
|
||
| // NOTE: A further lookup source would be the TILs attached to the IDB (e.g. imported | ||
| // library TILs) before giving up; not yet wired in. |
There was a problem hiding this comment.
Again, there is a reason this was left as a TODO.
| // analysis does not fall through past calls to them. | ||
| if func.is_no_return { | ||
| tracing::debug!("Marking function as no-return: {:0x}", func.address); | ||
| bn_func.set_auto_can_return(false); |
There was a problem hiding this comment.
Not entirely sure this will be persisted, will need to save a database with a function this applies with and check.
| let func = functions_by_address.get(&rebased).cloned().or_else(|| { | ||
| view.functions_at(rebased) | ||
| .iter() | ||
| .find(|func| func.start() == rebased) |
There was a problem hiding this comment.
What is the point of this? functions_at queries based off the start address of the function.
| pub exports: Vec<ExportInfo>, | ||
| /// Processor register names indexed by IDA register number, used to resolve the registers | ||
| /// referenced by argument/return value locations into Binary Ninja registers. | ||
| pub register_names: Vec<String>, |
There was a problem hiding this comment.
If this is indexed by IDA register number the assumption here is there are no gaps and the index starts at 0, I think it would be useful to add that invariant as a comment for future reference.
| /// components. | ||
| #[derive(Debug, Clone, Serialize)] | ||
| pub enum FunctionFolderEntry { | ||
| Function(u64), |
There was a problem hiding this comment.
Use function::Location in places where we expect to "unique" a location for a function, since our definition of a function location is rooted in the absolute address and its architecture.
| Ok(TypeBuilder::bool()) | ||
| } | ||
| Basic::Bool if self.bool_size == 1 => Ok(TypeBuilder::bool()), | ||
| // Binary Ninja's `bool` is always a single byte; a wider `bool` is represented as an |
There was a problem hiding this comment.
We need to instead make an NTR to int with a more descriptive name then, as the comment suggests.
| // A `__ptr32` / `__ptr64` modifier overrides the platform address size for this pointer | ||
| // (e.g. a 32-bit pointer embedded in an otherwise 64-bit binary). | ||
| // | ||
| // NOTE: `closure` (based pointers) and `shifted` pointers have no direct Binary Ninja |
There was a problem hiding this comment.
Based pointers do exist in Binary Ninja, although we do not expose simple constructors for them and you must use the TypeBuilder::set_pointer_base function.
Substantially expands what the IDB importer brings over from an IDB/I64 and applies to the Binary Ninja database, plus correctness and type-translation improvements. Verified end-to-end against a large arm64 database.
New data imported
Type translation fidelity
Integrity & correctness
Cleanup
types.rstranslator (superseded by the active translator) and resolves the plugin's outstanding TODOs.