Overview of Mach-O binary

Hacking 101

Mach-O is the binary format Apple uses across its systems. A Mach-O file contains machine code, data, and metadata. It is organized by a list of load commands, each of which holds the offsets and sizes needed to locate its contents.

At offset 0 lies a header structure, struct mach_header, containing general information about the binary. (64-bit binaries use struct mach_header_64, which appends one reserved uint32_t field.)

struct mach_header {
    uint32_t magic;
    cpu_type_t cputype;
    cpu_subtype_t cpusubtype;
    uint32_t filetype;
    uint32_t ncmds;
    uint32_t sizeofcmds;
    uint32_t flags;
};

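Valid magic values are 0xfeedface (MH_MAGIC) for the 32-bit format and 0xfeedfacf (MH_MAGIC_64) for the 64-bit format. When the file's byte order is the opposite of the reader's, the value appears byte-swapped: 0xcefaedfe and 0xcffaedfe respectively.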
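
As a minimal sketch of the check described above, the function below classifies a magic value that has already been read as a host-order uint32_t. The constants mirror MH_MAGIC, MH_MAGIC_64, MH_CIGAM, and MH_CIGAM_64 from loader.h; the macho_* names and the result struct are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from <mach-o/loader.h> so the sketch builds on any platform. */
#define MACHO_MAGIC_32 0xfeedfaceu  /* MH_MAGIC    */
#define MACHO_MAGIC_64 0xfeedfacfu  /* MH_MAGIC_64 */
#define MACHO_CIGAM_32 0xcefaedfeu  /* MH_CIGAM    */
#define MACHO_CIGAM_64 0xcffaedfeu  /* MH_CIGAM_64 */

/* Hypothetical result type used only in this example. */
typedef struct {
    int valid;    /* 1 if the value is one of the four magics above */
    int is_64;    /* 1 for the 64-bit format */
    int swapped;  /* 1 if the reader must byte-swap the header fields */
} macho_magic_info;

static macho_magic_info macho_classify_magic(uint32_t magic) {
    macho_magic_info info = {0, 0, 0};
    switch (magic) {
    case MACHO_MAGIC_32: info.valid = 1; break;
    case MACHO_MAGIC_64: info.valid = 1; info.is_64 = 1; break;
    case MACHO_CIGAM_32: info.valid = 1; info.swapped = 1; break;
    case MACHO_CIGAM_64: info.valid = 1; info.is_64 = 1; info.swapped = 1; break;
    default: break;  /* not a thin Mach-O file */
    }
    return info;
}
```

A reader would call this on the first four bytes of the file and, if swapped is set, byte-swap every later header field before using it.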

cputype and cpusubtype declare which platform the binary can be loaded on (that is, which instruction set the file contains). The ones we will see most often are x86, x86_64, arm64, and arm64e. 32-bit ARM variants such as armv7 and armv7s also exist, but Apple has dropped support for them: iOS 11 no longer runs 32-bit binaries.

filetype denotes the type of binary, for example an executable (MH_EXECUTE), a dynamic library (MH_DYLIB), or an object file (MH_OBJECT).

ncmds and sizeofcmds declare the number of load commands and their total size in bytes. The size is required because load commands vary in length by type. sizeofcmds is also checked when the binary is loaded, and the load fails if it is incorrect.

flags is a bit mask carrying extra information, e.g. MH_PIE for a position-independent executable.
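
Testing a flag is a plain bit-mask check. The sketch below assumes the flag values from loader.h (MH_NOUNDEFS, MH_DYLDLINK, MH_TWOLEVEL, MH_PIE); the MACHO_ prefix and the helper name are only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Flag bits copied from <mach-o/loader.h>. */
#define MACHO_MH_NOUNDEFS 0x1u       /* no undefined references */
#define MACHO_MH_DYLDLINK 0x4u       /* input for the dynamic linker */
#define MACHO_MH_TWOLEVEL 0x80u      /* two-level namespace bindings */
#define MACHO_MH_PIE      0x200000u  /* position-independent executable */

/* Returns 1 if every bit of `mask` is set in `flags`, 0 otherwise. */
static int macho_has_flag(uint32_t flags, uint32_t mask) {
    return (flags & mask) == mask;
}
```

For instance, a flags value of 0x00200085 combines all four bits above, so macho_has_flag(0x00200085, MACHO_MH_PIE) is true.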

Load command

Each load command has the same shape: a type field cmd, the command size cmdsize, and then data specific to that command.
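
Because every command starts with cmd and cmdsize, the list can be walked without knowing each command's layout. The following is a rough sketch, assuming the commands are already in memory in host byte order. The function name is hypothetical, and requiring the commands to fill sizeofcmds exactly is a strictness choice made for this example, not a claim about what the real loader enforces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors struct load_command from <mach-o/loader.h>. */
typedef struct {
    uint32_t cmd;      /* type of load command */
    uint32_t cmdsize;  /* size of this command in bytes, these 8 included */
} macho_load_command;

/*
 * Walks `ncmds` load commands stored in `cmds` (`sizeofcmds` bytes) and
 * counts those whose type equals `wanted_cmd`.
 * Returns the count, or -1 if the list is malformed.
 */
static int macho_count_load_commands(const uint8_t *cmds, uint32_t ncmds,
                                     uint32_t sizeofcmds, uint32_t wanted_cmd) {
    assert(cmds != NULL);
    uint32_t offset = 0;  /* invariant: offset <= sizeofcmds */
    int count = 0;
    for (uint32_t i = 0; i < ncmds; i++) {
        macho_load_command lc;
        if (sizeofcmds - offset < sizeof lc)
            return -1;                          /* command header would overrun */
        memcpy(&lc, cmds + offset, sizeof lc);  /* copy to avoid unaligned reads */
        if (lc.cmdsize < sizeof lc || lc.cmdsize > sizeofcmds - offset)
            return -1;                          /* too small, or runs past the end */
        if (lc.cmd == wanted_cmd)
            count++;
        offset += lc.cmdsize;                   /* cmdsize is the only way to find the next one */
    }
    return offset == sizeofcmds ? count : -1;
}
```

The cmdsize check matters: a command smaller than its own 8-byte prefix would otherwise make the walk loop over the same bytes or drift into unrelated data.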

There are many types of load commands; here we focus only on the segment, dynamic library, symbol, FairPlay, and code signature commands.

Segments are common to executables and libraries. They point to the regions of the file that hold code and data, the counterparts of .text or .data in other formats. In Mach-O binaries, a segment load command is followed by a series of section headers, each marking the start and size of its data. Common sections are: __text, __cstring, __const, __got, __la_symbol_ptr, __mod_init_func, __data, __bss. Section names are not restricted, but compilers follow established conventions. Each section's attributes are recorded in a bit-mask flags field.

One segment that has no sections is __LINKEDIT. It covers the last part of the binary, which holds the symbol table, the string table of symbol names, the list of exported symbols, and the binary's signature.

Each dynamic library is registered through a load command containing the path to the library. The path can be either absolute or relative. Resolving an absolute path is straightforward. A relative path takes one of two forms: relative to the current directory, or rpath-style. Paths relative to the current directory are easy to understand; ./, ../ and the like are valid here.

rpath-style paths are a little different: the path starts with one of the variables @executable_path, @loader_path, or @rpath. @executable_path is replaced with the folder containing the main executable. @loader_path is replaced with the folder containing the binary that holds the load command (the image doing the loading). @rpath is resolved using the rpath load commands.

A Mach-O binary can carry many load commands that declare rpaths. Each entry must be an absolute path, a relative path, or a path using @executable_path, @loader_path, or @rpath. It is unclear whether rpaths can be stacked, but as a rule of thumb we should not use @rpath inside an rpath load command. A common rpath used by Apple is @executable_path/Frameworks, which can be seen in iPhone/iPad application binaries compiled with Xcode.
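
The substitution rules above can be sketched as plain string handling. This is an illustration only, not dyld's implementation: the function names, the fixed 1024-byte scratch buffer, and the choice to build a single candidate per rpath entry (with no check that the file exists) are all assumptions of the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* If `path` starts with `prefix` (followed by '/' or end of string), writes
 * `dir` plus the remainder into `out`.
 * Returns 1 on success, 0 if the prefix does not match, -1 if `out` is too small. */
static int replace_prefix(const char *path, const char *prefix, const char *dir,
                          char *out, size_t out_size) {
    size_t n = strlen(prefix);
    if (strncmp(path, prefix, n) != 0) return 0;
    if (path[n] != '/' && path[n] != '\0') return 0;
    int written = snprintf(out, out_size, "%s%s", dir, path + n);
    return (written >= 0 && (size_t)written < out_size) ? 1 : -1;
}

/* Expands @executable_path and @loader_path; any other path is copied as-is.
 * Returns 0 on success, -1 on overflow. */
static int macho_expand_path(const char *path, const char *exe_dir,
                             const char *loader_dir, char *out, size_t out_size) {
    assert(path && exe_dir && loader_dir && out);
    int r = replace_prefix(path, "@executable_path", exe_dir, out, out_size);
    if (r == 0) r = replace_prefix(path, "@loader_path", loader_dir, out, out_size);
    if (r == 0) {
        int written = snprintf(out, out_size, "%s", path);
        r = (written >= 0 && (size_t)written < out_size) ? 1 : -1;
    }
    return r == 1 ? 0 : -1;
}

/* Builds the candidate path for an @rpath install name using ONE rpath entry.
 * A loader would try each rpath entry in turn; that loop is left to the caller. */
static int macho_expand_rpath(const char *install_name, const char *rpath_entry,
                              const char *exe_dir, const char *loader_dir,
                              char *out, size_t out_size) {
    char base[1024];
    if (macho_expand_path(rpath_entry, exe_dir, loader_dir, base, sizeof base) != 0)
        return -1;
    return replace_prefix(install_name, "@rpath", base, out, out_size) == 1 ? 0 : -1;
}
```

With the rpath entry @executable_path/Frameworks and an executable in /Apps/Demo.app (a made-up location), the install name @rpath/Foo.framework/Foo expands to /Apps/Demo.app/Frameworks/Foo.framework/Foo.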

FairPlay encryption is a mechanism designed by Apple to encrypt app content with a device-bound key, so that the app cannot be run on another machine. A protected Mach-O binary has a load command (LC_ENCRYPTION_INFO, or LC_ENCRYPTION_INFO_64 for 64-bit) that records the offset and size of the encrypted range and the encryption status.

Because of how Apple designed FairPlay, we cannot recover the decryption key. However, we can dump the binary from memory, since it must be decrypted before it runs. Another method uses Apple's mmap handling of FairPlay-encrypted regions. Both are discussed in [[Fairplay]].

A code signature is present on signed binaries, produced by codesign with a distribution or development key. The signature data tells us about the signer and holds the hashes. The signature is encoded as PKCS#7/CMS SignedData, using the BER encoding of ASN.1 (X.690). It also contains the list of certificates in X.509 format and the signature digest. Apple currently uses RSA to sign its binaries.

The binary must be signed with a certificate chain rooted at the Apple CA; otherwise Apple devices reject installation. Apps distributed through the App Store are also signed with the App Store and device distribution certificates. For binaries that developers sign themselves, the Apple CA is still the root certificate, and the child is the developer certificate.

Symbols are encoded as a series of bytecodes, and a load command marks the regions that hold them. This command records the location of the non-lazy, lazy, and exported symbol information. Non-lazy symbols are looked up and written into the GOT when the binary is loaded; lazy symbols are looked up on first call through stubs, much like a PLT; exported symbols map names to addresses, such as the start of a function.

Non-lazy and lazy symbols are encoded as bind opcodes; exported symbols are encoded as a prefix trie. More detail on these is in [[Linker Info]].

The paragraphs above describe the current state of Mach-O symbol encoding. This was not always the case. A few years back (I don't know exactly when), binaries carried a list of symbols and a list of dynamic symbols in separate commands. The newer version of Mach-O adds a command id, LC_DYLD_INFO_ONLY, whose name indicates that it should not be used together with the legacy lists. The loader is expected to reject a binary that pairs this command with non-empty legacy (dynamic) symbol data.

The Mach-O related structures can be found and read on Apple’s cctools modules at include/mach-o/loader.h.


When code is compiled for use on systems running the Mach kernel (macOS, iOS, etc.), this code is organized using the Mach object (or Mach-O) file format. An executable format determines the order in which code and data in a binary file are read into memory. Code organized under this format includes compiled programs along with files with the .o, .dylib and .bundle extensions.

A Mach-O file consists of three major regions: a header, load commands, and segments. Each segment contains zero or more sections, and each section contains code or data of a particular type. Segments start on page boundaries; sections are not necessarily page-aligned. By convention, segment names are uppercase with two leading underscores (e.g., __TEXT), and section names are lowercase with two leading underscores (e.g., __text). For paging purposes, the header and load commands are considered part of the first segment. In an executable, this means they live at the start of the __TEXT segment, since that is the first segment containing data (__PAGEZERO contains no data and is neither readable nor writable).

Header

A structure identifying the file as a Mach-O binary. It contains general information about the file.

struct mach_header {
    uint32_t magic; /* Mach magic number identifier */
    cpu_type_t cputype; /* cpu specifier */
    cpu_subtype_t cpusubtype; /* machine specifier */
    uint32_t filetype; /* type of file */
    uint32_t ncmds; /* number of load commands */
    uint32_t sizeofcmds; /* size of all load commands */
    uint32_t flags; /* flags */
};

Load Commands:

Variable-size commands that specify the layout and linkage characteristics of the file. They can specify the initial layout of the file in virtual memory, the location of the symbol table, the initial execution state of the main thread, and the names of shared libraries for imported symbols. Load commands found in a typical executable:

  • __PAGEZERO segment load command
  • __TEXT segment load command
  • __DATA segment load command
  • __LINKEDIT segment load command
  • DYLD_INFO_ONLY load command: specifies the internal structure of the __LINKEDIT segment, giving the size and offset of the symbol export trie and of the bytecode streams interpreted by the macOS dynamic linker.
  • SYMTAB load command: symbol table (list of nlist_64 structs). Largely vestigial, but the string table is still used.
  • DYSYMTAB load command: specifies the offset of the indirect symbol table.
  • LOAD_DYLINKER load command: specifies the location of the dynamic linker, /usr/lib/dyld.
  • UUID Load Command: unique identifier for the executable.
  • VERSION_MIN_MACOSX load command: minimum version of OS X compatible with the executable (e.g., 10.13.0).
  • SOURCE_VERSION load command: version of the source code used to generate the executable.
  • MAIN load command: file offset of the program's entry point (the main function).
  • LOAD_DYLIB load command: one LOAD_DYLIB load command for every library to which the executable is dynamically linked.
  • FUNCTION_STARTS load command: offset and size of the function starts data in __LINKEDIT. Used by tools to determine whether a given address falls inside a function. Formatted as a zero-terminated sequence of DWARF-style ULEB128 values. The first value is the offset from the start of the __TEXT segment to the start of the first function. Each remaining value is the offset from the previous function's start to the next function's start.
  • DATA_IN_CODE load command: offset and size of a table that records the locations of data inlined in the __TEXT segment.
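
The FUNCTION_STARTS encoding described in the list above can be decoded with a few lines of C. This is a sketch under the stated format (ULEB128 deltas, zero-terminated); the function names and the fixed-capacity output array are choices made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one ULEB128 value at data[*pos] and advances *pos past it.
 * Returns 0 on success, -1 if the input is truncated or longer than 64 bits. */
static int read_uleb128(const uint8_t *data, size_t size, size_t *pos,
                        uint64_t *value) {
    uint64_t result = 0;
    unsigned shift = 0;
    while (*pos < size) {
        uint8_t byte = data[(*pos)++];
        if (shift >= 64) return -1;
        result |= (uint64_t)(byte & 0x7f) << shift;  /* low 7 bits carry data */
        if ((byte & 0x80) == 0) {                    /* high bit clear: last byte */
            *value = result;
            return 0;
        }
        shift += 7;
    }
    return -1;
}

/* Converts a function-starts payload into offsets from the start of __TEXT.
 * Stores at most `max_out` offsets in `out` but keeps counting beyond that.
 * Returns the number of functions found, or -1 on malformed input. */
static int macho_decode_function_starts(const uint8_t *data, size_t size,
                                        uint64_t *out, size_t max_out) {
    assert(data != NULL && out != NULL);
    size_t pos = 0, count = 0;
    uint64_t offset = 0;
    while (pos < size) {
        uint64_t delta;
        if (read_uleb128(data, size, &pos, &delta) != 0) return -1;
        if (delta == 0) break;  /* a zero value terminates the sequence */
        offset += delta;        /* each value is relative to the previous start */
        if (count < max_out) out[count] = offset;
        count++;
    }
    return (int)count;
}
```

For the bytes 90 1f 20 80 01 00, the decoded deltas are 3984, 32, and 128, giving function offsets 3984, 4016, and 4144 from the start of __TEXT.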

Segments

  • __PAGEZERO: A region located at address 0 with no protection rights assigned, so any access through a C NULL pointer crashes. It is one full VM page (4096 bytes, 0 to 0x1000) in 32-bit executables; 64-bit executables typically reserve the low 4 GB instead. Because it contains no data, it occupies no space in the file (its file size is 0).
  • __TEXT Segment: Read-only area containing executable code and constant data. Compiler tools create every executable with at least one read-only __TEXT segment. Because it is read-only, it can be mapped into memory just once and shared safely by all processes (mostly useful for frameworks and shared libraries, but also when the same executable runs several times simultaneously). Major sections:
      • __TEXT,__text: Executable machine code.
      • __TEXT,__stubs and __TEXT,__stub_helper: Helpers involved in calls to dynamically linked functions.
      • __TEXT,__cstring: Constant C-style (null-terminated) strings. Duplicate strings are removed by the static linker when building the final file.
      • __TEXT,__picsymbol_stub: Position-independent symbol stubs, which allow the dynamic linker to load a region of code at non-fixed virtual memory addresses.
      • __TEXT,__symbol_stub: Indirect symbol stubs.
      • __TEXT,__const: Initialized constant variables. All nonrelocatable const variables are placed here. Uninitialized constant variables are placed in a zero-filled section.
      • __TEXT,__literal4: 4-byte literal values, i.e. single-precision floating-point constants.
      • __TEXT,__literal8: 8-byte literal values, i.e. double-precision floating-point constants. It is sometimes more efficient to use immediate load instructions instead.
  • __DATA Segment: Contains writable data; the static linker sets the virtual memory permissions to allow both reading and writing. Because it is writable, the segment is logically copied for each process linking with the library and marked copy-on-write: when a process writes to one of these pages, it receives its own private copy of the page. Major sections:
      • __DATA,__data: Initialized mutable variables.
      • __DATA,__la_symbol_ptr: Lazy symbol pointers, which are indirect references to functions imported from a different file.
      • __DATA,__dyld: Placeholder section used by the dynamic linker.
      • __DATA,__const: Initialized relocatable constant variables.
      • __DATA,__mod_init_func: Module initialization functions (e.g., C++ static constructors).
      • __DATA,__mod_term_func: Module termination functions.
      • __DATA,__bss: Uninitialized static variables (e.g., static int i;).
      • __DATA,__common: Uninitialized imported symbol definitions (e.g., int i; located in the global scope).
  • __OBJC Segment: Contains data used by the Objective-C language runtime support library.
  • __IMPORT Segment: Contains symbol stubs and non-lazy pointers to symbols not defined in the executable. Generated only for executables targeting the IA-32 architecture. Sections:
      • __IMPORT,__jump_table: Stubs for calls to functions in a dynamic library.
      • __IMPORT,__pointers: Non-lazy symbol pointers, which are direct references to functions imported from a different file.
  • __LINKEDIT Segment: Contains raw data used by the dynamic linker: symbol, string, and relocation table entries.
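
When looking up one of the segment or section names listed above, one detail from loader.h is worth keeping in mind: the name fields are fixed 16-byte char arrays, and a name that is exactly 16 characters long (such as __gcc_except_tab) has no terminating NUL. Below is a small sketch of a safe comparison; the helper name is made up for this example.

```c
#include <assert.h>
#include <string.h>

#define MACHO_NAME_LEN 16  /* size of segname/sectname in <mach-o/loader.h> */

/* Compares a fixed-size name field against a NUL-terminated C string.
 * Returns 1 if they denote the same name, 0 otherwise. */
static int macho_name_equals(const char field[MACHO_NAME_LEN], const char *name) {
    assert(field != NULL && name != NULL);
    /* A longer string can never match, even if its first 16 bytes do. */
    if (strlen(name) > MACHO_NAME_LEN) return 0;
    /* strncmp stops at the first NUL or after 16 bytes, whichever comes
     * first, so it never reads past the end of the field. */
    return strncmp(field, name, MACHO_NAME_LEN) == 0;
}
```

Using strcmp directly on such a field would read past its end for 16-character names, which is why the bounded comparison is used here.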

Concepts:

  • Position-Independent Code (PIC): the same library code can be loaded at a location in each program's address space where it does not overlap other uses of memory, and it can execute at any memory address without modification. This is in contrast to absolute code (loaded at specific locations) and load-time locatable code (LTL, where the linker or loader modifies the program so that it can run only from a particular memory location).
  • Indirect addressing is a code generation technique that allows symbols to be defined in files separate from the files that reference them, so each can be modified independently. Symbol references are of two types: non-lazy (a plain symbol pointer, resolved by the dynamic linker when the module is loaded) and lazy (resolved on first use: the dynamic linker overwrites the lazy symbol pointer with the address of the function, and subsequent calls jump directly to the definition). A lazy symbol reference is composed of a symbol pointer and a symbol stub (a small piece of code that dereferences and jumps through the symbol pointer). Compilers generate lazy symbol references when they encounter calls to functions defined in other files.
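
The lazy-binding idea can be imitated in plain C with a function pointer that initially targets a resolver. This is only an analogy for the stub/pointer pair described above, not how dyld is implemented; every name in it is invented for the illustration.

```c
#include <assert.h>

static int resolver_calls = 0;  /* counts how often binding actually happens */

/* Stands in for a function defined in another library. */
static int real_add(int a, int b) { return a + b; }

static int add_binder(int a, int b);  /* forward declaration */

/* The "lazy symbol pointer": it starts out aimed at the binder,
 * not at the real function. */
static int (*add_lazy_ptr)(int, int) = add_binder;

/* Plays the role of the dynamic linker's binding helper: locate the target,
 * overwrite the pointer, then complete the original call. */
static int add_binder(int a, int b) {
    resolver_calls++;
    add_lazy_ptr = real_add;
    return real_add(a, b);
}

/* The "symbol stub": callers always jump through the pointer. */
static int add_stub(int a, int b) { return add_lazy_ptr(a, b); }
```

The first call to add_stub goes through the binder and rewrites the pointer; later calls reach real_add directly, so resolver_calls stays at 1 however many calls follow.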
