In this article, we explain what compilation and code interpretation are, and how the two approaches differ and what they have in common. If you're eager to broaden your horizons and become a better programmer, we invite you to read on.

 

Basic Concepts

Because these terms are often confused, we suggest starting with a few definitions that we gathered from the Internet. That way, we can be sure we mean the same thing by each term.

  • Source Code – detailed instructions of a computer program using a specific programming language, describing the operations that the computer should perform on collected or received data. Source code is the result of a programmer's work and allows the structure and functionality of a computer program to be expressed in a human-readable form. It is usually stored in a text file.
  • Compiler – a program used for automatically translating code written in one language (source language) into equivalent code in another language (target language). This process is called compilation. In computer science, a compiler most commonly refers to a program that translates source code in a programming language into machine code. Some compilers first translate to assembly language, and then assembly language is translated to machine code by a so-called assembler (e.g., the GCC compiler uses the GAS assembler).
  • Assembler – a computer science term related to programming and creating machine code for processors. It refers to a program that generates machine code from source code written in a low-level programming language. This language, based on basic processor operations, is called assembly language. Strictly speaking, then, the language is assembly language and the program that translates it is an assembler, although colloquially both are often called "assembler."
  • Binary File – a file with arbitrary content, encompassing all files except text files (executables, music, videos). Binary files are usually treated as a sequence of bytes (groups of eight bits, also called octets) that are meant to be interpreted as something other than text characters. Binary files cannot be sensibly edited in a text editor, because such programs assume that the file contains text and interpret its bytes as characters and control codes. They can be viewed and edited using a hexadecimal editor (e.g., xxd, ghex) or the hex view of a tool such as IDA.
    Not every executable file is binary (e.g., a Python script can be executable but is a text file). Not every binary file can be run on its own (e.g., a compiled library such as a .lib or .dll is a binary file but cannot be launched as a program). A file with the .exe extension in Windows is both executable and binary.
  • Object File – a binary file generated by a compiler or assembler during the compilation of a source code file or during the linking of object files by a linker. For example, in C++, one object file is typically created from one .cpp file.
  • Linking – a process of combining compiled modules (files containing object code or static library files) and creating an executable file or, less frequently, another object file. Additionally, during linking, appropriate headers and information characteristic of a specific executable file format can be included in the resulting file. The tool used for linking is called a linker.
  • Executable File – a file that can be run directly in an operating system environment. It contains instructions in a form that allows its execution by a computer. In Windows, these can be files with the .exe extension. Another definition states that executable files are files that can be run as processes in a system, though this is imprecise: a script, for example, runs only through an interpreter, and it is the interpreter that becomes the process.
  • Machine Language, Machine Code – a set of processor instructions in which a program is expressed in the form of binary numbers constituting instructions and their arguments. Machine code can be generated during the compilation process (for high-level languages) or assembly (for low-level languages). Often, the machine code generated from each source file is first saved in an object file in relocatable form, with final addresses not yet fixed. During linking, this code is combined with code from other object files and libraries to create the final machine code, which is stored in an executable file.
  • Library – a file providing functionality, data, and data types that can be used from the level of a program's source code. Using libraries is a way to reuse the same code.
    Depending on when they are included in the program, we distinguish between static libraries (e.g., .lib), which are included during linking, and dynamic libraries (e.g., .dll), which are loaded when the program starts or while it is running.

 

Compilation

Simply put, compilation is the process of translating a program written in one language into another language, most often machine code. A processor cannot directly execute the programming languages that humans read and write. A compiler therefore acts as a "translator" that produces a file in machine code, which is understood by a specific processor of a given architecture.

Figure 1 - High-level depiction of compilation (Source: EngMicroLectures).

 

In other words, there are two versions of your program. One is the source file (e.g., "test.c"), which you understand and the computer does not. The other is the machine code version, which the computer understands and you do not. So, a compiler is a "magical program" that translates "human-readable code" into "computer-readable code." When the program is launched, it is loaded from disk into main memory (RAM) and then executed by the processor.

Figure 2 - Program execution on the CPU, reading instructions from computer memory.

 

Is it really that simple?

Of course, the models above are a significant simplification. A closer look shows that compilation consists of several distinct steps, as illustrated in the diagram below.

Figure 3 - A more detailed representation of compilation based on C++ (Source: Stackoverflow).

 

The preprocessor is a program that runs as the first stage of the compilation driver. It processes the source code according to specific rules called directives and produces source code that is ready for compilation. The C++ standard fully describes the correct operation of the preprocessor.

Next comes the compiler proper, a program that converts the high-level language into assembly code tailored to a specific architecture. The assembler then turns each file into an object file: binary machine code that cannot be run yet, because references to code and data in other files and libraries are still unresolved.

In the end, an executable is created, a binary file that can be run on a specific system with a particular processor.

 

Example

Let's consider the following example to familiarize ourselves more closely with the practical functioning of a compiler. We'll use the C language and the GCC compiler on a Linux system.

Figure 4 - This is a simple program written in C. The program uses a function, two constants, and an argument to print "Hello Alex" on the screen.

Figure 5 - Preprocessor output after executing instructions with the -E and -P switches.

 

We stop after preprocessing. The constants have been replaced by their values, while variables are left untouched. The #include <stdio.h> directive has also been replaced by the contents of that header, so the function declarations it provides now appear directly in the file.

Figure 6 - Compilation to assembly language.

 

For the next step, let's use GCC to compile our program into assembly language. The -S option tells GCC to stop after generating an assembly file.

Figure 7 - Contents of the test.s file (x86 assembly).

 

Let's generate the assembly once more, this time choosing the output file name with the "-o" parameter (for example, gcc -S test.c -o testASM.s). With the -c option, we go one step further and turn testASM.s into an object file, test.o (for example, gcc -c testASM.s -o test.o). GCC can produce a .o file from either an assembly file or a C file. Strictly speaking, turning assembly into object code is called assembling rather than compiling.

Figure 8 - Compilation to assembly language, then to an object file.

Figure 9 - Contents of test.o - Text is unreadable in a regular editor.

 

The generated files can be examined using a hex editor. Here's the "test.o" file, which takes up 100 lines in the hex view.

Figure 10 - Object file in a hexadecimal editor.

 

Now let's link the .o file into an executable file and try to run the program. The linker resolves the dependencies between the individual files and libraries and merges everything into a single executable file.

Figure 11 - End of the compilation process, program execution.

 

In a hex editor, you can see that the executable file, with its dependencies included, takes up around 1000 lines, about ten times as many as the 100 lines of the object file.

 

Summary

We hope we have explained clearly how compilation works. Next week, we will continue with topics related to low-level program operations and file formats. If you are interested, feel free to join us!

 

Sources: