Elements of Programming Languages
Compiler
Compilers translate one programming language (usually a high-level language) into another language to be executed later. Compilation generally happens either on command (e.g., by running a makefile to compile code), just in time (JIT), or continuously at runtime via an interpreter. Code does not need to compile all the way down to machine language; it can instead compile to an intermediate language, which can be compiled down further later into a language specific to the platform it is running on.
These are the steps for the usual compilation process:
- Lexical Analysis: When code arrives at the compiler, it is essentially one long string. The lexer breaks that string into small pieces known as tokens. These tokens classify everything in the code base, be it a keyword, a variable name, an operator, a string, etc.
- Parsing: The parser applies the grammar rules of the language to the tokens. It builds a hierarchical data structure (such as a parse tree or abstract syntax tree) from the tokens and ensures the syntax is correct. At this point, the compiler knows whether something is a function definition, a call to a function, an if statement, etc.
- Semantic Analysis: This step involves the heavy lifting of figuring out what the code means. Type checking, object binding, and definite assignment may happen here, depending on the language. Many of the differences between languages show up at this stage: implicit returns, closures, hoisting, scoping rules.
- Code Optimization: Code gets optimized to run faster and use less memory, e.g., by folding constant expressions or eliminating dead code.
- Code Generation: Code gets packed up into an executable to be run.
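The first step above, lexical analysis, can be sketched as a toy lexer. This is a minimal illustration, not any real compiler's implementation; the token categories and the tiny set of operators here are assumptions made for the example.

```ruby
# A minimal sketch of lexical analysis: split a source string into
# (category, text) token pairs. Purely illustrative.
KEYWORDS = %w[if else def end].freeze

def lex(source)
  source.scan(/\d+|[A-Za-z_]\w*|[+\-*\/=()]/).map do |text|
    kind = if KEYWORDS.include?(text)       then :keyword
           elsif text =~ /\A\d+\z/          then :number
           elsif text =~ /\A[+\-*\/=()]\z/  then :operator
           else                                  :identifier
           end
    [kind, text]
  end
end

lex("if x = 3 + 4")
# => [[:keyword, "if"], [:identifier, "x"], [:operator, "="],
#     [:number, "3"], [:operator, "+"], [:number, "4"]]
```

The parser would then consume this flat token list and build a tree from it.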
Interpreted languages, such as Ruby, JavaScript, and Python, are usually contrasted with compiled languages. The main difference is that these languages are compiled anew every time they run. In Ruby on Rails or Node, for example, the server compiles the code when you start it, and that compiled code is held in memory until it is stopped; starting it back up requires recompiling. This process is slower than compiling once ahead of time and running the compiled artifact whenever you need it.
Memory Management
Call Stack and the Heap
The call stack is a stack data structure that contains the information for every executing function. The context of a function (all its local variables) is specific to that function's frame. It's a stack because nested function calls finish in LIFO order. Consider:
def func1
  func2
  p "I'm func1"
end

def func2
  func3
  p "I'm func2"
end

def func3
  p "I'm func3"
end

func1
func1 enters the call stack first, then func2, then func3. func3 finishes first, so it's the first one off of the stack.
What's really important to know about the call stack is that its memory is deallocated automatically once a function completes (for the most part, at least). This is in contrast to the heap, which is essentially your global data store; most of the objects you create end up there. Anything stored on the heap must be explicitly removed. Most languages use a garbage collector to do so, although some (C and its derivatives) don't have a garbage collector, and memory must be deallocated manually by the programmer.
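The stack/heap distinction can be sketched in Ruby. The global variable name below is an assumption made for the example; the point is that the stack frame (and the local name in it) disappears on return, while the heap object survives as long as something still references it.

```ruby
# Sketch: a local variable lives in the method's stack frame and goes
# away when the method returns, but the String object it pointed at
# lives on the heap and survives while anything still references it.
$kept = nil  # a global acts as a long-lived reference (a GC "root")

def make_greeting
  local = "hello"   # `local` lives in this call's stack frame
  $kept = local     # the heap object gains a second reference
  local.upcase      # after returning, the frame and `local` are gone
end

make_greeting  # => "HELLO"
$kept          # => "hello" -- still reachable, so not collected
```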
Garbage Collection
A garbage collector will try to figure out which objects and values in the memory heap are no longer needed and can be removed. This is a hard but important problem to solve, since it can have a huge impact on overall performance.
Tracing is by far the most common strategy for garbage collection. Tracing works by searching for objects that are reachable from a set of root objects. Anything in the heap that isn't reachable will be garbage collected. How it does that exactly varies widely by language and implementation, and can get very complicated (but in interesting ways).
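The reachability search at the heart of tracing can be sketched as a breadth-first walk from the roots. The Hash-based object graph below is a stand-in for a real heap; it is illustrative only.

```ruby
# A minimal sketch of the "mark" phase of a tracing collector:
# breadth-first search from the roots; anything left unvisited
# is unreachable and would be collected.
def reachable(roots, graph)
  seen = []
  queue = roots.dup
  until queue.empty?
    obj = queue.shift
    next if seen.include?(obj)
    seen << obj
    queue.concat(graph.fetch(obj, []))
  end
  seen
end

# Toy heap: each key references the listed objects.
graph = { a: [:b], b: [:c], d: [:a], e: [] }
live    = reachable([:a], graph)  # => [:a, :b, :c]
garbage = graph.keys - live       # => [:d, :e] -- unreachable
```

Note that `:d` references the live object `:a`, but nothing references `:d`, so it is still garbage; reachability only flows outward from the roots.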
Reference counting is an alternative to tracing. Instead of occasionally running a trace, a count of references will be maintained for each object. Objects whose count drops to 0 are deallocated.
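Reference counting can be sketched with a toy class that tracks its own count. This is illustrative only (Ruby's actual collector is a tracing one, not reference-counted), and the `retain`/`release` names are borrowed conventions, not a real API.

```ruby
# A toy reference-counted object: hitting a count of zero triggers
# immediate "deallocation" (modeled here as a flag).
class Counted
  def initialize
    @count = 0
    @freed = false
  end

  def retain
    @count += 1
  end

  def release
    @count -= 1
    @freed = true if @count.zero?  # no references left: deallocate now
  end

  def freed?
    @freed
  end
end

obj = Counted.new
obj.retain           # first reference
obj.retain           # second reference
obj.release
obj.freed?  # => false -- one reference remains
obj.release
obj.freed?  # => true  -- count hit zero, object deallocated
```

The appeal is that reclamation happens immediately and predictably; the classic weakness is that two objects referencing each other never reach a count of zero, so cycles leak without extra machinery.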
Types
Static vs Dynamic typing
All expressions in statically typed languages have their types determined before the program is executed, usually at compile time. Statically typed languages are further subdivided into manifestly typed and type-inferred languages. Manifestly typed languages, such as Java and C++, require the programmer to explicitly state the types of values, of function arguments, and of return values. Type-inferred languages, such as Haskell and Swift, infer types from the context of the expression. Either way, types are determined at compile time, catching errors then instead of at runtime, but with inference the programmer doesn't always have to explicitly state what everything is.
Dynamically typed languages determine type safety at runtime. Errors will not be caught until a piece of code actually executes, but the programmer doesn't need to write type annotations on expressions. Dynamic typing also allows a single variable to refer to values of different types, which is useful for duck typing. Ruby, Python, and JavaScript are all dynamically typed.
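Both points, a variable holding different types and errors surfacing only at runtime, show up in a small Ruby example. The class and method names here are invented for illustration.

```ruby
# Duck typing: anything that responds to #speak works; no annotations,
# and type errors surface only when the offending line actually runs.
class Duck
  def speak
    "quack"
  end
end

class Robot
  def speak
    "beep"
  end
end

def announce(thing)
  thing.speak  # no declared type; dispatched at runtime
end

x = Duck.new
announce(x)    # => "quack"
x = Robot.new  # the same variable now holds a different type
announce(x)    # => "beep"
# announce(42) would raise NoMethodError -- but only when this line runs
```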
The debate over static vs dynamic typing can be fierce. Neither is necessarily better than the other; it largely comes down to the preferences and needs of the programmer.
Weak vs Strong typing
In some contexts, the term "strong typing" may refer to static typing (so watch out). In the context of weak vs strong typing, it means that an error is raised if you attempt to perform an operation on the wrong type of value. A weakly typed language will instead try to coerce a value to permit the operation. In JavaScript, a weakly typed language, the expression 3 + "3" evaluates to "33": 3 gets coerced to a string, and the values are concatenated. In Ruby, a strongly typed language, the same expression raises an error.
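The Ruby side of that comparison looks like this; any coercion has to be asked for explicitly.

```ruby
# Strong typing in Ruby: mixing Integer and String raises instead of
# silently coercing; conversions must be explicit.
begin
  3 + "3"
rescue TypeError => e
  e.message  # e.g. "String can't be coerced into Integer"
end

3 + "3".to_i  # => 6    -- explicit String-to-Integer conversion
3.to_s + "3"  # => "33" -- explicitly doing what JavaScript does implicitly
```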