A Brief Introduction to Programming Languages and How to Make One
Programming languages are a broad topic that sits at the heart of computer science. We use them every day to express algorithms that solve problems, so wouldn't it be interesting to learn how these languages work, and then explore how to make one for ourselves?
Before we begin, we need to understand what a programming language is. As per Wikipedia, a programming language is a system of notation for writing computer programs; it is a way for us as programmers to communicate an algorithm to a computer, which can then execute it and return a result. However, a computer isn't a human and can only understand a series of 0s and 1s. So how do high-level programming languages such as Java and Python work?
Compilers vs Interpreters
We can "translate" high-level code into machine code in two main ways. Let's begin with compilers.
Compilers translate the entire high-level program into machine code in one go, producing an executable file that the hardware can run independently of the source code, which makes compiled programs very fast at run time. Another benefit of compilers is that errors can be found at compile time, before execution, which allows many types of errors to be caught early.
On the other hand, interpreters translate high-level code line by line and execute it immediately. One advantage is lower memory use, since there is no executable file to store. Another is that interpreters give immediate feedback when they encounter an error, which can be useful during software development.
Interpreters are often used in scripting languages (a programming language used for automating tasks), whereas compilers are preferred when high performance is required. Ultimately, it depends on the specific requirements of the programmer.
Optimizations
Some languages, such as Java and C#, use a combination of compilation and interpretation. The source code is first compiled into an intermediate representation such as bytecode, which is then interpreted or compiled again. An advantage of this is performance: executing bytecode is more efficient than interpreting the source code line by line.
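Python itself works this way: CPython compiles source code to bytecode before its virtual machine interprets it, and the standard-library `dis` module lets you see that bytecode for yourself (the exact instruction names vary between Python versions):

```python
# CPython compiles source to bytecode before interpreting it.
# The standard-library dis module disassembles that bytecode.
import dis

def add(a, b):
    return a + b

# Prints the bytecode instructions for add, such as LOAD_FAST and a
# binary-add instruction (exact names vary by Python version).
dis.dis(add)
```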
Languages like Java also add a layer of abstraction over the hardware by running on a virtual machine: software that emulates the hardware of a physical computer. The Java Virtual Machine is responsible for executing bytecode. The advantage of this abstraction is platform independence, because the bytecode is portable. The program is unaware of the underlying hardware architecture, so the virtual machine can provide a consistent environment for Java programs regardless of the platform they run on; hence Java programs are "write once, run anywhere".
The translation process
I would now like to briefly explain how this translation process works under the hood. First, we break the source code into tokens; this process is called lexing, or scanning. Each token is a meaningful piece of syntax that abides by the rules of the language we have defined. In code, a scanner is essentially a large switch statement that recognizes a series of characters as an accepted piece of syntax in the programming language.
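As a rough sketch (the tiny arithmetic language and the `(kind, text)` token shape are my own illustration, not a fixed standard), a scanner can look like this:

```python
# Minimal scanner sketch for a tiny arithmetic language (illustrative only).
# It turns a source string into a list of (kind, text) token pairs.
def scan(source):
    tokens = []
    i = 0
    while i < len(source):
        c = source[i]
        if c.isspace():                # skip whitespace between tokens
            i += 1
        elif c.isdigit():              # a run of digits is one NUMBER token
            j = i
            while j < len(source) and source[j].isdigit():
                j += 1
            tokens.append(("NUMBER", source[i:j]))
            i = j
        elif c in "+-*/()":            # single-character operator tokens
            tokens.append(("OP", c))
            i += 1
        else:
            raise SyntaxError(f"unexpected character {c!r} at position {i}")
    return tokens

print(scan("1 + 2 * 3"))
# [('NUMBER', '1'), ('OP', '+'), ('NUMBER', '2'), ('OP', '*'), ('NUMBER', '3')]
```

A real scanner would also track line numbers for error reporting and handle multi-character tokens such as identifiers and keywords, but the shape is the same: a loop that inspects the next character and decides which token it starts.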
Now that we have this series of tokens, we need to parse them. In the scanner we just created, the "alphabet" of the lexical grammar is the set of individual characters we input, and each "string" is a token we produce. In the syntactic grammar, the "alphabet" is the set of tokens we receive from the scanner, and a string is an expression built from those tokens. This is the parser's job: to take a series of tokens and generate an expression. In the same way we defined what a valid token was in our scanner, we need to define what a valid expression is for our parser. To do that we use the good old Backus–Naur Form (BNF), a meta-language used to unambiguously define the rules of a programming language. I think the best way to explain BNF is by showing it, so here is an example:
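A small recursive grammar for arithmetic expressions (my own sketch, in the same spirit as the book's examples) might read:

```bnf
<expression> ::= <term>   | <expression> "+" <term>   | <expression> "-" <term>
<term>       ::= <factor> | <term> "*" <factor>       | <term> "/" <factor>
<factor>     ::= NUMBER   | "(" <expression> ")"
```

Notice how the rules refer to themselves: an expression can contain smaller expressions, which is what lets three short rules describe infinitely many valid programs. Splitting `expression` and `term` into separate rules is also what encodes operator precedence, so `*` binds tighter than `+`.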
Using the power of recursion, BNF can represent all the possible valid expressions in a language. Since the grammar is recursive, the natural data structure for implementing these rules is a syntax tree. Here is an example of a syntax tree we can build from such a grammar. Remember that the job of the parser is only to generate an expression from a series of tokens, not to evaluate anything.
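Sticking with the arithmetic example, the node classes and the tree for `1 + 2 * 3` could be sketched like this (`Num` and `BinOp` are illustrative names of my own, not a standard API):

```python
# Illustrative syntax-tree nodes for an arithmetic language (my own sketch).
class Num:
    def __init__(self, value):
        self.value = value

class BinOp:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

# The parser would turn the tokens of "1 + 2 * 3" into this tree.
# Precedence is encoded in the shape: the "*" node sits *below* the
# "+" node, so a later evaluation pass will compute it first.
tree = BinOp("+", Num(1), BinOp("*", Num(2), Num(3)))
```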
Now that we have scanned and parsed the source, before we can evaluate the tree we need to perform static analysis. When we analyze an expression such as a * b, we need to find where the variables a and b were defined and wire each use to its definition. While we do this, we can make sure that both variables are being accessed within the scope in which they were defined. And now that we know where both variables are defined, we can check whether their types support being multiplied together; if not, we can report a type error.
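As a self-contained sketch (tuples stand in for tree nodes, and the `scope` dictionary mapping names to type names is my own simplification of a real symbol table), a static-analysis pass might look like:

```python
# Sketch of a static-analysis pass (illustrative). Expressions are tuples:
# ("num", value), ("var", name), or ("binop", op, left, right).
def check(expr, scope):
    """Return the expression's type, raising on undefined names or type errors."""
    kind = expr[0]
    if kind == "num":
        return "int"
    if kind == "var":
        name = expr[1]
        if name not in scope:                 # scope check: is it defined here?
            raise NameError(f"undefined variable {name!r}")
        return scope[name]                    # wire the use to its definition
    if kind == "binop":
        _, op, left, right = expr
        lt, rt = check(left, scope), check(right, scope)
        if lt != "int" or rt != "int":        # type check: '*' needs two ints
            raise TypeError(f"cannot apply {op!r} to {lt} and {rt}")
        return "int"
    raise ValueError(f"unknown node kind {kind!r}")

scope = {"a": "int", "b": "string"}
print(check(("binop", "*", ("var", "a"), ("num", 3)), scope))  # int
```

Trying the same check on `b * 3` with `b` bound to a string would raise a `TypeError` before any evaluation happens, which is exactly the point of doing this analysis statically.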
Finally, all we have to do is evaluate the tree. We can do this using a depth-first search: an algorithm where we go as far down a branch as possible before backtracking and exploring another branch, until the entire tree has been explored.
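Using the same tuple representation as before (my own illustrative encoding), a tree-walk evaluator is a short recursive function: it visits both children depth-first, then applies the operator on the way back up.

```python
# Tree-walk evaluator sketch (illustrative): a depth-first, post-order
# traversal that evaluates both children before applying the operator.
def evaluate(expr):
    kind = expr[0]
    if kind == "num":
        return expr[1]
    _, op, left, right = expr                  # ("binop", op, left, right)
    l, r = evaluate(left), evaluate(right)     # recurse to the leaves first
    if op == "+": return l + r
    if op == "-": return l - r
    if op == "*": return l * r
    if op == "/": return l / r
    raise ValueError(f"unknown operator {op!r}")

# 1 + 2 * 3: the "*" subtree is evaluated before "+" is applied.
tree = ("binop", "+", ("num", 1), ("binop", "*", ("num", 2), ("num", 3)))
print(evaluate(tree))  # 7
```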
Error handling
Before I wrap up, I would like to highlight the importance of error handling. There are so many things that can go wrong, such as syntax errors, type errors, and runtime errors, and it would be very annoying for a language to crash and burn every time we make a mistake. Good error handling ensures that the interpreter can provide meaningful feedback to the user, highlighting the location and nature of the error. This not only helps in debugging but also improves the overall user experience. Syntax errors, for example, should be detected during the parsing phase, with clear messages indicating unexpected tokens or misplaced symbols. Type errors should be caught during static analysis, ensuring operations are performed on compatible data types. Runtime errors, such as division by zero or null reference exceptions, should be handled in the evaluating phase to prevent the interpreter from crashing and to allow the program to recover or terminate gracefully. Overall, good error handling makes the language more resilient and user-friendly.
Conclusion
I have ended up writing quite a long article; however, I feel that this was the only way to do justice to such an interesting topic. I hope that if you have made it this far, you have more clarity about what goes on under the hood when you press the run button in your IDE. That said, there are a lot of nuances I have not covered, so I would highly recommend reading the book Crafting Interpreters to learn more.