Software Engineering
In our introduction, we will cover all of these frameworks. It is important to know that all of these frameworks are constantly being developed further. As a result, programs can become outdated very quickly and may no longer run with newer versions of the frameworks. It can therefore be sensible to set up so-called virtual environments for individual projects and to record the versions of the frameworks and additional packages used in a file called `requirements.txt`.
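Such a `requirements.txt` simply lists the packages a project depends on, ideally with pinned versions. A minimal sketch (the packages and version numbers shown here are only illustrative):

```text
numpy==1.26.4
pandas==2.2.2
torch==2.3.0
```

The file can later be installed into a fresh environment with `pip install -r requirements.txt`.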
Furthermore, in scientific or business use, it is inevitable that other people will work on a project or continue to develop it. A small project can quickly become very complex, and one should use software development tools and principles from an early stage and familiarize oneself with these tools. The most important and central tool here is git, which enables version control and collaborative work on code. This code can then be shared with others, for example, via GitHub or GitLab. The University of Münster has its own GitLab for this purpose.
It is also advisable to familiarize oneself with a Linux environment. If you don't have a Linux computer yourself, a Unix environment is available to all members of the University of Münster via JupyterHub, which is specifically designed for scientific software and various programming languages. On macOS, it is recommended to use Homebrew (brew) to provide the full capabilities of the Unix system, and for Windows there is the Windows Subsystem for Linux (WSL). There is a diverse range of editors available for different operating systems to program with. As a programming environment for the incubAItor, Visual Studio Code is particularly suitable, as it can be easily extended. For deployment, it will also be important to provide applications as a complete system.
In summary, we will use the following tools:
- Unix system and Unix command line
- SSH
- Python (from version 3.6), package management pip, and virtual environments
- Jupyter notebooks on JupyterHub or in Visual Studio Code
- AI frameworks Torch, TensorFlow, Keras, PhotonAI, and Transformers
- Git for version control
- Docker
In the following sections, we will cover these tools, with Docker being used later and not needing to be installed immediately.
Unix Command Line
For beginners, working with the command line can be particularly challenging. While a programming environment like Visual Studio Code can provide assistance, basic knowledge is unavoidable for this introduction. In JupyterHub, you can access the command line by going to New Launcher => Terminal and, for example, use the `ls` command to display the contents of the current directory. With `cd <folder>`, you can navigate into a folder named `<folder>` (auto-completion with the TAB key is often helpful), and with `cd ..` you move up one level. Python commands are also important for us. If Python is installed, you can run the `python` program. A Python interpreter should then open, in which you can execute commands like `print("Hello World")`. With `quit()`, you can exit the program. Any running program can also be terminated with CTRL+C.
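As a minimal sketch, a short terminal session using these commands might look like this (the folder name `projects` is only an example):

```bash
ls            # list the contents of the current directory
cd projects   # change into the folder "projects" (TAB completes the name)
cd ..         # move back up one level
python        # start the Python interpreter; quit() or CTRL+C exits it
```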
SSH
SSH is a secure network protocol that enables encrypted connections over insecure networks. It is used to protect confidential data from unauthorized access and is frequently used for secure server access, file transfers, and remote commands.
In our tutorial, we will need SSH access for logging in to PALMA and in the chapter on OpenStack. We will explain how to find an existing SSH key or generate a new one.
A. Using an existing SSH key
You may already have an SSH key. In that case, you can simply use it. It is usually located in your home directory in a folder named `.ssh/`. If this folder is empty or does not exist, you probably don't have an SSH key and need to generate a new one (see the next section).
SSH keys always consist of a pair of private and public keys. There are different algorithms that can be used to generate the keys. The files are usually named accordingly:
| Algorithm | Public key | Private key |
|---|---|---|
| ED25519 (preferred) | `id_ed25519.pub` | `id_ed25519` |
| RSA (at least 2048-bit key) | `id_rsa.pub` | `id_rsa` |
| DSA (obsolete) | `id_dsa.pub` | `id_dsa` |
| ECDSA | `id_ecdsa.pub` | `id_ecdsa` |
B. Generating a new SSH key
If you don't have an SSH key, you need to generate a new key pair. You can use the `ssh-keygen` program to do so. You probably already have it installed (depending on your operating system) or installed it along with Git. Run the following command in a console:
```bash
ssh-keygen -t ed25519 -C "<comment>"
```
You will then be asked where you want to save the new key pair. If you’re unsure or don’t have a special reason, you should leave it at the default setting and confirm with Enter.
Next, you will be asked for a password. With this, you can restrict access to the key. It is possible not to provide a password, but note that then anyone with access to your device can also access the key. With the private key, you can impersonate yourself to the GitLab server (and other servers where you store your public key). The password entered here provides an additional security hurdle and must be entered every time you access the server (or the private key required for it).
Once you have found or generated your SSH keys, you can upload them to the IT Portal of the University of Münster. To do this, go to the left menu and click on Passwords and PINs => Manage public SSH keys.
Here, you can now copy the contents of the file containing the public key (ending in `*.pub`) into the input field or upload the file itself (you should find it in the `~/.ssh` folder).
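To copy the public key, you can simply print it in the terminal, for example (assuming an ED25519 key at the default location):

```bash
cat ~/.ssh/id_ed25519.pub
```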
Warning: Make sure to upload the public key and never give out your private key (the file without an extension)! In the worst case, others could use this key to impersonate you. You should therefore keep your private keys safe. Ideally, you should have a separate key pair for each SSH access to your systems.
Python
It's important to check which Python version is installed. This can be done by starting the Python interpreter in the command line, or by running the command `python --version`. If an older version is installed, the command `python3` may also be helpful. Along with Python, pip should also be installed, which is Python's package manager and allows new packages to be installed. The command `pip freeze` displays the currently installed packages. New packages can be installed using `pip install <package>`, which often also installs their dependencies.
WARNING: Since different versions of packages and frameworks have different dependencies (especially with AI frameworks), it may be sensible to create separate virtual environments for projects. Virtual environments (`venv`) provide individual, independent Python environments per project. If a different Python version is required, or more complex packages need to be installed, the package manager conda may also be helpful. At the same time, it's important to remove old environments, as some packages like TensorFlow require a lot of disk space.
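A minimal sketch of creating and using such a virtual environment with `venv` (the environment name `.venv` is only a convention):

```bash
python -m venv .venv              # create a new virtual environment in .venv
source .venv/bin/activate         # activate it (on Windows: .venv\Scripts\activate)
pip install -r requirements.txt   # install the pinned project dependencies
pip freeze                        # list what is installed in this environment
deactivate                        # leave the environment again
```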
An introduction to Python can be found at W3 Schools. Chatbots like ChatGPT can also assist with programming: a prompt like "Please help me to multiply a 3x3 matrix with a vector in python" should provide the desired lines of code and can be further specified to use the numpy package (see the sketch below). The packages numpy and pandas are fundamental for working with data and will be used extensively in this introduction.
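For illustration, the result of such a prompt could look roughly like this with numpy (the matrix and vector values are arbitrary examples):

```python
import numpy as np

# a 3x3 matrix and a vector of length 3
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
x = np.array([1, 0, -1])

# matrix-vector product
y = A @ x
print(y)  # [-2 -2 -2]
```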
Jupyter Notebooks
Python programs are individual text files, saved for example as `program.py` and executed using `python program.py`. However, this approach has limited debugging capabilities, which is why Jupyter Notebooks are suitable for beginners and especially for data science applications. Here, individual sections of a program can be entered and executed one after another. In JupyterHub, you can launch a notebook by clicking on Launcher => Notebook. In the top right corner, you can change the kernel used (the kernel is roughly the Python environment you want to use). In Visual Studio Code, you can create a notebook as a file with the `.ipynb` extension. Here, you can also choose the kernel (you can even connect to a remote kernel like JupyterHub and use it). To use Jupyter Notebooks, you may need to install them from the command line with `pip install notebook`.
In Jupyter Notebooks, you can enter code and execute individual sections using the Play button or Shift + Enter. If packages need to be installed in between, you may need to restart the kernel. Additionally, you can create individual text sections using the Markdown language to document the code or write instructions, as in this text.
Conda Environment
Conda is a platform-agnostic package and environment manager for Python. This open-source management system runs on Windows, macOS, and Linux operating systems. It allows creating environments tailored to the needs of the programs used, including specific Python versions, libraries, and hardware requirements. For example, a hypothetical Program A may only work with a specific version of Program B. If Program C then requires a different version of Program B, problems are inevitable.
To avoid these issues, Conda offers the following features:
- Conda enables easy installation, updating, and uninstallation of software packages, supporting both Python and non-Python packages.
- Conda allows creating and managing isolated environments with different packages and Python interpreters. This makes it possible for developers to switch between different projects without encountering dependency issues.
- As an open-source management system, Conda is platform-independent and runs on Windows, macOS, and Linux, enabling consistent management and collaboration among different individuals.
Conda is often used in conjunction with the Anaconda Distribution, which includes a collection of essential Python packages from the data science and scientific domains, or with the lighter Miniconda installation.
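A minimal sketch of the typical conda workflow (the environment name `myproject` and the Python version are only examples):

```bash
conda create -n myproject python=3.10   # create a new environment with its own Python
conda activate myproject                # switch into the environment
conda install numpy pandas              # install packages into this environment
conda env list                          # show all existing environments
conda remove -n myproject --all         # delete the environment when it is no longer needed
```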
AI Frameworks
AI frameworks are packages in Python. The most popular frameworks are TensorFlow and Torch, which allow you to load data, construct neural networks, and perform important operations. To avoid having to write entire network architectures and frequently executed operations from scratch, additional frameworks (such as Keras) have been developed, which use one of the fundamental frameworks as a backend. Furthermore, there is scikit-learn, a library with common operations. PhotonAI, developed at the University of Münster, can also help with developing and evaluating standard applications.
TensorFlow and Torch can run on the CPU or the GPU. To do this, the networks and data are loaded into the corresponding memory. For GPU applications, communication with the graphics card is based on CUDA. Put simply, you can push a model `model` or a `tensor` to the graphics card by calling `model.to('cuda')`. The calculations (mostly matrix multiplications) are then executed much faster (you can also use Torch instead of NumPy for many operations to execute them on the graphics card).
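A minimal sketch in Torch, with a fallback to the CPU if no CUDA-capable graphics card is available (the tiny model is only a placeholder):

```python
import torch
import torch.nn as nn

# use the GPU if CUDA is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(3, 1).to(device)   # push the model to the device
x = torch.randn(8, 3).to(device)     # push a batch of data to the same device

y = model(x)                         # the computation now runs on that device
print(y.shape, y.device)
```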
Git
Who hasn't seen a file named "Bachelorarbeit_v3_final_final.docx" before? Every developer has experienced breaking their own working programs through further development, and finding the error afterwards can be very time-consuming. You might also want to know how you solved something previously, even if you no longer need that part of the program. Versioning programs based on file names is impractical, which is why Git has become the standard. Git should also be installed on your local computer, which you can test using the command `git --version`. On JupyterHub, it's already installed and can be used without the command line (and you can also install an extension in Visual Studio Code).
Git is a system that allows you to clone a repository to your local computer and continue developing it there. When you save files, you can update them in the (local) repository using so-called commits. This enables you to see the history of the document. Further advantages arise when you upload these local repositories (push), so that these changes can be synchronized with other developers through a pull. Git offers much more in terms of collaborative work. A locally initialized repository can also be uploaded to a server like the GitLab of the University of Münster or GitHub.
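A minimal sketch of this workflow on the command line (the repository URL and the file name are only placeholders):

```bash
git clone https://<gitlab-server>/<group>/<project>.git   # copy a repository to your computer
cd <project>
git status                            # show which files have changed
git add train.py                      # stage a changed file
git commit -m "Describe the change"   # record the change in the local history
git push                              # upload the commits to the server
git pull                              # fetch and merge changes made by others
```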
You can also find a wide range of publicly available solutions on GitHub.
The University of Münster provides its own GitLab instance.
We will also provide an introduction to Git, as it is central to organizing code and deployment pipelines.
Creating a Personal Access Token
To access resources (e.g., Docker images or containers) in GitLab or to upload them, Personal Access Tokens (PAT) are used. Creating one is not complicated: go to your profile (e.g., via your avatar) and open Preferences -> Access Tokens. From here, you can create a new PAT by clicking on Add new token. After selecting a suitable name and expiration date, you can set the scopes of your PAT. Since we often want read and write access to repositories and registries, you should at least select api, write_registry, and read_registry. After creating the token, make sure to store it securely, as you can only see/copy it once after creation.
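As a sketch, such a token can then be used to authenticate on the command line, for example against the container registry (the registry hostname is only a placeholder):

```bash
docker login <registry-hostname> -u <gitlab-username>
# when prompted for a password, paste the Personal Access Token
```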
Docker
Docker will also be introduced throughout this tutorial. In principle, Docker allows you to provide individual applications in so-called containers. You can imagine these containers as slim virtual computers created solely for providing the application. All dependencies are installed within them. You can then run a built container locally or upload it to a server. Since Docker loads and installs all dependencies, building a container can be time-consuming. However, containers consist of various layers that build upon each other. Rebuilding only starts from the layer where a change has occurred, so good organization of the code pays off here in terms of efficiency. Additionally, a project can comprise multiple containers that together form an application with different parts (e.g., separate backend and frontend).
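A minimal sketch of a Dockerfile for a small Python application (file names and versions are only examples); note how the dependency layer is separated from the application code so that code changes do not force all packages to be reinstalled:

```dockerfile
FROM python:3.10-slim

WORKDIR /app

# install the dependencies first -- this layer stays cached as long as requirements.txt is unchanged
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# copy the application code last, so changes here only rebuild the following layers
COPY . .

CMD ["python", "app.py"]
```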
Docker not only enables you to make a complex project with different applications easily accessible to others but also to upload and run the application on a server (Deployment). You can either build the containers locally and upload them or build them directly in the GitLab of the University of Münster through a suitable pipeline from the code. Finished containers can be stored in a repository (e.g., Docker Hub or Harbor of the University of Münster), from which the servers can always pull the current (or desired) version of the container.
Hardware
In addition to software, hardware is also crucial, especially for AI applications. Training larger networks often requires significant resources. On the one hand, you can reduce the hardware demands through programming tricks (e.g., the batch size); on the other hand, you cannot avoid using graphics cards. Correctly installing frameworks in conjunction with graphics cards can be very time-consuming, which is why we will focus on the resources of the University of Münster in this tutorial, namely JupyterHub and the PALMA II supercomputer.
Two additional notes on the hardware used:
- As a processor (CPU), x86 CPUs are used, which are built into most computers and connected to the working memory (RAM)
- As a graphics card (GPU), NVIDIA graphics cards with video memory (VRAM) and the corresponding CUDA interface are used
Modern Apple computers use ARM processors instead of x86 architecture. Most AI applications also work on these, possibly requiring special versions of the frameworks to be installed. The architecture also has the advantage that the GPU and CPU share the working memory, so the limitation of the usually small VRAM is not a major obstacle. When providing Docker containers that run on other architectures, you need to pay particular attention to building the containers for the correct architecture.
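With Docker, the target architecture can be specified explicitly when building an image; a sketch using `docker buildx` (the image name and platform list are only examples):

```bash
# build an image for both x86 and ARM and push it to a registry
docker buildx build --platform linux/amd64,linux/arm64 -t <registry>/<image>:latest --push .
```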
Cloud Applications
The goal of this introduction is to build an AI application that works on the web, for example, through an API request to a server. It's important to note that while training a neural network requires significant resources, only a fraction of those resources are needed to run the application. To achieve this, the application with the pre-trained network needs to run on a web server. Docker containers are used for this, which can be run on different server systems. A Docker container is like a separate computer, where only what is required by the application is installed. Multiple containers can also work together to form an application, for example, if you want to separate the backend and frontend of an application or use completely different programming languages (JavaScript for the web application and Python for the AI).
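As a sketch, such a multi-container setup can be described in a `docker-compose.yml` (service names, ports, and folder layout are only assumptions):

```yaml
services:
  backend:            # Python API serving the pre-trained network
    build: ./backend
    ports:
      - "8000:8000"
  frontend:           # web application that talks to the backend
    build: ./frontend
    ports:
      - "3000:3000"
    depends_on:
      - backend
```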
By the end of this tutorial, simple AI applications will be provided as web applications. This will also cover the details of deployment pipelines (CI/CD), so that a change to the code in Git will ideally update the web application automatically.
The incubAItor relies on the cloud infrastructures provided by the university, OpenStack and Kubernetes. However, commercial providers also work with these or similar cloud infrastructures, so the transfer from the university cloud to commercial clouds should be straightforward with the right configuration.