Inference of Text-Generation models

On JupyterHub

In the first part of our tutorial about LLMs we will learn how to use a text-generation model from Huggingface on JupyterHub. If you have a local machine with CUDA installed, all the steps should be the same, but setting up the right environment correctly can be tedious.

Start a Jupyter notebook and select the standard kernel. Make sure you selected a GPU when starting the JupyterHub. If you need to install transformers, use pip in your default Python environment:

pip install transformers

PyTorch should be pre-installed (JupyterHub) or available in your module chain (PALMA). On a local machine it can be difficult to install CUDA and then torch in the correct environment.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
print(f'CUDA avail: {torch.cuda.is_available()}')
for i in range(torch.cuda.device_count()):
    device_properties = torch.cuda.get_device_properties(i)
    memory = device_properties.total_memory
    print(f'GPU {i}: {device_properties.name} with {int(memory / 1024**2)}MB RAM')

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

Here you can see whether CUDA is available and how much VRAM you can access. This notebook is written for a single GPU, usually cuda:0, but it should also run on cpu.

The Huggingface caching mystery

Huggingface provides a very simple interface for models. The package transformers downloads them automatically, but it is not obvious where they end up on your disk, and they can be huge! So we should be careful when loading the models.
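
If you are not sure what is already cached and how much space it takes, the huggingface_hub package (installed alongside transformers) can scan a cache directory for you. A minimal sketch, assuming a reasonably recent huggingface_hub version that provides scan_cache_dir:

from huggingface_hub import scan_cache_dir

# Scans the default Huggingface cache; pass a path to inspect another location.
cache_info = scan_cache_dir()
print(f'Total cache size: {cache_info.size_on_disk / 1024**3:.2f} GB')
for repo in cache_info.repos:
    print(f'{repo.repo_id}: {repo.size_on_disk / 1024**2:.0f} MB')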

We have different options for where to load our models:

  • The default Huggingface cache is the hidden folder ~/.cache/huggingface/models. But since the models are huge, this can easily fill up your home partition!
  • If you use PALMA (or have a PALMA user) you can use /scratch/tmp/$USER/huggingface/models/ and remove it later. (Note: we have not yet found a way to access scratch from JupyterHub.)
  • Otherwise, for small models (!) just use a non-hidden cache dir (e.g. ~/huggingface/models) and remove it later. If you get errors here, that might be due to permissions; then use the standard Huggingface cache.
  • For bigger models you could also use an OpenStack usershare /cloud/wwu1/{group}/{share}/cache (but PALMA scratch might be faster)
  • Or just leave it as it is, but be aware!
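
Whichever location you pick, you can either pass it explicitly via cache_dir to every from_pretrained call (as we do below) or set it once through an environment variable before transformers is imported. A small sketch, assuming the HF_HOME variable respected by the Huggingface libraries (put this at the very top of your notebook):

import os

# Must be set before the first import of transformers, otherwise the default cache is used.
# Note: the models then land in a subdirectory of this folder (hub/ in recent versions).
os.environ['HF_HOME'] = '/cloud/wwu1/d_reachiat/incubai/cache'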

So now remember where you stored your model; you will need the cache dir later on. We will continue loading the model into and from cache_dir = "/cloud/wwu1/d_reachiat/incubai/cache". Below you see how to download and start the smallest Pythia model. Pythia is a collection of open-source LLMs for text generation, similar to GPT (closed source) or Llama (restricted license).

# Download the model from Huggingface model hub
model_name = "EleutherAI/pythia-70m-deduped"
cache_dir = "/cloud/wwu1/d_reachiat/incubai/cache"
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir=cache_dir)
# Load a model from cache instead of Huggingface model hub
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir=cache_dir, local_files_only=True)

# Load model into GPU memory if available
# The model is first loaded into CPU memory and then moved to GPU memory
# To avoid this, you can load the model directly into GPU memory with the library accelerate

model = model.to(device)

# Print the memory reserved by the CUDA caching allocator for this process;
# this is not the total memory used by CUDA. For the total, use nvidia-smi.
if torch.cuda.is_available():
    print(f'VRAM reserved: {int(torch.cuda.memory_reserved(0) / 1024**2)}MB')
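
The comment above mentions the library accelerate: if it is installed, you can let from_pretrained place the weights on the GPU directly instead of loading them into CPU memory first. A minimal sketch, assuming accelerate is available in your environment:

# device_map="auto" requires the accelerate package; it loads the weights
# straight onto the available device(s), so no model.to(device) is needed afterwards.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    local_files_only=True,
    device_map="auto",
)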

We can now proceed to prompting. That means we give the model a sentence, and it generates new text that should continue the given input in a plausible way.

As you will see, the text generated by our small Pythia model is repetitive and not very good. Changing some parameters can help, but the small models used here for testing purposes are not well suited to finding good parameters.

To learn more about text generation strategies, you can visit the huggingface site about generation strategies.

You can now use the following way to prompt:

sentence = "The movie Gladiator was directed by Ridley Scott. The main actor is "

inputs = tokenizer(sentence, return_tensors="pt")
inputs = inputs.to(device)

tokens = model.generate(**inputs,
    do_sample = True, 
    max_length = 50,   # increase to test how long the generation stays coherent
    #temperature = .8,
    top_k = 50, 
    top_p = 0.85
) 
output = tokenizer.batch_decode(tokens, skip_special_tokens=True)
print(output[0])

In order to simplify this process, we can build a pipeline. The first argument of the pipeline is the task we want to use the pipeline for, in our case text generation. The other inputs are the previously defined model and tokenizer, as well as the arguments of the model.generate function from above and our specified device.

So let’s build our pipeline:

from transformers import pipeline

text_generation_pipe = pipeline('text-generation', 
                    model=model, 
                    tokenizer = tokenizer, 
                    do_sample = True, 
                    num_beams = 5,
                    max_length = 100,   # increase to test how long the generation stays coherent
                    #temperature = .8,
                    top_k = 50, 
                    top_p = 0.85,
                    device = device
                   )
text_generation_pipe(sentence)

If you have some prompts in a text file, you can load that file and use a pipeline to process them. It is more efficient to let the pipeline iterate over the data than to loop over the pipeline yourself.

sentences = ["The following essay is about the history of quantum mechanics.", 
             "The movie Gladiator was directed by Ridley Scott. It starred ",
             "The Eiffel Tower is in Paris, France. It is made of",
             "I want you to help me with my homework essay for school. The topic is the history of the Roman Empire.",
             "The following is a list of the tallest buildings in the world. The tallest building in the world is the",
             "The negative binomial distribution has the following pdf $$f(x; r, p) = \binom{x+r-1}{r-1} p^r (1-p)^x,$$ where",
             ]
def prompts():
    for sentence in sentences:
        yield sentence

for answer in text_generation_pipe(prompts(), pad_token_id=tokenizer.eos_token_id):
    print(answer) # or answer[0]['generated_text'] if you want to print only the generated text
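
If your GPU has memory to spare, you can also let the pipeline batch several prompts per forward pass. batch_size is a standard pipeline argument, but for GPT-style models you first need to define a padding token; a sketch:

# GPT-style tokenizers have no padding token by default, so reuse the end-of-sequence token.
tokenizer.pad_token_id = tokenizer.eos_token_id

for answer in text_generation_pipe(prompts(), batch_size=2, pad_token_id=tokenizer.eos_token_id):
    print(answer[0]['generated_text'])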

Now we were hopefully able to:

  • download a text-generation model
  • load the model from a self-defined cache
  • use multiple prompts to test the model

Use a Python script

In order to move to PALMA and deploy bigger models, we need to convert everything into a Python script. You can find the scripts in 2_2_LLMs/text-generation/scripts/. You can now check whether the script runs by trying the following command in the terminal:

python pythia.py --cache_dir /cloud/wwu1/d_reachiat/incubai/cache --size 70m --prompt "My sample prompt"

You might also experiment with a prompt collection like the one in ~/2_2_LLMs/text-generation/data/prompts.txt and an outfile with

python pythia.py --cache_dir /cloud/wwu1/d_reachiat/incubai/cache --size 70m --prompt_file ../data/prompts.txt --out_file out.csv

where you can get a nice csv of your prompts and the generated text of the model.
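
The actual pythia.py lives in the scripts folder of the repository; the following is only a rough sketch of what such a script can look like, so you know what to adapt. The flag names match the calls above; everything else (defaults, details) is an assumption and not the repository version:

import argparse

import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

parser = argparse.ArgumentParser()
parser.add_argument('--cache_dir', required=True)
parser.add_argument('--size', default='70m')
parser.add_argument('--prompt', default=None)
parser.add_argument('--prompt_file', default=None)
parser.add_argument('--out_file', default=None)
args = parser.parse_args()

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
model_name = f'EleutherAI/pythia-{args.size}-deduped'

tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=args.cache_dir)
model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir=args.cache_dir).to(device)
pipe = pipeline('text-generation', model=model, tokenizer=tokenizer,
                do_sample=True, max_length=100, top_k=50, top_p=0.85, device=device)

# Collect prompts either from a text file (one prompt per line) or from the command line.
if args.prompt_file:
    with open(args.prompt_file) as f:
        prompts = [line.strip() for line in f if line.strip()]
elif args.prompt:
    prompts = [args.prompt]
else:
    raise SystemExit('Provide --prompt or --prompt_file')

results = [pipe(p, pad_token_id=tokenizer.eos_token_id)[0]['generated_text'] for p in prompts]

# Either write a CSV of prompts and generations, or just print the generations.
if args.out_file:
    pd.DataFrame({'prompt': prompts, 'generated_text': results}).to_csv(args.out_file, index=False)
else:
    for text in results:
        print(text)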

If all this works for you on the JupyterHub, you might be interested to deploy bigger models on PALMA.

When you run the script you will get information about the GPU memory (VRAM) usage of the model. Add a CUDA overhead of about 1 GB to get the expected total memory usage; thus, the 6.9b Pythia model is too big for JupyterHub. While the pipeline is running, you can open a terminal and type nvidia-smi to check the memory usage.
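
You can also estimate the weight size directly from a loaded model; a small sketch:

# Sum over all parameters: number of elements times bytes per element.
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f'Model weights: {int(param_bytes / 1024**2)}MB (add roughly 1 GB of CUDA overhead)')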

Moving to PALMA

If everything goes right, we now want to move to PALMA and use the GPU partitions there to run bigger models than we can on JupyterHub. For a first test, the gpuexpress partition is suitable.

If you don’t know how to use PALMA, read our tutorial in 2_1_PALMA. Also the HPC Wiki gives a good overview about how to use PALMA.

Installing requirements

We now want to use the shell scripts in the folder 2_2_LLMs/text-generation/jobs to generate text from our models.

We use a specific so-called toolchain to be able to use CUDA. The following toolchain is suitable:

module load palma/2021a
module load foss/2021a
module load PyTorch/1.10.0-CUDA-11.3.1

You can find this toolchain by typing module spider PyTorch. But as the login node runs on a different architecture, you would need a job script as below to find the right module name and CUDA version on the compute architecture.

Typing these commands on the command line shows that the last module is not available on the login node. Therefore, to install further packages, we must be inside this toolchain. To make sure that the right Python and PyTorch versions are used, we install the package transformers via pip with the job script install.sh. As we use Torch 1.10, which is a fairly old version, we take care to use a compatible transformers version, for instance transformers==4.33.1.

Then we can run the install script on the right architecture with the command sbatch install.sh in the directory /2_2_LLMs/text-generation/jobs/. When the job is finished, check the outfile with vi so you can be sure that no new torch version was installed (which might bring a lot of conflicts).
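
You can also double-check from inside a job which versions are actually picked up; a quick sketch:

import torch
import transformers

# torch should still be the version from the module toolchain (1.10.0 with CUDA 11.3),
# and transformers the pinned version (e.g. 4.33.1).
print(torch.__version__, torch.version.cuda)
print(transformers.__version__)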

In case something went wrong, remove the installed packages in your home directory (due to the --user flag they are installed into ~/.local/, you can remove the folders there).

Prepare and run the model

Hopefully everything went right so far. Now check the pythia-70m-test.sh script. If you have your model in your usershare, use it as the model dir in the script. If you don't have a usershare, you can copy the whole model directory to your scratch directory. For instance, on PALMA (!) use

cp -r ~/cloud/wwu1/u_jupyterhub/home/<first letter of username>/<username>/.cache/huggingface/models/models--EleutherAI--pythia-70m-deduped $WORK/2_2_LLMs/text-generation/models/

if you used the standard Huggingface cache (see the caching section above). There is no nice way to download the Huggingface models directly to scratch, so if necessary, start a job (see above) that downloads the models to the scratch dir, even if the model itself does not run (or crashes due to limits).

Now the data should be in your scratch directory and we should be ready to run the first small model. Go back to ~/incubaitor/2_2_LLMs/text-generation/jobs/ and start the job with sbatch pythia-70m-test.sh. In the Slurm output file (vi slurm-pythia-test-1b-express.out) you can check whether everything went well. Furthermore, the out-file should be on your scratch partition. You can read its contents with vi /scratch/tmp/<username>/pythia-70m-express.csv, or copy it to $WORK/transfer if you prepared the PALMA Nextcloud integration and download it via the web interface (still under development). For further info about how to transfer data, visit the HPC documentation.

If you are happy with the results, test if you can also get the 1b version to work in the same way.

Change the script for your needs

If you want to change things in the script or test other functions of the model, you can also play with the small models using the JupyterHub. If resources are available, you can also start the jupyter.sh script on PALMA and play around on your own machine. When you are ready, you can make these changes in the pythia.py file (the best way would be to clone it into your private git, make changes, pull the changes to PALMA and run the script for testing purposes on a small model).

Then, if resources are available, you can try to run inference on a bigger model on PALMA. See the job scripts for the 6.9b and 12b models.

Llama-2 and other models

The Llama text-generation models are provided by Meta. To download the Huggingface version, you need to register with Meta for the Llama-2 models and need a Huggingface access token. After downloading such a model, for instance on JupyterHub, and caching it to your usershare or scratch directory, you can access it on PALMA. The smallest model might already be too big for JupyterHub and crash your kernel, but JupyterHub can still be a convenient way to download it.
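
Once Meta and Huggingface have granted you access, you have to authenticate before the download works. A minimal sketch, assuming the gated model id meta-llama/Llama-2-7b-hf and the login helper from huggingface_hub (the token comes from your Huggingface account settings):

from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer

login(token='hf_...')  # placeholder: paste your own Huggingface access token here

llama_name = 'meta-llama/Llama-2-7b-hf'
tokenizer = AutoTokenizer.from_pretrained(llama_name, cache_dir=cache_dir)
model = AutoModelForCausalLM.from_pretrained(llama_name, cache_dir=cache_dir)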

If you got that far, see the llama.py script and the corresponding job script. The only change in the Python file is the model selection. This way you can adapt your script to whatever Huggingface text-generation model you want to use.

Beyond text generation

There are many other types of models available on Huggingface, and they all work with similar pipelines. You can check on the model card (top right, </> Use in Transformers) how to load a model and how to build a pipeline. Because of the caching issues discussed above, use a similar approach for the cache dir. (Remember to set local_files_only=False when downloading a model for the first time!)

Then you need to check how to provide the pipeline with input and what the output looks like. This should also be described on the model card. For instance, the following can be used for zero-shot text classification:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

del text_generation_pipe  # release the text-generation pipeline before loading the next model

model_checkpoint = "facebook/bart-large-mnli"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, cache_dir=cache_dir, local_files_only=True)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, cache_dir=cache_dir, local_files_only=True)
model = model.to(device)

classifier = pipeline(
    "zero-shot-classification",
    model=model,
    tokenizer=tokenizer,
    device = device
)
sequence_to_classify = "Given our strong start to 2021 and underlying acquisition retention and monetization of players we are increasing our guidance to $1.05 billion to $1.15 billion of revenue for 2021 which equates to year-over-year growth of 63% to 79% and a 16% increase compared to the midpoint of our prior guidance."
candidate_labels = ['increase', 'decrease']

result = classifier(sequence_to_classify, candidate_labels)

print(result)

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_checkpoint = "consciousAI/question-answering-roberta-base-s-v2"
del classifier  # release the zero-shot pipeline before loading the next model

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, cache_dir=cache_dir, local_files_only=True)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint, cache_dir=cache_dir, local_files_only=True)
model = model.to(device)

question_answerer = pipeline(
    "question-answering",
    model=model,
    tokenizer=tokenizer,
    device = device
)
sentence = "Our active customer count grew by 57% over the first quarter of last year to reach $33 million; and we delivered 14.7 million orders 49% more than the year prior"
question = "How many orders?"
result = question_answerer(question=question, context=sentence)
print(result)

You can also use multiple questions (on multiple texts) iterating through the pipeline.

import pandas as pd

df = pd.DataFrame({'text': [sentence] * 3, 'question': ['how many orders?', 'how many customers?', 'how much revenue?']})


def prompts():
    for i, row in df.iterrows():
        yield {'context': row['text'], 'question': row['question']}


for answer in question_answerer(prompts()):
    print(answer['answer'])

Now you can use Huggingface models! Further models for audio or image recognition will need some other packages such as OpenCV, which might also be available in a PALMA toolchain.