Making InstructLab 'go fast' on WSL

I mentioned in an earlier article that I was hooked the moment I saw a machine learning / artificial intelligence model in action (in this case, a genetic algorithm). Now that generative AI has taken hold, it's incredibly exciting to see the innovation and new approaches taking off.

One of those new approaches, and one of the most exciting projects I've worked with recently, is InstructLab. In this article I'll look at my home-lab setup for InstructLab, and how I've configured the framework to use NVIDIA CUDA on the Windows Subsystem for Linux (WSL).

Technologies

InstructLab

Before I get too much further I should properly introduce InstructLab.

InstructLab is an open source framework for adding new knowledge and skills to large language models (LLMs). It's named after and based on IBM Research's work on Large-scale Alignment for chatBots, abbreviated as "LAB". The LAB method is described in a 2024 research paper by members of the MIT-IBM Watson AI Lab and IBM Research.

InstructLab works by using a human-readable, question-and-answer approach to adding new skills and knowledge to an existing generative AI model (LLM). It then uses synthetic data - generated by an LLM - to build a set of training data used to fine-tune the model.

You can see the process described here (image available from Red Hat):

lab-process

I think the best part about InstructLab is that it's not model-specific. It can support supplemental skills and knowledge for any LLM. This means that I can take the InstructLab framework to any domain and model, and add new skills and knowledge.

You can see an example of the question-and-answer framework that InstructLab uses below. Each set of questions has context (in this case, about the tonsils), and then we introduce a couple of example question-and-answer combinations.

- context: |
      The **tonsils** are a set of [lymphoid](Lymphatic_system "wikilink")
      organs facing into the aerodigestive tract, which is known as
      [Waldeyer's tonsillar ring](Waldeyer's_tonsillar_ring "wikilink") and
      consists of the [adenoid tonsil](adenoid "wikilink") (or pharyngeal
      tonsil), two [tubal tonsils](tubal_tonsil "wikilink"), two [palatine
      tonsils](palatine_tonsil "wikilink"), and the [lingual
      tonsils](lingual_tonsil "wikilink"). These organs play an important role
      in the immune system.

    questions_and_answers:
      - question: What is the immune system's first line of defense?
        answer: |
          The tonsils are the immune system's first line of defense against
          ingested or inhaled foreign pathogens.
      - question: What is Waldeyer's tonsillar ring?
        answer: |
          Waldeyer's tonsillar ring is a set of lymphoid organs facing into the
          aerodigestive tract, consisting of the adenoid tonsil, two tubal
          tonsils, two palatine tonsils, and the lingual tonsils.
      - question: How many tubal tonsils are part of Waldeyer's tonsillar ring?
        answer: There are two tubal tonsils as part of Waldeyer's tonsillar ring.

InstructLab supports adding supplemental knowledge and skills to a large language model (LLM). Skills are performative ("things you can do"), while knowledge is more about answering questions that involve facts, data, or references.

The above example is a knowledge contribution. It is teaching the model new facts and data. A skill, on the other hand, might look more like this:

version: 2
task_description: 'Teach the model how to rhyme.'
created_by: juliadenham
seed_examples:
  - question: What are 5 words that rhyme with horn?
    answer: warn, torn, born, thorn, and corn.
  - question: What are 5 words that rhyme with cat?
    answer: bat, gnat, rat, vat, and mat.
  - question: What are 5 words that rhyme with poor?
    answer: door, shore, core, bore, and tore.
  - question: What are 5 words that rhyme with bank?
    answer: tank, rank, prank, sank, and drank.
  - question: What are 5 words that rhyme with bake?
    answer: wake, lake, steak, make, and quake.

It's clear here that knowledge contributions require far more content than skills. An entire skill could be just a few lines of YAML in the qna.yaml file ("qna" is short for "questions and answers") and an attribution.txt file for citing sources.
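To make that concrete, here's a minimal sketch of where a skill contribution could live in a local taxonomy clone. The path below is hypothetical - the real taxonomy repository defines its own directory tree - but every leaf directory holds a qna.yaml plus an attribution.txt.

mkdir -p ~/.local/share/instructlab/taxonomy/compositional_skills/writing/rhyming
cd ~/.local/share/instructlab/taxonomy/compositional_skills/writing/rhyming

# qna.yaml holds the seed examples (like the rhyming skill above);
# attribution.txt lists the sources being cited for the contribution
touch qna.yaml attribution.txt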

Windows Subsystem for Linux (WSL)

The other piece of my InstructLab setup is the Windows Subsystem for Linux (WSL). Windows is still the desktop of choice for many organisations. However, many AI frameworks and tools (like InstructLab) run on Linux. How do we bring the two together?

Traditionally, if you wanted to use Linux alongside Windows you would need to either create a separate virtual machine or dual-boot the system. WSL exists so that developers, AI practitioners, or anyone else wanting to use Windows and Linux together can use both at the same time.

This article focuses on accelerating InstructLab on WSL. If you're not using WSL, there's a great guide available here for other platforms. I'm also using Fedora on WSL, and there's an excellent article here to get a Fedora environment set up on WSL.

I should note here that the model I'll get out of training on my WSL and InstructLab environment will be pretty low fidelity. I only have an RTX 3070 and 64GB of RAM available for training, which isn't going to give me a production, high-fidelity model. It will let me experiment with InstructLab and start to understand the framework better, though.

If I want to create a high-fidelity model, I'm going to need a lot more VRAM. Initial testing from Red Hat indicates that 320GB of VRAM (4 x NVIDIA H100 GPUs) or equivalent is needed to complete an end-to-end InstructLab run in a reasonable amount of time.

NVIDIA CUDA

I have an NVIDIA GPU available, and one of the core capabilities I'm using to make InstructLab "go fast" is NVIDIA CUDA.

NVIDIA CUDA is a parallel computing platform and programming model that allows developers to use an NVIDIA GPU's massive parallel processing power to accelerate applications. It provides a high-level programming interface and runtime environment, making it easier to write and execute code on GPUs.

I don't need to explicitly write any new code to make InstructLab "go fast", but InstructLab relies on CUDA support within two libraries:

  • llama-cpp-python, the Python bindings for llama.cpp, supporting inference for Meta's LLaMA models
  • PyTorch, a machine learning library.

Getting started with InstructLab and WSL

Ok! I've followed the article here and I have a WSL Fedora 40 environment I can use for my InstructLab experiments.

The first thing I need to do is create a new directory to store the files the InstructLab CLI (ilab) needs when running:

mkdir instructlab
cd instructlab

Now let's install dependencies and set up a Python virtual environment. Note that InstructLab doesn't yet support Python 3.12, so we need to explicitly install Python 3.11 for the latest InstructLab release.

sudo dnf install gcc gcc-c++ make git python3.11 python3.11-devel
python3.11 -m venv --upgrade-deps venv
source venv/bin/activate
pip install instructlab

InstructLab is now installed, but we need to configure it. Run the following command and follow the prompts:

ilab config init

Welcome to InstructLab CLI. This guide will help you to setup your environment.
Please provide the following values to initiate the environment [press Enter for defaults]:
Path to taxonomy repo [/home/user/.local/share/instructlab/taxonomy]:
`/home/user/.local/share/instructlab/taxonomy` seems to not exist or is empty. Should I clone https://github.com/instructlab/taxonomy.git for you? [Y/n]: Y
Cloning https://github.com/instructlab/taxonomy.git...
Path to your model [/home/user/.cache/instructlab/models/merlinite-7b-lab-Q4_K_M.gguf]:
Generating `/home/user/.config/instructlab/config.yaml`...
Please choose a train profile to use:
[0] No profile (CPU-only)
[1] A100_H100_x2.yaml
[2] A100_H100_x4.yaml
[3] A100_H100_x8.yaml
[4] L40_x4.yaml
[5] L40_x8.yaml
[6] L4_x8.yaml
Enter the number of your choice [hit enter for the default CPU-only profile] [0]: 0
Using default CPU-only train profile.
Initialization completed successfully, you're ready to start using `ilab`. Enjoy!

Now let's download a model. The merlinite-7b model is about 5GB, and this command will take a while to complete:

ilab model download

Downloading model from Hugging Face: instructlab/merlinite-7b-lab-GGUF@main to /home/user/.cache/instructlab/models...
Downloading 'merlinite-7b-lab-Q4_K_M.gguf' to '/home/user/.cache/instructlab/models/.cache/huggingface/download/merlinite-7b-lab-Q4_K_M.gguf.9ca044d727db34750e1aeb04e3b18c3cf4a8c064a9ac96cf00448c506631d16c.incomplete'
INFO 2024-09-20 10:29:40,833 huggingface_hub.file_download:1908: Downloading 'merlinite-7b-lab-Q4_K_M.gguf' to '/home/user/.cache/instructlab/models/.cache/huggingface/download/merlinite-7b-lab-Q4_K_M.gguf.9ca044d727db34750e1aeb04e3b18c3cf4a8c064a9ac96cf00448c506631d16c.incomplete'
merlinite-7b-lab-Q4_K_M.gguf:  22%|███████████████                                                      | 954M/4.37G [00:36<02:09, 26.3MB/s]
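Once the download finishes, a quick check confirms the GGUF file has landed in the models cache:

ls -lh ~/.cache/instructlab/models/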

Now that we have a model we can start chatting:

ilab model chat

INFO 2024-09-20 10:35:24,939 instructlab.model.backends.llama_cpp:104: Trying to connect to model server at http://127.0.0.1:8000/v1
╭───────────────────────────────────────────────────────────────── system ─────────────────────────────────────────────────────────────────╮
│ Welcome to InstructLab Chat w/ MERLINITE-7B-LAB-Q4_K_M.GGUF (type /h for help)                                                           │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
>>> Hello. What's your name?                                                                                                    [S][default]
╭────────────────────────────────────────────────────── merlinite-7b-lab-Q4_K_M.gguf ──────────────────────────────────────────────────────╮
│ Thank you for asking! I am Red Hat® Instruct Model based on Granite 7B. It's great to meet you and have the opportunity to assist you    │
│ with any questions or concerns you might have related to Red Hat products, technologies, or services. Please feel free to ask me         │
│ anything about these topics, and I will do my best to provide accurate and helpful information.                                          │
│ If you have any general inquiries, I'm here to help! However, for specific technical support or product-related questions, it would be   │
│ more appropriate to contact the Red Hat Customer Portal or the relevant Red Hat support team. They can offer tailored assistance and     │
│ guidance based on your unique situation.                                                                                                  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────── elapsed 21.049 seconds ─╯
>>>                                                                                                                             [S][default]

Great! It looks like we can chat with the model. Let's check that we have GPU support for InstructLab.

ilab system info

sys.version: 3.11.10 (main, Sep  9 2024, 00:00:00) [GCC 14.2.1 20240801 (Red Hat 14.2.1-1)]
sys.platform: linux
os.name: posix
...

Oof, that's a huge amount of information. Let's unpick it.

This part is straightforward: we're running Fedora 40 on WSL 2, with Python 3.11.10.

platform.release: 5.15.153.1-microsoft-standard-WSL2
platform.machine: x86_64
platform.node: HelixWin
platform.python_version: 3.11.10
os-release.ID: fedora
os-release.VERSION_ID: 40
os-release.PRETTY_NAME: Fedora Linux 40 (Container Image)

We're using InstructLab 0.18.4 with associated dependencies:

instructlab.version: 0.18.4
instructlab-dolomite.version: 0.1.1
instructlab-eval.version: 0.1.2
instructlab-quantize.version: 0.1.0
instructlab-schema.version: 0.3.1
instructlab-sdg.version: 0.2.7
instructlab-training.version: 0.4.2

PyTorch has been compiled with CUDA support, and can access the NVIDIA RTX 3070 in this system:

torch.version: 2.3.1+cu121
torch.backends.cpu.capability: AVX2
torch.version.cuda: 12.1
torch.version.hip: None
torch.cuda.available: True
torch.backends.cuda.is_built: True
torch.backends.mps.is_built: False
torch.backends.mps.is_available: False
torch.cuda.bf16: True
torch.cuda.current.device: 0
torch.cuda.0.name: NVIDIA GeForce RTX 3070
torch.cuda.0.free: 6.9 GB
torch.cuda.0.total: 8.0 GB

This part is a bit concerning, though. InstructLab relies on the llama-cpp-python bindings, and the build installed here has no GPU offload support:

llama_cpp_python.version: 0.2.79
llama_cpp_python.supports_gpu_offload: False
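ilab system info derives this from the libraries themselves, so you can confirm the same thing directly from the virtual environment. A small sketch (llama_supports_gpu_offload is exposed by recent llama-cpp-python releases):

# Run with the InstructLab virtual environment activated
python -c "import torch; print('torch CUDA available:', torch.cuda.is_available())"
python -c "import llama_cpp; print('llama.cpp GPU offload:', llama_cpp.llama_supports_gpu_offload())"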

Let's see if we can fix this and really get InstructLab 'going fast'!

Making InstructLab 'go fast' on WSL

As we saw above, my ilab install currently isn't using the NVIDIA CUDA cores available - it's doing most of its processing on the CPU. This means that synthetic data generation will take a very long time, slowing down the process of adding new skills and knowledge to this model. If we want to make InstructLab "go fast", we need to:

  • Build a version of the llama-cpp python bindings with CUDA support for this system
  • Configure InstructLab to use the CUDA-enabled llama-cpp python bindings, instead of the default.

Installing NVIDIA CUDA toolkit on Fedora 40 on WSL

To use InstructLab with NVIDIA CUDA, we need a driver. One of the great things about WSL is that the NVIDIA Windows GPU Driver fully supports WSL 2. Because there is CUDA support in the driver, existing applications (compiled elsewhere on a Linux system for the same target GPU) can run unmodified within the WSL environment.

wsl-gpu
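You can see this passthrough from inside the WSL distribution. A quick check (a sketch, assuming a standard WSL 2 setup with the NVIDIA Windows driver installed):

# The Windows host driver surfaces CUDA inside WSL 2 as a stubbed libcuda.so
ls -l /usr/lib/wsl/lib/libcuda*

# nvidia-smi here is the passthrough binary provided by the same driver
nvidia-smi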

To build the llama-cpp python bindings with CUDA support for WSL we need a CUDA toolkit. Fedora 40 ships with GCC 14, and CUDA does not yet support GCC 14, so we need to install GCC 13. You can find a guide to build GCC 13 on Fedora 40 here: https://www.if-not-true-then-false.com/2023/fedora-build-gcc/

It took a couple of hours to build GCC 13 on my WSL setup, and once it's finished you should see that both GCC 13 and GCC 14 are available:

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/14/lto-wrapper
...
gcc version 14.2.1 20240912 (Red Hat 14.2.1-3) (GCC)

$ gcc-13.2 -v
Using built-in specs.
COLLECT_GCC=gcc-13.2
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/13/lto-wrapper
...
gcc version 13.2.0 (GCC)

Let's update the PATH and library path by adding the following to ~/.bashrc:

case ":${PATH}:" in
  *:"/usr/local/cuda/bin":*)
    ;;
  *)
    PATH=/usr/local/cuda/bin:$PATH
esac

case ":${LD_LIBRARY_PATH}:" in
  *:"/usr/local/cuda/lib64":*)
    ;;
  *)
  if [ -z "${LD_LIBRARY_PATH}" ] ; then
    LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
  else
    LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH
  fi
esac

HOST_COMPILER='gcc-13.2 -lstdc++ -lm'

export PATH LD_LIBRARY_PATH HOST_COMPILER

Now that we have GCC 13 available we can install the rest of the CUDA Toolkit on WSL. This is where we need to be really careful though. We need to install the NVIDIA CUDA toolkit, but not install the NVIDIA driver that comes packaged with the toolkit. This is because the CUDA driver installed on the Windows host will be stubbed inside the WSL 2 environment as libcuda.so, and we don't want to overwrite this.

The NVIDIA team has created a CUDA release without the GPU driver, listed as WSL-Ubuntu, which supports exactly our use case. If you navigate to the NVIDIA CUDA Toolkit downloads page you'll see an option for WSL-Ubuntu. This is the download you want.

ilab1

Click the runfile (local) link and follow the instructions to install the CUDA Toolkit.

ilab2

Note that you will likely see an error like this in the logs when trying to run the Linux installer:

$ wget https://developer.download.nvidia.com/compute/cuda/12.6.1/local_installers/cuda_12.6.1_560.35.03_linux.run
cuda_12.6.1_560.35.0 100% [=====================================================================>]    4.04G   51.69MB/s
                          [Files: 1  Bytes: 4.04G [56.94MB/s] Redirects: 0  Todo: 0  Errors: 0   ]

$ sudo sh cuda_12.6.1_560.35.03_linux.run
Failed to verify gcc version. See log at /var/log/cuda-installer.log for details.

The log shows this:

[INFO]: Checking compiler version...
[INFO]: gcc location: sh: line 1: which: command not found

[ERROR]: Missing gcc. gcc is required to continue.

Let's install which and try the CUDA installation again:

$ sudo dnf install /usr/bin/which
$ sudo sh cuda_12.6.1_560.35.03_linux.run

You should see the following when the script runs successfully:

===========
= Summary =
===========

Driver:   Not Selected
Toolkit:  Installed in /usr/local/cuda-12.6/

Please make sure that
 -   PATH includes /usr/local/cuda-12.6/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-12.6/lib64, or, add /usr/local/cuda-12.6/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-12.6/bin
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 560.00 is required for CUDA 12.6 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
    sudo <CudaInstaller>.run --silent --driver

Logfile is /var/log/cuda-installer.log

You can ignore the warning about the driver installation, as the CUDA driver is already provided by the Windows host NVIDIA driver.
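Before moving on, it's worth a quick check that the toolkit is visible via the PATH changes we made to ~/.bashrc earlier (a small sketch):

# Pick up the PATH and LD_LIBRARY_PATH changes
source ~/.bashrc

# The compiler driver from the new toolkit should report CUDA 12.6
nvcc --version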

InstructLab install - revisited

Ok! At this point we have CUDA happily installed using the correct GCC version and without the NVIDIA driver. Now we can rebuild the llama-cpp Python bindings with CUDA support and re-install InstructLab against them:

# setup virtual env
source ~/instructlab/venv/bin/activate

# Ensure CUDA can be found via PATH and LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
export PATH=$PATH:$CUDA_HOME/bin

# Recompile llama-cpp-python using CUDA
pip cache remove llama_cpp_python
sudo dnf install clang17
CUDAHOSTCXX=$(which clang++-17) pip install --force-reinstall llama_cpp_python==0.2.79 -C cmake.args="-DLLAMA_CUDA=on"

# Re-install InstructLab with CUDA
pip install instructlab[cuda]

Finally, let's check ilab system info and make sure that GPU support is enabled.

$ ilab system info
...
llama_cpp_python.version: 0.2.79
llama_cpp_python.supports_gpu_offload: True

Great! We've added GPU support for InstructLab, and synthetic data generation shouldn't take long now! Let's try it out.

I looked at a health-related application on OpenShift a few months ago, so let's continue the theme and look at a similar InstructLab example. Create a new directory for a health-related knowledge contribution:

mkdir -p ~/.local/share/instructlab/taxonomy/knowledge/health/anatomy_tonsil

Add the following file into ~/.local/share/instructlab/taxonomy/knowledge/health/anatomy_tonsil/qna.yaml:

created_by: lukeinglis
domain: anatomy_tonsil
version: 3
seed_examples:
  - context: |
      ## Structure
      Humans are born with four types of tonsils: the pharyngeal tonsil, two
      tubal tonsils, two palatine tonsils, and the lingual tonsils.[1]

      <table>
      <thead>
      <tr class="header">
      <th><p>Type</p></th>
      <th><p><a href="Epithelium" title="wikilink">Epithelium</a></p></th>
      <th><p><a href=":wikt:capsule" title="wikilink">Capsule</a></p></th>
      <th><p><a href="Tonsillar_crypts" title="wikilink">Crypts</a></p></th>
      <th><p>Location</p></th>
      </tr>
      </thead>
      <tbody>
      <tr class="odd">
      <td><p><a href="Adenoid" title="wikilink">Pharyngeal tonsil</a> (also
      termed "adenoid")</p></td>
      <td><p><a href="pseudostratified_epithelium" title="wikilink">Ciliated
      pseudostratified columnar</a> (<a href="respiratory_epithelium"
      title="wikilink">respiratory epithelium</a>)</p></td>
      <td><p>Incompletely encapsulated</p></td>
      <td><p>Small folds—sometimes described as crypts<a href="#fn1"
      class="footnote-ref" id="fnref1"
      role="doc-noteref"><sup>1</sup></a></p></td>
      <td><p>Roof of <a href="pharynx" title="wikilink">pharynx</a></p></td>
      </tr>
      <tr class="even">
      <td><p><a href="Tubal_tonsils" title="wikilink">Tubal tonsils</a></p></td>
      <td><p>Ciliated pseudostratified columnar (respiratory epithelium)</p></td>
      <td><p>Not encapsulated</p></td>
      <td><p>No crypts</p></td>
      <td><p>Roof of pharynx</p></td>
      </tr>
      <tr class="odd">
      <td><p><a href="Palatine_tonsils" title="wikilink">Palatine tonsils</a></p></td>
      <td><p>Stratified squamous epithelium</p></td>
      <td><p>Fully encapsulated</p></td>
      <td><p>Multiple deep crypts</p></td>
      <td><p>Each side of the throat at the back of the mouth</p></td>
      </tr>

    questions_and_answers:
      - question: What is the location of the tubal tonsils?
        answer: The location of the tubal tonsils is the roof of the pharynx.
      - question: |
          Compare the epithelial types, encapsulation, and presence of
          crypts in the pharyngeal, tubal, and palatine tonsils according to the
          table provided.
        answer: |
          The pharyngeal tonsil features ciliated pseudostratified columnar
          epithelium and is incompletely encapsulated with small folds sometimes
          described as crypts. The tubal tonsils also have ciliated
          pseudostratified columnar epithelium but are not encapsulated and do
          not possess crypts. In contrast, the palatine tonsils are covered with
          stratified squamous epithelium, are fully encapsulated, and contain
          multiple deep crypts. These structural differences are indicative of
          their varied anatomical locations and potentially their distinct
          functions within the immune system.
      - question: What type of epithelium is found in the pharyngeal tonsil?
        answer: |
          The type of epithelium found in the pharyngeal tonsil is ciliated
          pseudostratified columnar (respiratory epithelium).


  - context: |
      The **tonsils** are a set of [lymphoid](Lymphatic_system "wikilink")
      organs facing into the aerodigestive tract, which is known as
      [Waldeyer's tonsillar ring](Waldeyer's_tonsillar_ring "wikilink") and
      consists of the [adenoid tonsil](adenoid "wikilink") (or pharyngeal
      tonsil), two [tubal tonsils](tubal_tonsil "wikilink"), two [palatine
      tonsils](palatine_tonsil "wikilink"), and the [lingual
      tonsils](lingual_tonsil "wikilink"). These organs play an important role
      in the immune system.

    questions_and_answers:
      - question: What is the immune system's first line of defense?
        answer: |
          The tonsils are the immune system's first line of defense against
          ingested or inhaled foreign pathogens.
      - question: What is Waldeyer's tonsillar ring?
        answer: |
          Waldeyer's tonsillar ring is a set of lymphoid organs facing into the
          aerodigestive tract, consisting of the adenoid tonsil, two tubal
          tonsils, two palatine tonsils, and the lingual tonsils.
      - question: How many tubal tonsils are part of Waldeyer's tonsillar ring?
        answer: There are two tubal tonsils as part of Waldeyer's tonsillar ring.

  - context: |
      The palatine tonsils tend to reach their largest size in [puberty](puberty
      "wikilink"), and they gradually undergo [atrophy](atrophy "wikilink")
      thereafter. However, they are largest relative to the diameter of the
      throat in young children. In adults, each palatine tonsil normally
      measures up to 2.5 cm in length, 2.0 cm in width and 1.2 cm in
      thickness.[2]

    questions_and_answers:
      - question: When do the palatine tonsils tend to reach their largest size?
        answer: The palatine tonsils tend to reach their largest size in puberty.
      - question: What are the typical dimensions of an adult palatine tonsil?
        answer: |
          In adults, each palatine tonsil normally measures up to 2.5 cm in
          length, 2.0 cm in width, and 1.2 cm in thickness.
      - question: How do the palatine tonsils change in size with age?
        answer: |
          The palatine tonsils tend to gradually undergo atrophy after puberty,
          becoming smaller in size compared to their dimensions in young
          children.

  - context: |
      The tonsils are immunocompetent organs that serve as the immune system's
      first line of defense against ingested or inhaled foreign pathogens, and
      as such frequently engorge with blood to assist in immune responses to
      common illnesses such as the common cold. The tonsils have on their
      surface specialized antigen capture cells called [microfold
      cells](microfold_cell "wikilink") (M cells) that allow for the uptake of
      antigens produced by pathogens. These M cells then alert the B cells and T
      cells in the tonsil that a pathogen is present and an immune response is
      stimulated.[3] B cells are activated and proliferate in areas called
      germinal centers in the tonsil. These germinal centers are places where B
      memory cells are created and [secretory antibody (IgA)](Immunoglobulin_A
      "wikilink") is produced.

    questions_and_answers:
      - question: |
          What are the specialized antigen capture cells on the surface of the
          tonsils called?
        answer: |
          The specialized antigen capture cells on the surface of the tonsils
          are called microfold cells (M cells).
      - question: What is the role of microfold cells in the tonsils?
        answer: |
          Microfold cells (M cells) allow for the uptake of antigens produced by
          pathogens. They alert the B cells and T cells in the tonsil that a
          pathogen is present, stimulating an immune response.
      - question: Where do B cells proliferate in the tonsils?
        answer: B cells proliferate in areas called germinal centers in the tonsils.

  - context: |
      A [tonsillolith](tonsillolith "wikilink") (also known as a "tonsil stone")
      is material that accumulates on the palatine tonsil. This can reach the
      size of a [peppercorn](peppercorn "wikilink") and is white or cream in
      color. The main substance is mostly [calcium](calcium "wikilink"), but it
      has a strong unpleasant odor because of [hydrogen
      sulfide](hydrogen_sulfide "wikilink") and [methyl
      mercaptan](methyl_mercaptan "wikilink") and other chemicals.[6]

    questions_and_answers:
      - question: What is a tonsillolith?
        answer: |
          A tonsillolith (tonsil stone) is material that accumulates on the
          palatine tonsil, reaching the size of a peppercorn and having a white
          or cream color. It contains calcium and has a strong unpleasant odor
          due to hydrogen sulfide, methyl mercaptan, and other chemicals.
      - question: What is the main substance found in a tonsillolith?
        answer: The main substance found in a tonsillolith is mostly calcium.
      - question: Why do tonsilloliths have a strong unpleasant odor?
        answer: |
          Tonsilloliths have a strong unpleasant odor due to hydrogen sulfide,
          methyl mercaptan, and other chemicals.

document_outline: |
  Overview of Human tonsils, describing their types, locations, structure,
  function, and clinical significance, with a specific focus on their role in
  the immune system and related health issues.

document:
  repo: https://github.com/luke-inglis/il-anatomy-knowledge
  commit: cc7c6ca
  patterns:
    - anatomy1.md

Before we create some new data, let's check that this taxonomy is valid:

$ ilab taxonomy diff
knowledge/health/anatomy_tonsil/qna.yaml
Taxonomy in /home/user/.local/share/instructlab/taxonomy is valid :)

Ok! Now we can run data generation and verify with the task manager that the GPU is being used for InstructLab:

ilab data generate --gpus 1

gpu-verify

Assuming all goes well, you should see that InstructLab has created synthetic data for this use case:

INFO 2024-09-21 23:09:49,954 instructlab.sdg.datamixing:123: Loading dataset from /home/user/.local/share/instructlab/datasets/node_datasets_2024-09-21T23_04_44/knowledge_health_anatomy_tonsil_p10.jsonl 
...
INFO 2024-09-21 23:09:50,352 instructlab.sdg.datamixing:200: Mixed Dataset saved to /home/user/.local/share/instructlab/datasets/skills_train_msgs_2024-09-21T23_04_44.jsonl
...
INFO 2024-09-21 23:09:50,352 instructlab.sdg:438: Generation took 307.03s

You can check the samples that have been created by looking at the jsonl files referenced in the InstructLab output:

$ head -n 1 /home/user/.local/share/instructlab/datasets/skills_train_msgs_2024-09-21T23_04_44.jsonl
{"messages":[{"content":"I am, Red Hat\u00ae Instruct Model based on Granite 7B, an AI language model developed by Red Hat and IBM Research, based on the Granite-7b-base language model. My primary function is to be a chat assistant.","role":"system"},{"content":"Document:\n## Clinical significance\n\n![Gross pathology of fresh hypertrophic tonsil. Top left: Surface facing\nthe into the aerodigestive tract. Top right: Opposite surface\n(cauterized). Bottom: Cut\nsections.](Gross_pathology_of_tonsil.jpg \"Gross pathology of fresh hypertrophic tonsil. 
...
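The raw JSONL is hard on the eyes; if you'd like a friendlier view, you can pretty-print a sample (a sketch, assuming you're happy to install jq):

sudo dnf install jq
head -n 1 /home/user/.local/share/instructlab/datasets/skills_train_msgs_2024-09-21T23_04_44.jsonl | jq .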

Success! We've been able to accelerate InstructLab using NVIDIA CUDA on WSL, and data generation will take significantly less time.

Training a model

Training a model is where things start to get really interesting with WSL. Training is memory intensive, and usually requires that the model and training data fit in GPU memory (VRAM) - in this case, at least 17GB of free VRAM.

I don't have a GPU with 17GB of free VRAM. But on the Windows Subsystem for Linux (WSL), NVIDIA CUDA can access both the GPU's VRAM and host memory via Unified Shared Memory (USM). How great is that!

I'm not going to go too deep into USM in this article; you can find an excellent guide here from Joel Joseph. It does mean I take a performance hit on training - but I can deal with that if it means not purchasing a 24GB GPU.

To train this new model we can run ilab model train, telling it to use the CUDA device:

ilab model train --device 'cuda' --gpus 1 --model-path instructlab/granite-7b-lab --data-path ~/.local/share/instructlab/datasets/ --legacy

You should see that the model training kicks off using the CUDA backend.

INFO 2024-09-22 09:35:01,343 numexpr.utils:161: NumExpr defaulting to 16 threads.
INFO 2024-09-22 09:35:01,562 datasets:54: PyTorch version 2.3.1 available.
LINUX_TRAIN.PY: NUM EPOCHS IS:  10
LINUX_TRAIN.PY: TRAIN FILE IS:  /home/user/.local/share/instructlab/datasets/train_gen.jsonl
LINUX_TRAIN.PY: TEST FILE IS:  /home/user/.local/share/instructlab/datasets/test_gen.jsonl
LINUX_TRAIN.PY: Using device 'cuda:0'
  NVidia CUDA version: 12.1
  AMD ROCm HIP version: n/a
  cuda:0 is 'NVIDIA GeForce RTX 3070' (6.9 GiB of 8.0 GiB free, capability: 8.6)

Once the model is loaded, you can see in the Windows Task Manager that we're using a combination of GPU VRAM and host RAM ('Shared GPU memory') to support the model training.

shared-memory
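The Task Manager's 'Shared GPU memory' graph is the clearest view of the spill-over into host RAM, but you can also keep an eye on GPU memory from a second WSL terminal while training runs (a sketch; some fields report N/A under WSL 2):

# Poll GPU memory usage every 5 seconds during training
watch -n 5 nvidia-smi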

Once your model training finishes you should see a message like this:

[276/291] Writing tensor blk.30.ffn_up.weight                   | size  11008 x   4096  | type F16  | T+  61
[277/291] Writing tensor blk.30.ffn_norm.weight                 | size   4096           | type F32  | T+  61
[278/291] Writing tensor blk.30.attn_k.weight                   | size   4096 x   4096  | type F16  | T+  61
[279/291] Writing tensor blk.30.attn_output.weight              | size   4096 x   4096  | type F16  | T+  61
[280/291] Writing tensor blk.30.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  62
[281/291] Writing tensor blk.30.attn_v.weight                   | size   4096 x   4096  | type F16  | T+  64
[282/291] Writing tensor blk.31.attn_norm.weight                | size   4096           | type F32  | T+  64
[283/291] Writing tensor blk.31.ffn_down.weight                 | size   4096 x  11008  | type F16  | T+  65
[284/291] Writing tensor blk.31.ffn_gate.weight                 | size  11008 x   4096  | type F16  | T+  65
[285/291] Writing tensor blk.31.ffn_up.weight                   | size  11008 x   4096  | type F16  | T+  65
[286/291] Writing tensor blk.31.ffn_norm.weight                 | size   4096           | type F32  | T+  65
[287/291] Writing tensor blk.31.attn_k.weight                   | size   4096 x   4096  | type F16  | T+  65
[288/291] Writing tensor blk.31.attn_output.weight              | size   4096 x   4096  | type F16  | T+  65
[289/291] Writing tensor blk.31.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  65
[290/291] Writing tensor blk.31.attn_v.weight                   | size   4096 x   4096  | type F16  | T+  65
[291/291] Writing tensor output_norm.weight                     | size   4096           | type F32  | T+  65
Wrote training_results/final/ggml-model-f16.gguf
Save trained model to /home/user/.local/share/instructlab/checkpoints/ggml-model-f16.gguf

Chatting with the new model

Now that our new model is created, let's try it out! Open the ~/.config/instructlab/config.yaml file and update the model_path field under serve:

serve:
  backend: llama-cpp
  chat_template: null
  host_port: 127.0.0.1:8000
  llama_cpp:
    gpu_layers: 35
    llm_family: ''
    max_ctx_size: 4096
  model_path: /home/user/.local/share/instructlab/checkpoints/ggml-model-f16.gguf
  #model_path: /home/user/.cache/instructlab/models/merlinite-7b-lab-Q4_K_M.gguf

Now you can try chatting with the new model:

(venv) [user@HelixWin ~]$ ilab model chat -m /home/user/.local/share/instructlab/checkpoints/ggml-model-f16.gguf
INFO 2024-09-22 19:29:23,472 instructlab.model.backends.llama_cpp:104: Trying to connect to model server at http://127.0.0.1:8000/v1
╭──────────────────────────────────────────────────────────── system ─────────────────────────────────────────────────────────────╮
│ Welcome to InstructLab Chat w/ GGML-MODEL-F16.GGUF (type /h for help)                                                           │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
>>> What is a tonsillolith?                                                                                            [S][default]
╭────────────────────────────────────────────────────── ggml-model-f16.gguf ──────────────────────────────────────────────────────╮
│                                                                                                                                │
│ Answer: A tonsillolith (tonsil stone) is material that accumulates on the palatine tonsils, reaching the size of a peppercorn   │
│ and having a white or cream color. It can cause pain when eating or talking and may indicate infection with pathogens such as   │
│ H. influenzae or Streptococcus pneumoniae. Tonsilloliths are common in adolescence and young adulthood, with risk factors       │
│ including unhealthy diet, smoking, and exposure to secondary smoke. They can be removed by a healthcare professional or may     │
│ fall off; if they block the airway, surgical removal may be necessary.                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────── elapsed 92.676 seconds ─╯

You can see that the response from the model combines the new knowledge we provided with existing knowledge already embedded in the model about management and surgical intervention (which wasn't referenced in the training data):

- context: |
      A [tonsillolith](tonsillolith "wikilink") (also known as a "tonsil stone")
      is material that accumulates on the palatine tonsil. This can reach the
      size of a [peppercorn](peppercorn "wikilink") and is white or cream in
      color. The main substance is mostly [calcium](calcium "wikilink"), but it
      has a strong unpleasant odor because of [hydrogen
      sulfide](hydrogen_sulfide "wikilink") and [methyl
      mercaptan](methyl_mercaptan "wikilink") and other chemicals.[6]

    questions_and_answers:
      - question: What is a tonsillolith?
        answer: |
          A tonsillolith (tonsil stone) is material that accumulates on the
          palatine tonsil, reaching the size of a peppercorn and having a white
          or cream color. It contains calcium and has a strong unpleasant odor
          due to hydrogen sulfide, methyl mercaptan, and other chemicals.
      - question: What is the main substance found in a tonsillolith?
        answer: The main substance found in a tonsillolith is mostly calcium.
      - question: Why do tonsilloliths have a strong unpleasant odor?
        answer: |
          Tonsilloliths have a strong unpleasant odor due to hydrogen sulfide,
          methyl mercaptan, and other chemicals.

Wrapping up

In this article I looked at InstructLab, a new open source framework for adding knowledge and skills to an existing LLM. I showed how you can enable GPU support for InstructLab within the Windows Subsystem for Linux (WSL), and how to train and chat with a new model created with InstructLab.

Using InstructLab with the Windows Subsystem for Linux is a great way to get up and running and try it out on a "gaming rig" or any other desktop. InstructLab is also the foundation for Red Hat Enterprise Linux AI (RHEL AI), and in future articles I'll look at using RHEL AI and OpenShift AI to create and serve domain-specific AI models. Stay tuned!