LLM Strategy & Product Design In Plain English

For Executives, Investors & Entrepreneurs

Daniel Sexton
DataDrivenInvestor


Image: DALL·E 3, by the author

You do not need to be technical to design a defensible LLM product or to understand how LLMs work. This article describes all you need to know to build competitive products and a viable investment thesis.

Where Are LLMs Headed?

Some of the smartest people on earth sit around all day and half the night thinking about potential business cases for LLMs. They haven’t produced that much yet. LLMs are at the Pong stage.

source: https://www.wired.com/story/inside-story-of-pong-excerpt/

Part of this is because, while large models such as ChatGPT are impressive, there aren’t that many business cases. They behave like extremely courteous, enthusiastic, well-educated interns who are highly ethical unless you tell them not to be, whereas in most situations you need an accountable professional with expertise in a specific task.

I take that back: they can write code. I have used ChatGPT to, ironically, write a chatbot that uses ChatGPT as a model. It wrote almost all of the code in React, a framework I don’t write in. I served as a liaison, constantly sending code back for corrections; I felt like a project manager. Pretty amazing. Beyond coding, you can use LLMs to brainstorm ideas, write essays for philosophy class, create spurious recipes, summarize a financial statement, and do all kinds of things that few people appear to be willing to pay for.

Another reason the smart folks haven’t come up with much is that it’s unclear which direction LLMs will go.

Will LLMs become a few broad general-purpose applications built on massive data sets — like ChatGPT or Claude, but with capabilities that can address almost any problem accurately? At what point does adding more parameters, larger and better datasets, and more computing power hit diminishing marginal returns, or no returns at all?

Or will smaller models that have been finely tuned to a specific business case be more effective? The latter implies a market of hundreds or thousands of specific applications built on smaller models tailored to particular business scenarios.

For example, suppose you’re the CEO of the fictional FinWiseMax.ai, which you will be later in this article, and you collect lots of proprietary data and fine-tune a small model to build an app that helps consumers purchase Indexed Annuities. And then OpenAI releases a new huge general model that solves this problem better. Whoops!

Another consideration: AI agents, sometimes referred to as “read/write AI” (as opposed to “read-only” chatbots), are pitched as capable of autonomously completing entire tasks across workflows. Will these agents be huge models that evolve to solve nearly any problem, or will they use finely tuned smaller models focused on specific tasks and business cases?

In general, large models perform better, yet a more balanced approach has recently brought smaller models, such as Meta’s recently released 8-billion-parameter Llama 3, back into the picture. In 2022, DeepMind released what’s become known as the Chinchilla paper, which showed that a smaller model trained on more data can match or beat much larger models by optimizing the balance between dataset size, model size, and compute. Generally, however, adding more data, computing power, and parameters improves overall performance, which seems to have led some to propose that building larger and larger models will eventually lead to models that can solve basically any problem (AGI).

I’m betting on a combination of the two surviving, one that favors smaller LLMs for specific use cases. That’s partly because every time anyone creates a new acronym or stupid name, such as Llama, or new research or business knowledge appears, you have to retrain your model or it’s outdated. And it’s partly because I don’t have 7 trillion dollars to create sentient AI — yet. I’ll have to hope that companies I back or create get acquired if AGI becomes a reality.

I, for one, welcome our new AGI overlords.

source: https://www.youtube.com/watch?v=W4jWAwUb63c

Also, I tend to think of generative AI as an enabling layer rather than a product a user interacts with directly, such as a chatbot. An enabling layer is a technology that sits beneath other software and lets users work more efficiently; operating systems (Windows, macOS, Android, iOS, Linux, etc.) are one example.

Another example of an enabling layer is databases. Databases, like all enabling layers, have lifecycles: they evolve and mature, eventually becoming commodities. This matters for your investment thesis. There are dozens of models and components all evolving at once, and it helps to consider where they will be a few years out.

source: author

The concept behind databases evolved something like this:

Evolution of Data Storage: Cuneiform/Sumerians to store transactions, receipts (3500 BCE) → … skip some stuff… → Card catalogs in libraries (1860s) → Rolodexes (1950s) → flat files/text files/COBOL (1950s–1960s) → Hierarchical databases (1960s–1970s) → Relational databases (1980s–1990s) → NoSQL databases (2010s) → NewSQL (today)

Over time, we gained the ability to manipulate data in more and more sophisticated ways. Originally, writing on clay tablets enabled humans to conduct business without people having to remember everything. Databases solve this original problem; they’re just a lot more sophisticated. Today, databases typically do not involve a user interacting directly with a database table; instead, they sit underneath that interface and enable transactions.

Similarly, generative AI gives us a more advanced ability to manipulate structured and unstructured data. As an example, consider the evolution of image manipulation by humans (picking things at random here so this is rough):

Evolution of Image Manipulation: Chauvet Cave drawings (30,000 BCE) → Egyptian Hieroglyphs (3200 BCE) → Medieval Paintings (500s) → Photographs (1826) → Halftone Printing (1880s) → Photoshop (1988) → DALL·E (today)

The form changes but the essence persists.

In the same way that we don’t think of Walmart as a database company, or of your phone as “Picture Element” (pixel) technology, in the future we won’t think of a company as AI-driven; AI will simply be assumed to be in the value chain.

How LLMs Work

Unlike traditional software, which is deterministic, LLMs and generative AI are probabilistic: they use statistics to generate results. An LLM is actually a statistical word calculator, which means there are no ‘if x, then y’ statements. So there’s no way to fully trace or dictate how the model responds to a given input. You can guide and suggest, but you cannot force.

As an example, search engines probably seem probabilistic to most people (they aren’t) because the user types in a search phrase and a complex process of schemas and tokenization produces results whose logic humans can’t follow. I once wrote a search engine API using Solr that served 9 million SKUs across countless categories. The CIO never could wrap her head around the fact that you could not dictate search results; you could only suggest them to the engine, to the dismay of many suppliers.

But LLMs are truly probabilistic. This means that finding and defining business cases is tricky. If your chatbot decides to tell a user that stocks always go up or that a stroke is a headache, you and the user could both be in trouble. The results need to be useful, accurate, and legal.

Let’s use ChatGPT as an example to explain the probabilistic nature of LLMs. As you probably know, ChatGPT works like this:

  1. You enter text called a prompt (passed into a context window that can include the last several prompts)
  2. It is sent to the model for ‘inference’
  3. It responds with text (completion)

So what’s going on in this black box?

As I mentioned before, GPT is a statistical word calculator. It predicts the next most likely word in a sequence based on the data it was trained on. So you enter a prompt, and the model produces the next word, and then the next, and so on. It doesn’t ‘think through’ the answer (yet).

  • Prompt: Respond with one word. The cat in the
  • Inference: hums softly… calculating…
  • Completion: Hat!

‘Hat’ is the response not because it makes logical sense in itself, but because the model was trained on data that included references to the Dr. Seuss book, so ‘hat’ is statistically the most likely next word.
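
If you want to see this for yourself, here is a minimal sketch that asks the small open GPT-2 model for its most likely next tokens via Hugging Face’s transformers library (the model choice is illustrative, and whether ‘hat’ actually tops the list depends on the model):

```python
# Minimal sketch: ask a small open model (GPT-2) for its most likely
# next tokens. Requires: pip install transformers torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The cat in the", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # a score for every token in the vocabulary

# Probabilities for the *next* token only, then the top 5 candidates.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, 5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx)!r}: {p:.1%}")
```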

The prompt is first turned into tokens (words or parts of words), and the tokens are then mapped to numbers to speed up processing. Each token is weighted against every other token in the prompt, which yields an Attention Map. This map gives the model the ability to focus on the words that matter most. It looks something like this:

source: https://www.comet.com/site/blog/explainable-ai-for-transformers/

Since words can be used in lots of contexts, each word is also associated with a vector of weights — a list of numbers: e.g. Robot: (0.25, 0.54, 0.77…). The weights represent different meanings and contexts. (In a real model the individual numbers don’t map this cleanly to human concepts, but as an intuition.) For example, the word “robot” may have these weights:

  • Weight 1 (0.25): Relevance to industrial automation, manufacturing processes, and assembly lines.
  • Weight 2 (0.54): Relevance to science fiction, futuristic technology, artificial intelligence, and humanoid machines.
  • Weight 3 (0.77): Relevance to domestic robotics related to smart homes, household appliances, and autonomous devices.
  • Weight n (…): And so on. The Google paper (“Attention Is All You Need”) refers to 512 weights, but modern models often use far more.

Weights give context and depth to the words in the prompt so the intended meanings can be built into an Attention Map that is passed into the model for inference. The tokens in the trained model itself also carry weights. All of this is crunched mathematically, using what may best be described as brute force, and the result is the output/completion.
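
For the curious, here is a rough sketch of the tokenize, embed, and attend pipeline using the open BERT model through Hugging Face’s transformers library (the example sentence and the exact shapes printed are illustrative):

```python
# Rough sketch of tokenize -> embed -> attend, using the open BERT model.
# Requires: pip install transformers torch
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

tokens = tokenizer("The robot assembled the car", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(tokens["input_ids"][0]))
# e.g. ['[CLS]', 'the', 'robot', 'assembled', 'the', 'car', '[SEP]']

with torch.no_grad():
    out = model(**tokens)

# Each token is now a vector of weights (BERT uses 768 dimensions) ...
print(out.last_hidden_state.shape)  # e.g. torch.Size([1, 7, 768])

# ... and each attention head holds a token-vs-token weight matrix,
# the raw material of attention maps like the one pictured above.
print(out.attentions[0].shape)      # e.g. torch.Size([1, 12, 7, 7])
```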

Right now is an excellent opportunity to discover whether you think more with the left or right hemisphere of your brain. If you feel compelled to dig deeper into the mechanics of LLMs, then you are left-hemisphered and you should have been an engineer (or already are one).

If you are a normal person, however, your brain will have no further interest in LLM mechanics, and will never seek more information because this is already too much.

Designing Your LLM Product

For the next few minutes, you will be the CEO of FinWiseMax.ai. You have raised $750k pre-seed (great job!) to build financial advisory AI software. Now you have to get a viable product to market quickly.

It turns out that when it came to drumming up funds, you played fast and loose with the truth. You told investors that you and your CTO have extensive experience building LLM models and that your proof-of-concept web app/chatbot accesses a model you trained and answers questions potential customers may have regarding how Indexed Annuities work, thus increasing sales for financial services companies.

While you do have some valuable experience, the proof of concept is essentially a Wizard of Oz setup: a JavaScript app that calls ChatGPT through an API, sending in prompts and embeddings that tell it to act like a financial advisor. You did download some open-source models and tinker with them, but you don’t have the data you need to fine-tune them yet.
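
The entire “model” behind the proof of concept amounts to something like the sketch below, shown in Python rather than JavaScript for brevity; the system prompt text and model name are illustrative:

```python
# Wizard of Oz sketch: no model of your own, just OpenAI's API plus a
# system prompt telling it to play the part. Requires: pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a financial advisor specializing in Indexed Annuities. "
    "Answer customer questions clearly and conservatively."
)

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; use whatever model you prefer
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(ask("How does the participation rate on an indexed annuity work?"))
```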

This is part of the game, though, right? After all, Amazon Fresh stores used Just Walk Out technology, which ostensibly used deep-learning image recognition to check people out without cashiers but reportedly relied on more than 1,000 people in India watching and labeling videos.

You did consider several API options, including:

  1. OpenAI GPT API
  2. Hugging Face Transformer API
  3. Google Cloud AI Language API
  4. Microsoft Azure Text Analytics API

But you wrote the code before OpenAI announced the option of creating a custom GPT in their GPT Store, where you can add custom instructions and files on top of OpenAI’s model within your custom GPT.

You were working on the assumption that the best way to proceed in a startup is to build an MVP (“Minimum Viable Product”) and then iterate as you learn. But now you’re beginning to believe that traditional software development is like doing as many pushups as possible, whereas AI development is more like the 17th hole at TPC Sawgrass: one wrong move and all you can do is watch the ball drop into the water, as Sergio did (twice) in 2013 to lose to Tiger.

What you believe you need for your startup to be competitive is an open-source (more accurately, open-weights, since you get the trained weights but not the training data or code) pre-trained model that you can then fine-tune on specific tasks with proprietary data. This should make your app better than ChatGPT at text generation, sentiment analysis, and question answering regarding detailed annuity products.

You spend time on open-source hubs learning about various models, though you and your CTO disagree about whether to develop in TensorFlow or PyTorch. You favor TensorFlow because it’s easier for beginners to use, which should make cheap developers easier to find; your CTO favors PyTorch because he knows it and the researchers he worked with use it. The main hubs you browse:

  • Hugging Face’s Model Hub — the go-to repository for open LLMs
  • TensorFlow Hub — a repository of reusable machine learning modules that includes a variety of pre-trained models, including LLMs, that are compatible with TensorFlow
  • PyTorch Hub — similar to TensorFlow Hub but tailored for the PyTorch deep learning framework. It hosts pre-trained models, including LLMs, that can be integrated into PyTorch-based applications
  • GitHub — widely used platform for hosting code repositories. Many researchers and developers share their LLM models here

There are three types of models to choose from. You plan on using a combination of these in various application layers, but you’re not sure which models yet because you need to see what type of data you can procure first. Your options are:

Encoder-Only Models, such as BERT, RoBERTa, ALBERT, and ELECTRA, primarily focus on sentiment analysis, named entity recognition, and word classification. These models are good at reconstructing corrupted or compressed text and are often used for cleaning data. For instance, they might repair incomplete or noisy financial reports by filling in missing information or correcting errors. They can also predict masked tokens in financial texts, which is how they learn to represent meaning.

Decoder-Only Models, including GPT and BLOOM, generate text sequentially. They are used for text generation, language translation, and dialogue generation. These models can generate financial reports, create personalized investment insights, and manage customer communications (e.g., chatbots).

Encoder-Decoder Models like BART, MarianMT, T5, and ProphetNet are used for language translation, text summarization, and question-answering. They can translate documents across languages, condense extensive financial reports into summaries, and respond to intricate inquiries about financial products or regulations. They are good at managing variable-length input and output sequences, which makes them pretty adaptable. Span corruption techniques enhance their ability to train on and reconstruct critical information from financial texts, even in the presence of noise or gaps.
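
To make the taxonomy concrete, here is a quick tour of all three architectures using Hugging Face pipelines; the models named are small open ones chosen purely for illustration, not recommendations:

```python
# One example per architecture, via Hugging Face pipelines.
# Requires: pip install transformers torch
from transformers import pipeline

# Encoder-only (BERT): predict a masked token in a sentence.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("An indexed annuity protects your [MASK].")[0]["token_str"])

# Decoder-only (GPT-2): generate text left to right.
generate = pipeline("text-generation", model="gpt2")
print(generate("An indexed annuity is", max_new_tokens=20)[0]["generated_text"])

# Encoder-decoder (T5): summarize a longer passage.
summarize = pipeline("summarization", model="t5-small")
text = ("Indexed annuities credit interest based on the performance of a "
        "market index, subject to caps and participation rates, while "
        "guaranteeing the principal against losses.")
print(summarize(text, max_length=30, min_length=5)[0]["summary_text"])
```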

Your goal is to get to the point (later when you’re raising your next round) where you can Train Your Own Model.

The open-source (open-weights) models you have been considering are pre-trained. But you would like to procure your own defensible data sets that competitors can’t build and fine-tune a smaller proprietary model with them. Training a domain-specific model should produce better results and be more cost-efficient because your financial services data is highly specialized to the products. You need the model to capture industry nuances and vocabulary that massive pre-trained models miss. Once you have control of your own data, you can address biases, privacy, and regulations/compliance that will bog down your competitors especially once AI regulations begin to hit. Also, as you scale, data collection and computation will be cheaper, and there will be no licensing fees.

Tailoring Your Open Source/Weights Model

For now, however, you’re going to use an open-source model, not your own. You have located an initial financial services data set that will help you get started. Meta just launched a new version of Llama. It’s getting so much hype that you ignore all the information above regarding types of models and decide to use Llama.

(The use of prompt engineering is implied here but writing down the process would be like an early “How To Use Yahoo and Netscape on the Information Superhighway” guide. So it’s not covered.)

Next, you need to fine-tune your open-source model to become an expert in Indexed Annuities. Unlike pre-training, which involves training models on huge amounts of general, unstructured data via self-supervised learning, fine-tuning uses supervised learning on much smaller datasets of labeled prompt-completion pairs. You can use pre-fabricated datasets for this or create your own. They look something like this:

“This is an annuity review. The product is [annuity product].” → “This is an excellent product for people over 60 who are conservative and wish to guarantee income in retirement.”
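
On disk, such a dataset is often just a JSONL file of prompt-completion pairs. A sketch with invented records (the product “AcmeIndex 7” is made up):

```python
# Sketch of a supervised fine-tuning dataset: labeled prompt-completion
# pairs, one JSON object per line (JSONL). All records are invented.
import json

examples = [
    {
        "prompt": "This is an annuity review. The product is AcmeIndex 7.",
        "completion": "This is an excellent product for people over 60 who "
                      "are conservative and wish to guarantee income in "
                      "retirement.",
    },
    {
        "prompt": "Explain the cap rate on AcmeIndex 7.",
        "completion": "The cap rate is the maximum index-linked interest "
                      "the annuity will credit in a given term.",
    },
]

with open("finetune_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```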

Since you can’t find pertinent fine-tuning data, you hire interns to make it up for you. Being 22, they know nothing about annuities, of course, but you’ve got to get moving on the first release.

And now you’re ready to begin fine-tuning the model to see how much it improves.

You now discover that fine-tuning is a huge, complex topic. Some options include full model fine-tuning, layer-specific fine-tuning, learning rate scheduling, gradual unfreezing, prompt tuning, adding adapter modules, and regularization techniques. Wow! This stuff is getting complicated.

To further complicate things, your first attempt at fine-tuning caused catastrophic forgetting. Suddenly your model has forgotten a bunch of important things it knew before you fine-tuned it. This occurred because you adjusted the weights too aggressively to fit the dataset the interns created.
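
One blunt way to reduce the risk is layer-specific fine-tuning: freeze most of the network so the bulk of the pre-trained weights cannot drift, and train only the top layers. A sketch against GPT-2’s layer names (adjust the names for whatever model you actually use):

```python
# Mitigation sketch: freeze everything, then unfreeze only the last two
# transformer blocks and the final layer norm before fine-tuning.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Freeze all parameters by default.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze only the top of the network (GPT-2 has blocks h.0 .. h.11).
for name, param in model.named_parameters():
    if name.startswith(("transformer.h.10.", "transformer.h.11.",
                        "transformer.ln_f")):
        param.requires_grad = True
```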

This is too much, so you start playing around with fine-tuning products and apps such as AutoTrain, Axolotl, and LLaMA Factory. You decide on Unsloth, which uses LoRA (Low-Rank Adaptation) adapters: small components added to the model that allow efficient fine-tuning on specific tasks without updating the entire model’s parameters, striking a balance between performance and computational efficiency.
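
Unsloth’s internals are beyond this article, but the LoRA idea itself can be sketched with Hugging Face’s peft library; the hyperparameters below are illustrative, not tuned:

```python
# LoRA sketch using the peft library (Unsloth wraps the same concept
# with additional speed optimizations). Requires: pip install peft transformers
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                        # rank of the small adapter matrices
    lora_alpha=16,              # scaling factor for adapter updates
    target_modules=["c_attn"],  # GPT-2's attention projection layer
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Only a fraction of a percent of the weights train; the base model
# stays frozen, which also helps against catastrophic forgetting.
```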

You successfully fine-tune the initial open-source model and it seems to produce decent results. Now you need to evaluate it.

Standard metrics like ROUGE-1 are available to assess the effectiveness of your model, particularly for automatic summarization tasks. But you don’t have time for that. You need to focus on what customers think about the results.
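
(For the record, had you found the time, scoring with ROUGE is only a few lines using Hugging Face’s evaluate library; the prediction and reference strings below are invented:)

```python
# ROUGE scoring sketch. Requires: pip install evaluate rouge_score
import evaluate

rouge = evaluate.load("rouge")

predictions = ["The annuity guarantees principal with capped index gains."]
references = ["The annuity protects principal and caps index-linked gains."]

scores = rouge.compute(predictions=predictions, references=references)
print(scores["rouge1"])  # unigram overlap, between 0 and 1
```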

To assess customer feedback, you deploy the model in a test environment where select users can interact with it. You set up a simple feedback system to gather user impressions directly related to the model’s performance on tasks involving Indexed Annuities.
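
That feedback system can start out embarrassingly simple, e.g. an endpoint that logs each rated completion for later review. A minimal sketch using FastAPI (the endpoint name and fields are illustrative):

```python
# Minimal feedback logger: record every rated completion to a JSONL file.
# Requires: pip install fastapi uvicorn (uses pydantic v2)
import json
from datetime import datetime, timezone

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Feedback(BaseModel):
    prompt: str
    completion: str
    thumbs_up: bool

@app.post("/feedback")
def record_feedback(fb: Feedback):
    entry = fb.model_dump() | {"ts": datetime.now(timezone.utc).isoformat()}
    with open("feedback.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")
    return {"status": "recorded"}
```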

Perfect. Now you have a way to gather feedback on the model. The first product cycle is almost complete!

You email your investor and tell him the good news.

He responds immediately saying he is sitting with a friend who worked at the Securities and Exchange Commission for 20 years and is an expert in financial services law and wants to test the product right now.

You are screwed.

Welcome to the startup world!

I hope you enjoyed this article! Can I help with your LLM product or strategy? Connect with me on LinkedIn. For articles that will free your creativity and intuition, please join me at RightBrainCapitalist.
