A friendly intro to Large Language Models

Diving into the whats and whys of LLMs

October 19, 2023

You've likely come across LLMs, or if not, you've almost certainly utilized them without realizing it. But what exactly are LLMs? How do they work? What tasks can they be useful for? Are there any risks in using them in production? Put simply, LLMs serve as the foundation for conversational AI systems that are currently reshaping the world of data analytics.

What are LLMs?

LLMs, or Large Language Models, are a class of deep neural networks that have gained immense traction in recent years thanks to their proficiency in understanding and generating natural language and their capacity to simulate human conversational behavior. These abilities make LLMs very versatile and easy for anyone to engage with. Indeed, they perform impressively across a wide array of tasks, from casual chat to translation, summarization, and even more complex activities such as code generation and code review. Some of the best-known LLMs and LLM-based products include OpenAI’s ChatGPT, Meta’s LLaMA, and Google’s Bard.

LLMs are termed "large" because of their enormous number of trainable parameters (i.e., the internal adjustable settings the model learns in order to perform specific tasks accurately), which can range from hundreds of millions to trillions. For instance, OpenAI's GPT-3, one of the most renowned LLMs, has 175 billion parameters. This scale enables LLMs to capture intricate patterns in data and generate highly coherent and contextually relevant responses. In general, more trainable parameters go hand in hand with higher performance, hence the tendency for models to become larger and larger. See the evolution from language models to large language models in the graphic below.

Model size evolution in terms of trainable parameters in 2018-2023: From language models to large language models
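
To make these parameter counts more tangible, here is a rough back-of-the-envelope sketch. It assumes a standard decoder-only transformer in which each layer contributes about 12 × d_model² weights (attention plus feed-forward) and it ignores embeddings and biases. The layer count (96) and hidden size (12288) are the values published for GPT-3; the function name is hypothetical and only used for illustration.

```python
def approx_transformer_params(n_layers: int, d_model: int) -> int:
    """Rough parameter count for a decoder-only transformer.

    Assumption: each layer has about 4 * d_model^2 attention weights and
    about 8 * d_model^2 feed-forward weights; embeddings and biases are ignored.
    """
    return 12 * n_layers * d_model ** 2


# Published GPT-3 configuration: 96 layers, hidden size 12288.
gpt3_estimate = approx_transformer_params(n_layers=96, d_model=12288)
print(f"{gpt3_estimate / 1e9:.0f} billion parameters (approx.)")
```

The estimate lands close to the 175 billion quoted above, which shows that almost all of the "size" of an LLM sits in the repeated transformer layers.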

How do LLMs “think”?

Conceptually, LLMs function like highly sophisticated auto-completion systems, similar to the suggestions we receive while writing a text message on our smartphones. LLMs are trained to suggest the most likely next word or phrase based on extensive previous exposure to similar contexts. For example, if we type “the plane flies in the …”, the most probable follow-up would be “sky”. This is because the model was trained on millions of texts and learned that “sky” is the most likely next token. This proficiency enables an LLM-based AI tool to provide the most probable answer to a question, with reliability contingent on the breadth and quality of the training data.

A simplified example of how a model returns the output. Based on a series of contextual tokens, the model outputs the next token with the highest probability. The last output token and the previous tokens are further employed as context information to predict the next token autoregressively.
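
The auto-completion idea can be sketched with a toy model that simply counts which word most often follows a two-word context. This is a deliberately tiny stand-in, not how a real LLM is built: the corpus, the two-word context window, and the function names are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Tiny toy corpus; a real LLM is trained on vastly more text.
corpus = [
    "the plane flies in the sky",
    "the bird flies in the sky",
    "the kite flies in the sky",
    "the fish swims in the sea",
]

# Count which word follows each two-word context.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(len(words) - 2):
        context = (words[i], words[i + 1])
        next_word_counts[context][words[i + 2]] += 1


def predict_next(words):
    """Return the most frequent next word for the last two words, or None."""
    counts = next_word_counts.get(tuple(words[-2:]))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def generate(prompt, max_new_words=5):
    """Greedy autoregressive generation: each prediction joins the context."""
    words = prompt.split()
    for _ in range(max_new_words):
        nxt = predict_next(words)
        if nxt is None:
            break
        words.append(nxt)
    return " ".join(words)


print(predict_next("the plane flies in the".split()))
print(generate("the plane flies"))
```

Because "sky" follows "in the" three times in the corpus and "sea" only once, the model picks "sky". The generate function then feeds every predicted word back in as context, which is the autoregressive loop described in the figure.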

It is, however, crucial to note that these types of technologies possess only syntactic knowledge and lack semantic understanding. Even though they are called “intelligent”, they are not really aware of the meaning behind what they are saying. For instance, when an AI correctly answers "1+2" with "3", this does not imply a true comprehension of arithmetic; it only means that “3” was the most probable answer. For the same reason, AIs are usually not capable of tackling more complex calculations unless they are enhanced with ad-hoc external extensions for math operations, which detect these particular types of questions and use an actual calculator to generate the response.
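
As a minimal sketch of such an extension, the snippet below routes questions that consist purely of arithmetic to a small, safe calculator and passes everything else to the model. The function names, the regular expression, and the placeholder llm callable are assumptions for illustration; production systems use more elaborate detection.

```python
import ast
import operator
import re

# Supported arithmetic operators for the safe evaluator.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

# Only digits, whitespace, basic operators, parentheses and decimal points.
_ARITHMETIC = re.compile(r"[\d\s+\-*/().]+")


def _evaluate(node):
    """Recursively evaluate a parsed arithmetic expression."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    raise ValueError("unsupported expression")


def answer(question, llm=lambda q: "(model-generated text)"):
    """Route pure arithmetic to a calculator, everything else to the model."""
    text = question.strip()
    if _ARITHMETIC.fullmatch(text) and any(ch.isdigit() for ch in text):
        try:
            return str(_evaluate(ast.parse(text, mode="eval").body))
        except (SyntaxError, ValueError, ZeroDivisionError):
            pass  # fall back to the model if the expression is malformed
    return llm(question)


print(answer("1+2"))
print(answer("12345 * 6789"))
print(answer("What is an LLM?"))
```

The calculator path computes the result exactly, whereas the model path would only return whatever continuation it considers most probable.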

Key factors behind the power of LLMs

As LLMs are rapidly taking over the world of analytics and will probably revolutionize various aspects of our lives and work, it is important to understand what makes them so powerful and what can influence their performance. The strength of LLMs emerges from three primary factors:

Factor 1: Data

These models owe their prowess to the very extensive ingestion of unlabeled data, encompassing a diverse range of texts, sources and documents. The model then applies self-supervised learning to generate its own labels or targets from the data. This process allows LLMs to learn and refine their linguistic abilities to a remarkable degree. For example, OpenAI's chatbot was honed on an astonishing 570 gigabytes of text sourced from a wide array of credible outlets, including academic papers and books. Furthermore, these models have the capacity to further enrich their knowledge through interaction with users of the service, ensuring a continuous evolution of their linguistic capabilities.
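
The following sketch illustrates how self-supervised learning can derive training labels from raw text without any human annotation: the "label" for each stretch of text is simply the word that comes next. The function name and the three-word context size are hypothetical choices for this example.

```python
def make_training_pairs(text, context_size=3):
    """Turn raw, unlabeled text into (context, target) training examples.

    The target for each example is the word that follows the context
    in the original text, so no human annotation is needed.
    """
    words = text.split()
    pairs = []
    for i in range(context_size, len(words)):
        context = tuple(words[i - context_size:i])
        target = words[i]
        pairs.append((context, target))
    return pairs


pairs = make_training_pairs("the plane flies in the sky")
for context, target in pairs:
    print(context, "->", target)
```

Every sentence in the training corpus yields many such examples for free, which is one reason why unlabeled text can be used at such a large scale.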

Factor 2: Architecture

The architectural complexity of LLMs is relevant to their ability to understand the intricate nature of human communication. LLMs build essentially on the foundation laid by transformers, a class of deep neural networks distinguished by their remarkable ability to capture complex relationships between words. Transformers achieve this through a mechanism known as self-attention. Self-attention allows the model to weigh the importance of each word in a sentence in relation to every other word.

Unlike earlier sequence models such as recurrent neural networks, which process words one after another, transformers can consider all words in a sentence simultaneously. This means they can discern nuanced connections, even if the words are far apart in the sentence. This fine-grained understanding of contextual relationships enables transformers to grasp subtle differences in meaning and produce more accurate and contextually relevant responses. This mapping capability through self-attention is a cornerstone of why transformers, and by extension LLMs, are so effective in tasks that require a deep understanding of natural language.
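
To give a feeling for what self-attention computes, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is simplified on purpose: queries, keys, and values are the input vectors themselves, whereas a real transformer learns separate projection matrices and uses several attention heads. The two-dimensional embeddings are made up for illustration.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections.

    Simplification: queries, keys and values are the input vectors
    themselves; a real transformer learns separate projection matrices.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this word to every word in the sentence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # New representation: weighted mix of all word vectors.
        mixed = [
            sum(w * value[d] for w, value in zip(weights, vectors))
            for d in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up 2-dimensional embeddings for a three-word sentence.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(embeddings)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row shows how strongly one word attends to every word in the sentence, regardless of how far apart the words are. All rows are computed independently of each other, which is what allows transformers to process a sentence in parallel.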

Factor 3: Training

The training phase is arguably the most critical component in the development of LLMs. It empowers these models to recognize and comprehend intricate patterns in data, leading to their capacity for generating reliable and contextually relevant responses. LLMs’ vast neural network architecture comprises hundreds of billions of parameters, which essentially represent the model's learned knowledge. During training, these parameters continually evolve and adapt to optimize the model's performance. One notable technique used during training is "autoregressive modeling," where the model learns to predict the next word in a sentence based on all the words that precede it. This method enables the LLM to capture the contextual dependencies within languages, contributing to its advanced understanding capabilities.
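
The way parameters "evolve" during training can be illustrated with a single next-word prediction and a few steps of gradient descent. This is a heavily simplified sketch: the three-word vocabulary is made up, and the trainable parameters are just one score per candidate word rather than the weights of a neural network.

```python
import math

vocab = ["sky", "sea", "road"]
target = "sky"  # the word that actually followed the context in the text

# Trainable parameters: one score (logit) per candidate next word.
logits = [0.0, 0.0, 0.0]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_gradient(logits, target_index):
    """Cross-entropy loss and its gradient with respect to the logits."""
    probs = softmax(logits)
    loss = -math.log(probs[target_index])
    grad = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    return loss, grad


learning_rate = 1.0
target_index = vocab.index(target)
losses = []
for step in range(5):
    loss, grad = loss_and_gradient(logits, target_index)
    losses.append(loss)
    # Gradient descent: nudge each parameter against its gradient.
    logits = [w - learning_rate * g for w, g in zip(logits, grad)]

print([round(l, 3) for l in losses])
print(dict(zip(vocab, (round(p, 2) for p in softmax(logits)))))
```

With every step the loss shrinks and the probability assigned to the correct next word grows. Training an LLM repeats this kind of update across billions of parameters and an enormous number of text positions.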

However, this process comes with significant resource demands in terms of computational power, time, and data. Training an LLM from scratch is a computationally intensive task, often requiring specialized hardware and data centers due to the massive scale of the model. The sheer complexity of LLMs makes training a model from the ground up so expensive that it is accessible primarily to tech giants.
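
A rough estimate shows why. The sketch below uses a common rule of thumb of about 6 floating-point operations per parameter per training token. The 175 billion parameters come from the GPT-3 example above; the 300 billion training tokens (the figure reported for GPT-3) and the accelerator sustaining 100 teraFLOP/s are assumptions for illustration, and the function names are hypothetical.

```python
def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


def accelerator_days(total_flops: float, flops_per_second: float) -> float:
    """Days needed on one accelerator at a sustained throughput."""
    return total_flops / flops_per_second / 86_400


# 175 billion parameters (from the text); 300 billion training tokens
# is the figure reported for GPT-3 and is an assumption here.
flops = approx_training_flops(175e9, 300e9)

# Hypothetical accelerator sustaining 100 teraFLOP/s.
days = accelerator_days(flops, 100e12)
print(f"{flops:.2e} FLOPs, about {days / 365:.0f} accelerator-years")
```

Under these assumptions a single accelerator would be busy for many decades, which is why such training runs are spread across thousands of devices in dedicated data centers.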

Risks and limitations

Despite the undisputed advantages and endless possibilities LLMs offer, it is important to bear in mind that LLMs come with considerable drawbacks that require careful consideration.

Explainability

These models are often referred to as "black box" models, because they are so complex that it is impossible for a human being to peer inside and comprehend the decision-making process. In some cases this opacity can be harmful, prompting certain fields, such as banking or insurance, to outright prohibit the use of black box models. The inability to grasp the rationale behind AI decisions leaves room for potential errors that may go unnoticed. Consider, for instance, an AI system trained on historical data containing traces of racial or gender discrimination in determining who qualifies for insurance. This scenario is entirely plausible, especially considering that efforts to address discrimination are relatively recent. In such a case, the system may inadvertently perpetuate these biases, making decisions based on erroneous patterns. Detecting these biases can be immensely challenging, if not impossible.

In response to the imperative need to comprehend the outcomes of complex black box models, the field of Explainable AI (XAI) was conceived. XAI hinges on elucidating intricate models by locally relying on simpler ones, often employing decision trees or linear models. These models rely on much shallower architectures, which are easy for humans to follow and understand. Despite commendable efforts and remarkable progress in this domain, keeping pace with the ever-increasing complexity of newly created and updated LLMs remains an arduous, Sisyphean challenge.

LIME: A tool for XAI that works locally and explains the output of a more complex model (Source: Linardatos, Pantelis, Vasilis Papastefanopoulos, and Sotiris Kotsiantis. "Explainable AI: A review of machine learning interpretability methods". 2020)
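
The idea of explaining a complex model locally with a simpler one can be sketched in a few lines. This is not the LIME library itself but a minimal illustration in its spirit: the black-box function is a hypothetical stand-in with a single input feature, and the perturbations lie on an evenly spaced grid (LIME typically samples them randomly) so that the result is reproducible.

```python
import math


def black_box(x):
    """Stand-in for a complex model we cannot inspect (hypothetical)."""
    return x ** 2


def local_linear_surrogate(model, x0, radius=1.0, n_samples=21, width=0.5):
    """Fit a proximity-weighted straight line to the model around x0."""
    # Perturb the input on an evenly spaced grid around x0.
    xs = [x0 - radius + 2 * radius * i / (n_samples - 1) for i in range(n_samples)]
    ys = [model(x) for x in xs]
    # Samples closer to x0 get more weight (Gaussian kernel).
    ws = [math.exp(-((x - x0) ** 2) / (2 * width ** 2)) for x in xs]

    total = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total
    cov_xy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    var_x = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))

    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = local_linear_surrogate(black_box, x0=3.0)
print(f"near x = 3, the model behaves like y = {slope:.2f} * x + {intercept:.2f}")
```

The fitted slope tells a human reader how strongly the output reacts to the input in the neighborhood of one specific prediction. The explanation is only valid locally: around a different input, the surrogate line would look different.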

Hallucinations, toxic content, and copyright infringement

In addition to explainability, LLMs pose concerns in the form of “hallucinations”. These occur when an AI provides responses that are not rooted in actual data or real information, but instead stem from learning errors or misinterpretations of the input data. Hallucinations are hard to detect, and they can cause significant harm if they occur in a decision-making process.

Given the elusive nature of hallucinations, it becomes imperative to exercise caution with the outputs generated by LLMs. Such errors can potentially lead to dangerous consequences, particularly in a professional setting. One example is the creation of noxious content, steeped in bias and stereotypes. If left unnoticed, this can cause significant harm.

Another scenario, carrying the potential for financial and legal repercussions, is when LLMs produce output derived from copyrighted material. This happens because a large portion of their training data is copyrighted. As a result, an AI may regurgitate the original training input (or fragments of it) rather than crafting a genuinely "new" output by blending various sources, which can amount to copyright infringement.

For these reasons, it's good practice not to blindly utilize what LLMs produce, but to use the output as a starting point to create something truly authentic. This not only reduces the risk of spreading inaccuracies or incurring legal issues but also encourages the development of fresh and unique content.

Data privacy

Lastly, consuming LLMs via third-party APIs poses the risk of leaking confidential data. Suppose we decide not to use a local, self-developed model, but instead opt for the API of a renowned LLM for its enhanced performance. It's crucial to be aware that whatever information we input into a model operated via a third-party API may easily become the property of the provider, and is likely to be utilized for future training. Therefore, it is of utmost importance to avoid providing sensitive information when interacting with AI systems. Especially within a corporate setting, local models are preferred despite their potential shortcomings, as models consumed via APIs do not guarantee data security, potentially resulting in serious data leaks.
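
One simple precaution is to mask obviously sensitive details before a prompt ever leaves the company network. The sketch below is a minimal illustration with two hypothetical regular expressions for email addresses and phone numbers; it is nowhere near exhaustive, and real deployments need much broader coverage as well as a review of the provider's data policies.

```python
import re

# Illustrative patterns only; real deployments need broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text):
    """Replace matches of each pattern with a placeholder such as [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


prompt = "Summarize the complaint from jane.doe@example.com, phone +49 170 1234567."
print(redact(prompt))
```

The redacted prompt still carries enough context for the model to do its job, while the personal details stay on the local side.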

Will humans become obsolete?

LLMs represent a relatively recent technological leap that has already exerted a profound impact on the world. Some experts go further, suggesting that, in the years to come, AI may become the next global breakthrough after the advent of the internet. However, before delving into speculative, Matrix-like dystopian scenarios where AI dominates humanity, it is advisable that everyone acquaint themselves with the use of LLMs (which is surprisingly easy). This familiarity can significantly heighten personal and professional productivity by automating mundane and repetitive tasks, thereby affording us the time and space to concentrate on matters of greater significance and interest.

In conclusion, we should use the prowess of LLMs as our customized helpers, aiding in tasks from cooking to communication, facilitating study and work, and, why not, supporting the creation of innovative AI systems. However, it is essential to approach this with discernment and rationality, infusing it with our creativity and imagination. After all, the unique capacities of human intelligence remain irreplaceable by AI.
