
Roll Your Own Bot: Notes on Law Firms Building Their Own Private AI Systems

July 2024

Ankwei Chen

Generative artificial intelligence (AI) tools such as OpenAI’s GPT series (e.g., ChatGPT) have taken the public by storm since 2022 for their ability to provide responses that bear an unprecedented resemblance to human-created content. The legal industry is no exception: in an August 2023 survey conducted by LexisNexis of about 8,000 lawyers, law students and consumers in the U.S., U.K., Canada and France, over 90% of respondents believed AI will have some impact on the practice of law, and about half of those believed the impact will be transformative. This is despite “technology-assisted” tools having been in use in the legal industry for a significant amount of time before generative AI came to the forefront, such as the e-discovery services offered by well-known data management companies like Relativity. In any case, by mid-2024, the websites of almost all major legal research service providers advertise some form of generative AI integration with their existing services: for example, LexisNexis offers Lexis+, which is advertised as being able to assist in drafting legal documents and provide summaries and analyses of cases, while Thomson Reuters offers Westlaw Edge and Westlaw Precision, both of which claim enhanced research and analysis results through AI assistance. There are also community efforts like Legal-BERT, a model specifically trained on legal documents that can serve as the base learning model for an AI chatbot specializing in legal matters.

The magic of generative AI, however, is largely attributable to the enormous amount of data gathered to train the underlying learning model. In the case of ChatGPT, OpenAI has disclosed that the GPT series is trained on data from all over the Internet, and ChatGPT may also use the user’s queries (which, to be fair, users may opt out of) and ChatGPT’s own responses to such queries as additional data for the model to train on and improve. In addition to potential copyright infringement issues from scraping data online without the permission of the author, the possibility that ChatGPT may regurgitate sensitive or personal information learned from user input has caused many large corporations, such as Amazon and Samsung, to restrict employees’ use of ChatGPT for work purposes. As a result, some organizations looking to introduce AI assistance into their workflow have been examining the feasibility of building private AIs – AIs operating entirely within the organization’s internal computing systems (or in private rented cloud space), with learning models built from scratch or modified from publicly available models and trained on private or secured data.

There are clear incentives for law firms to develop AI in-house: the better-known law-related generative AI tools are still predominantly in English and trained on Anglo-American legal material, which is of marginal use to practitioners who do not regularly deal with those laws. For a boutique law firm, those services may also be under-trained in its area of specialization, which tends to show up as gaps in reasoning or erroneous/misleading results. Similarly, those services may offer functions that are superfluous to the firm’s current needs and thus represent relatively poor economic value. Finally, as long as competent technical support is on hand, the level of security and customization provided by total control over the training data is unparalleled and likely more economical in the long run than AI tools provided by large legal service providers.

With that said, the implementation of the Artificial Intelligence Act in the EU (the “EU AI Act”) has the potential to introduce additional legal considerations into the overall process of building a private AI solution. Because it is the first AI regulation framework of its kind in the world, and many other jurisdictions may reference its structure and spirit in drafting their own AI regulations, there is value in exploring how the current classification of AI systems in the EU AI Act and the relevant compliance requirements will play a role even if neither the firm nor the private AI system is expected to fall within the jurisdictional scope of the EU AI Act.

Accordingly, this article will cover the general decisions and issues a law firm looking to develop its own AI solutions can expect to face, including 1) deciding on the purpose of the AI, 2) deciding on the learning model, 3) incorporating the dataset, 4) addressing technical issues during pre-deployment testing, and 5) handling post-deployment matters, as well as the aforementioned regulatory compliance concerns where appropriate.

Intended purpose

The first step is for the firm to decide how generative AI can best support its daily operations. Because the intended purpose of the AI would impact every subsequent part of the building process, it is important to work out in advance what the firm’s needs are and what kind of (additional) resources the firm will need to procure for the AI.

The prime candidates for automation by AI in legal practice today are tasks that would otherwise be time-consuming and/or repetitive if done manually, but the ability of generative AI to create new content offers an extra dimension of utility beyond mere automation of routine tasks, such as predicting case outcomes or even simulating court or adversarial responses based on past data in order to test arguments in advance. It is therefore not surprising that in the aforementioned LexisNexis survey, the attorney respondents listed the following as the best potential uses for generative AI in practice: research (65% of respondents), drafting documents (56%), document analysis (44%) and email writing (35%), all of which are tasks largely handled by junior personnel as a way of “learning the practice of law”. As current generative AI learning models can all be made to handle such tasks with sufficient confidence given the proper training, any one or a combination of these uses may suitably serve as a starter generative AI project for a law firm. And because the AI is customized through its training data, a firm with sufficient technical know-how is free to build a private AI that specializes in handling content other than legal documents if needed.

For AI regulations that, like the EU AI Act, base compliance requirements on the risks posed by the AI system, which are in turn directly related to its purpose, the decision on the intended purpose will determine the level of compliance required of the firm. Assuming the private AI system will be used for one or more of the uses named in the LexisNexis survey, it is likely to fall within either the “limited”[1] or “minimal” risk categories in the EU AI Act’s risk categorization scheme, with only the “limited” category possibly requiring transparency obligations. However, it is not clear whether a law firm that develops a private AI system as described in this article (i.e., taking a pre-trained AI for further training on specific legal material and using it within the firm’s secure environment) would be considered a “provider”[2] or a “deployer”[3] of the AI system as defined by the EU AI Act, a distinction that matters because the “provider” of a “limited” risk AI system has more compliance requirements, as well as more liability exposure, than a “deployer” of the same.[4] The current consensus is that the provider-deployer determination will need to be made on a case-by-case basis.

Deciding on the learning model

The next step is to select the learning model to use for the private AI. Despite the name, this step is in practice less a technical exercise than one shaped by the legal considerations involved.

While research on natural language processing (NLP), which concerns how computers can be instructed to understand human text and generate or respond in kind, can be traced back to the 1950s, the key breakthrough that led to the current burst of generative AI growth was the development of a new type of deep learning architecture, the “transformer model”, at Google in 2017. Headlined by the development of BERT (Bidirectional Encoder Representations from Transformers) by Google and later GPT (Generative Pre-Trained Transformer) by OpenAI, transformer models revolutionized machine learning by incorporating a “(multi-head) self-attention mechanism” that comprehends the meaning of text input by assigning varying levels of importance to the words therein at an unprecedented level of parallelism (the “multi-head”), which greatly boosts performance and accuracy compared to the previous “recurrent neural network” models. The end result is a dramatic reduction in the training time needed for a given amount of data, thereby allowing much larger training datasets to be used – hence the name “large language models” (“LLMs”). Some of the most well-known LLMs are also called “foundation models” due to their role as the “foundation” of more specialized AI systems. Currently, the overwhelming majority of LLMs are transformers, so this discussion assumes the use of a transformer model for the private AI.

To better understand which transformer is appropriate for the intended purpose, some basic technical concepts about the transformer architecture are needed. The transformer as envisioned in Google’s original 2017 research paper handles translations from English to French or German via two major components known as the “encoder” and the “decoder”. In a very generalized sense, the encoder’s task is to contextualize the input text as a matrix made up of mathematical vectors, each consisting of up to hundreds of numbers that abstractly represent the machine’s understanding of a word’s context in relation to the entire input text. The decoder’s task is to take the results of the encoder (i.e., the machine’s understanding of the input text) and generate an output by predicting the next word based on the previous words, so an input text of “I love you” would generate an output of “Je t’aime”. The secret sauce of the transformer that does the main lifting on both the encoder and the decoder side is the aforementioned self-attention mechanism. Given these roles, encoders tend to specialize in question answering, entity recognition, sentiment analysis and other tasks best accomplished with a fuller understanding of the input text, while decoders tend to specialize in generating text. Subsequent transformers intended to specialize in one or more of the above tasks tend to de-emphasize the other component, hence the so-called “encoder-only”, “decoder-only” and “encoder-decoder” transformers. It is important to note that these terms can be misleading, as encoder-only transformers still need decoding to generate output, and decoder-only transformers still need to encode the input text.
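
To make the encoder/decoder distinction more concrete, the short sketch below shows how each architecture is typically exercised using the Hugging Face transformers library. The model choices (bert-base-uncased and gpt2) and the example prompts are illustrative stand-ins only, not recommendations for a legal AI system.

    from transformers import pipeline

    # Encoder-only (BERT): best at *understanding* text, e.g. filling in a masked word.
    fill_mask = pipeline("fill-mask", model="bert-base-uncased")
    print(fill_mask("The court granted the [MASK] to dismiss."))

    # Decoder-only (GPT-2): best at *generating* text, word by word.
    generate = pipeline("text-generation", model="gpt2")
    print(generate("The doctrine of judicial review", max_new_tokens=30))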

BERT, as the acronym implies, is probably the most well-known encoder-only transformer, and its success has led Google to integrate BERT into its search engine to handle user queries. Derivatives like RoBERTa have further improved on BERT’s encoder capabilities by optimizing the pre-training method. On the other side, the GPT series are all decoder-only transformers, and the public renown of their text generation capabilities has resulted in a greater proliferation of decoder-only transformers compared to the other two architectures (roughly a 2:1:1 ratio). Even though encoder-decoder transformers, like Google’s T5, may sound like the best of both worlds, they are generally harder (and thus more expensive) to train in practice due to the need to consider both the encoder side and the decoder side; however, if there is a need for robust text generation in addition to a strong understanding of the input text or document, the encoder-decoder architecture may be more economical in the long run.

Last but not least, one should take note of the licensing scheme, in particular how commercial use is treated. For proprietary LLMs, the licensing terms will generally need to be specifically negotiated with the owner company (if that is possible in the first place) and thus must be evaluated on a case-by-case basis. If the law firm chooses to go the open source route, however, it should take careful note of whether the licensing requirements are a proper fit with the project. For example, the copyleft requirement in the GPL and Creative Commons ShareAlike licenses, which requires all “derivative” works to also be licensed under the GPL or CC BY-SA, may under certain circumstances conflict with confidentiality requirements. Recent developments have added custom licensing schemes that specifically address issues arising from the use of AI, such as OpenRAIL (Open & Responsible AI License), whose restrictions are mostly behavior-based in order to promote the responsible use of AI, and Meta’s license for its Llama2 LLM, which, although the model is freely downloadable on sites like Hugging Face, explicitly requires a license from Meta if monthly active users exceed 700 million (a non-issue for most entities) and prohibits using data generated by Llama2 to train another LLM (a trickier limitation). The permissive (non-copyleft) licenses, such as Apache 2.0, MIT and the various BSD licenses, may be the best bet for most private AI uses, as they impose few restrictions on keeping the private AI proprietary and rarely conflict with other licenses.

On the regulatory front, any regulation that, like the EU AI Act, separately defines “general-purpose AI” (GPAI) may complicate matters: under the EU AI Act, a “provider” of a GPAI has fairly substantial documentation and disclosure responsibilities that are designed for compliance by the large corporations that created the GPAI, such as Google and Meta. The issue here is that a GPAI model is nominally defined[5] to cover most of the well-known LLMs, despite the vagueness of the wording used, and as previously mentioned, a firm may be classified as a “provider” by fine-tuning the LLM or inserting/modifying the training data, because those activities may be considered “development”. A possible out exists for providers of open-source GPAI models that carry no “systemic risk”[6], which would exempt the provider from having to periodically provide detailed technical documentation to the EU, but there is of course no ruling yet on which of the above license schemes would be considered “open-source” by the EU. While other jurisdictions may take completely different approaches that would render this regulatory murkiness less of an issue, because the EU AI Act’s penalty caps[7] are nearly on par with those for antitrust violations, there is arguable merit in waiting for further clarification before committing.

Incorporating the Dataset: RAG and fine-tuning

After selecting the model, the next step is to provide the AI system with the relevant data to train on. It should be noted right away that datasets for use by LLMs are not simply a collection of court decisions, academic papers, snippets of code, encyclopedia entries and the like. Many LLMs require the data to be presented in a specific format, sometimes in a question-answer form, so that the model can transform it into vectors as described above. As such, if a firm intends to use its own raw data to train the LLM instead of ready-made datasets available on Hugging Face or elsewhere, it should plan to dedicate resources to converting the raw data into the appropriate format for the selected LLM.
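
As a rough illustration, the sketch below converts raw text files of decisions into a question-answer (instruction-style) JSONL file. The field names follow a common convention but are assumptions here, as the exact schema depends on the model and training framework actually chosen, and the "decisions/" folder is a hypothetical source of raw text files.

    import json
    from pathlib import Path

    def build_record(doc_text):
        # "instruction"/"input"/"output" follow a common instruction-tuning convention;
        # the exact field names depend on the training framework actually chosen.
        return {
            "instruction": "Summarize the key holdings of the following decision.",
            "input": doc_text,
            "output": "",  # to be filled in and reviewed by a lawyer before training
        }

    def convert(raw_dir, out_path):
        with open(out_path, "w", encoding="utf-8") as out:
            for path in Path(raw_dir).glob("*.txt"):
                record = build_record(path.read_text(encoding="utf-8"))
                out.write(json.dumps(record, ensure_ascii=False) + "\n")

    convert("decisions/", "train.jsonl")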

Most well-known LLMs are pre-trained. Pre-training refers to the initial training of an LLM to develop an understanding of words and their various contexts throughout language, somewhat analogous to the years of compulsory education children worldwide receive for basic literacy. For many LLMs, the pre-training corpus (the “body” of training data) includes various sections and derivatives of the Common Crawl dataset, which is provided by a nonprofit organization of the same name that operates web crawlers covering large swaths of the Internet and is updated each month. More specialized datasets include arXiv (scientific articles), GitHub (developer platform), PubMed (medical research) and many others.

Since the corpus of a pre-trained LLM may contain a great deal of information that is unlikely to be helpful for the intended purposes of a law firm private AI, why not pre-train from scratch with just the desired corpus? It is generally not feasible to do so because the resources needed to effectively pre-train an LLM are enormous and thus likely out of reach for all but the largest law firms. To illustrate, consider Meta’s Llama2 (July 2023), a decoder-only LLM that is a very commonly used foundation model with publicly available information on its pre-training. Per Hugging Face, the Llama2 model has variants with 7, 13 and 70 billion parameters, all of which are trained on 2 trillion tokens of data. Parameters are the numerical values the model stores and adjusts on the fly in order to capture the relationships and context of words – in principle, the higher the number, the more capable the model is in handling and generating language – while tokens are the smallest “units of text” (words, parts of words, punctuation and so on) that the model uses to process text. For just the 7-billion-parameter variant, which is on the smaller side of contemporary LLMs but among the most commonly used parameter sizes for LLMs in a firm setting, the training is said to have taken approximately 185,000 “GPU-hours”, with the GPU being NVIDIA’s A100-80GB, which was and still is one of the most advanced AI accelerators generally available for commercial purchase (i.e., not custom-made for training AI), and to have incurred a carbon footprint of just over 30 tCO2eq (tons of carbon dioxide equivalent). A firm would therefore need the resources to either physically field, or rent through the cloud, hundreds of A100-class GPUs running continuously for weeks or months, and that is just the hardware side of the task. Moreover, unless the intended dataset has comparable content and variation to the aforementioned common pre-training sources, there is a major risk that the resulting LLM fails to attain a sufficient understanding of the language, or is otherwise only capable of generating output in the limited language style of the dataset, possibly rendering it ineffective at providing comprehensible, human-like responses. As there is no clear upside to starting from scratch with just the fabric and sewing materials instead of tailoring a ready-to-wear shirt, it is assumed for the purposes of this article that the model selected has already been pre-trained.
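
For a sense of scale, the back-of-the-envelope arithmetic below turns the roughly 185,000 GPU-hours cited above into wall-clock time and compute cost; the cluster size and hourly cloud rate are assumptions for illustration only.

    gpu_hours = 185_000        # approximate A100 GPU-hours reported for Llama2-7B
    gpus = 256                 # assumed cluster size
    usd_per_gpu_hour = 2.0     # assumed cloud rental rate per A100-hour

    days = gpu_hours / gpus / 24
    cost = gpu_hours * usd_per_gpu_hour
    print(f"~{days:.0f} days on {gpus} GPUs, roughly ${cost:,.0f} in compute alone")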

To continue with the analogy, how can the shirt be tailored to fit? Generally speaking, there are two major methods to build upon a pre-trained LLM for further specialization: implementing a “retrieval-augmented generation” (RAG) mechanism on top of the LLM, or fine-tuning the LLM, i.e., conducting additional training with smaller, more specialized datasets. Each has its own advantages and disadvantages.

By default, an LLM knows nothing besides the datasets it has already been trained on, which will inevitably become outdated or even contradicted as time goes on. RAG allows the LLM to generate content that incorporates information beyond its training data by adding a mechanism that retrieves the most relevant and up-to-date information from an external dataset to interpret the input, without having to periodically retrain the model, which, as shown above, can be very resource-intensive. A frequent analogy used to describe how RAG works is an open-book exam, as the AI can look up relevant and live information before passing it to the LLM to generate a response. Besides access to updated information and lower resource cost, the advantages of RAG are numerous:

(i) Dataset customization: LLMs with RAG functionality have unmatched versatility thanks to their ability to access a wide variety of external information through the retrieval mechanism, and the scope of their dataset can be customized by managing the connections made by the retrieval system.

(ii) Scaling: As the LLM itself is not changed, the typical AI issues resulting from mismatches between model complexity and dataset size, such as overfitting (or underfitting) and, to some extent, hallucination and bias, are relatively minimized.

(iii) Greater developer-side control: For the most part, how an LLM generates its output from its training data is obscured even from the developer, which complicates troubleshooting efforts when the response is unsatisfactory. In contrast, the control the developer has over the RAG retrieval system enables quicker pinpointing of issues, especially when it is possible to instruct the LLM to cite the source of its information alongside the output.

The use cases for RAG in law firm AI systems are enormous, given the importance of having cited law or precedent reflect the most accurate and up-to-date information available. Virtually all AI intended to assist with legal research and document analysis will need some kind of connection to continuously updated external legal datasets in order to be minimally viable. Chatbots and client-service functions require external information about the client. Even AIs that assist in drafting emails and client communications would be more effective with access to past client information. It may therefore be rare to find law firm AI systems that do not implement RAG functionality in some form.

However, implementing RAG is not without its own issues. First, building an effective retrieval system on top of the LLM is not a trivial technical task. Although there are several well-known frameworks, like LangChain and LlamaIndex, that provide tools and guidance for integrating a RAG mechanism into an “LLM app”, the building process still requires solid coding knowledge and an understanding of how the external database is indexed, the optimal retrieval strategy (e.g., “chunk size”: because current transformer architectures are limited in the number of tokens that can be processed at once as a single vector, a document needs to be broken down into appropriately sized “chunks” for the LLM to efficiently determine its relevance), and other technical matters. The additional retrieval step also introduces latency into a RAG-based AI system, leading to a worse user experience compared with an LLM-only AI system. Furthermore, no amount of technical prowess can overcome issues with the external dataset itself, which can range from inconsistent document quality to improper maintenance or the lack thereof. Given the possible points of failure in the add-on retrieval system in addition to the LLM, implementing RAG in the private AI system will likely require more upkeep effort from the law firm.
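
The sketch below illustrates the basic retrieve-then-generate loop without any particular framework. The chunk() helper, the embedding model choice and the llm_generate() call are illustrative assumptions; a production system would more likely rely on LangChain or LlamaIndex together with a proper vector database.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

    def chunk(text, size=500):
        # Naive fixed-size chunking; real systems split on paragraphs or sections.
        return [text[i:i + size] for i in range(0, len(text), size)]

    def retrieve(query, chunks, k=3):
        # Rank chunks by cosine similarity between the query and chunk embeddings.
        q = embedder.encode([query])[0]
        c = embedder.encode(chunks)
        scores = c @ q / (np.linalg.norm(c, axis=1) * np.linalg.norm(q))
        return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

    def answer(query, corpus):
        context = "\n\n".join(retrieve(query, chunk(corpus)))
        prompt = f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}"
        return llm_generate(prompt)  # hypothetical call to the firm's private LLM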

If RAG is likened to an open-book exam, fine-tuning is more like specialized preparation for the exam questions, akin to bar exam preparation courses. As opposed to retrieving the needed information from an external dataset, fine-tuning is the use of smaller, more specialized datasets to further train an LLM so that its performance in generating output related to such additional data improves over the purely pre-trained state. In other words, fine-tuning is in essence making the LLM actually attend law school classes instead of Googling the result. As a hypothetical example, while a pre-trained LLM may be able to name and discuss the U.S. Supreme Court case of Marbury v. Madison in responding to an input request of “Summarize the impact of judicial review in the United States”, an LLM that has undergone fine-tuning on U.S. constitutional law cases and academic papers may be able to further discuss how the courts have wielded such power over the years in relation to the politics of the time and how it has contributed to the current strength of the U.S. federal government relative to the states. Such analysis cannot be found in general sources like Wikipedia that the pre-trained LLM was trained on, so the pre-trained LLM cannot be expected to generate a similarly in-depth response. Moreover, even though a RAG-based LLM can conceivably generate the same level of detail in its output if it has access to the same cache of constitutional law data, the accuracy of the RAG response may not be as high as that of a fine-tuned LLM, because it is based on the documents retrieved rather than the model’s “innate” understanding. Therefore, all else being equal, in circumstances where the pace of updates is slow or the most updated information is of questionable utility, fine-tuning may be preferable to RAG.

Besides the enhanced ability to handle specialized tasks or domains, another potential advantage of fine-tuning is the customization of the LLM’s writing style and tone. Since the LLM’s parameters are adjusted as part of the training process, depending on the dataset used, the LLM can be trained to generate output in a style more appropriate for certain legal tasks, such as client memoranda or even court documents. In addition, because the fine-tuning process requires the firm to control the data the LLM is exposed to, it is possible to train the LLM via fine-tuning to avoid disclosing confidential and/or sensitive information in the generated content, thereby minimizing one of the biggest risks in using publicly available generative AI.

On the other hand, as mentioned under RAG, a fine-tuned LLM, regardless of how well it is implemented, will remain as-is until it is further trained with new data. Since training is involved, the aforementioned large costs also apply to fine-tuning, with the silver lining that the fine-tuning dataset can be several orders of magnitude smaller than the pre-training datasets, greatly reducing the amount of hardware needed. In addition, effective fine-tuning of an LLM still requires solid knowledge of the fine-tuning strategies to be applied (e.g., “parameter-efficient” fine-tuning, which adjusts only a small part of the LLM’s entire set of parameters to reduce overall memory consumption on less expensive hardware, or “instruction tuning”, which is fine-tuning on a separate dataset that does not provide additional knowledge to the LLM but instead enables it to better understand what kind of answer the user is expecting and respond accordingly), the configuration of the appropriate training “hyper-parameters”, and the preprocessing of the dataset, the details of which are beyond the scope of this article. As a result, a fine-tuned LLM solution will generally entail higher overall costs than a RAG-based solution.
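
As an illustration of the parameter-efficient approach mentioned above, the sketch below wraps a base model with LoRA adapters using the Hugging Face peft library. The base model name, target modules and hyper-parameters are assumptions for illustration only; the wrapped model would then be trained on the firm's instruction-formatted dataset from the earlier sketch.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "meta-llama/Llama-2-7b-hf"  # assumed base model (gated behind Meta's license)
    model = AutoModelForCausalLM.from_pretrained(base)
    tokenizer = AutoTokenizer.from_pretrained(base)

    # Only small low-rank adapter matrices are trained; the base weights stay frozen.
    lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically well under 1% of all weights

    # The wrapped model is then trained as usual (e.g., with transformers.Trainer)
    # on the instruction-formatted dataset prepared earlier.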

Finally, RAG and fine-tuning are not mutually exclusive, as their respective workings do not introduce conflicting changes to the AI system. It is therefore possible to add RAG to an LLM that has already undergone fine-tuning for the specific field or subject. While the idea of having the best of both worlds is clearly attractive, the amount of work and testing needed would naturally increase, along with the additional possible points of failure caused by using two systems. The firm will have to decide whether such a tradeoff is acceptable in light of the intended purpose of the AI system and the firm’s resources.

Research is also underway to find possible synergies between the two mechanisms. An interesting recent development is a hybrid method called RAFT (retrieval augmented fine tuning) announced by a UC Berkeley research team in early 2024, which integrates retrieval into the fine-tuning dataset based on the notion that a student who has the opportunity to study the relevant textbooks before using them in an open-book exam the next day should in principle outperform a student who only cracks open the textbook during the exam. The team chose Meta’s Llama2 7B-parameter model for their experiment and prepared a fine-tuning dataset in the form of a question, a set of reference documents, an answer generated from such documents, and an explanation with sources from those documents. The results show RAFT outperforming plain RAG on a variety of widely available datasets, and that a smaller LLM using RAFT can achieve performance comparable to LLMs with more parameters.
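
By way of illustration, a single RAFT-style training record might be assembled roughly as below, following the recipe described by the team (a question, "oracle" and "distractor" documents, and an answer that reasons from and cites the oracle documents). The field names and content here are assumptions, not the team's exact schema.

    raft_record = {
        "question": "Under what standard may a federal court dismiss for forum non conveniens?",
        "documents": [
            {"id": "doc_1", "text": "...oracle document containing the answer..."},
            {"id": "doc_2", "text": "...distractor document on an unrelated issue..."},
        ],
        # The answer reasons from, and cites, the oracle document.
        "answer": "Per doc_1, the court weighs the private and public interest factors... [cite: doc_1]",
    }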

Under the EU AI Act’s categorization system, since the private AI system is unlikely to be classified as “high risk”, the Act imposes no specific compliance requirements on the nature and contents of the training data beyond those already covered by other laws (e.g., copyright), but that does not preclude other jurisdictions from requiring such transparency.

Pre-deployment testing issues

As with any other new tool, a firm building its own private AI system should conduct extensive testing before approving it for production-level use. Unlike other software, where debugging-level access makes it possible (in most cases) for the developer to determine how a bug came to be and what needs to be done to prevent it, the internal workings of the transformer underlying the LLM are, as mentioned above, essentially hidden from humans, which makes the testing process much more of a trial-and-error affair. Some of the most common issues encountered during testing include:

(i) Hallucinations
This term describes instances of an AI generating a response that contains incorrect, misleading or fabricated information while presenting it as being just as factual as the rest of the response, hence “hallucination”. Although this phenomenon, along with anecdotal examples, is often cited for its amusement value and as evidence that generative AI cannot be taken seriously as a replacement for humans any time soon, its memetic appearance on social media belies its prevalence and thorniness as an issue for the use of AI. On GitHub, there is a “Hallucination Leaderboard” that keeps a running score of how often major AI chatbots provide hallucinated information, which more or less corroborates independent research showing hallucination rates anywhere from 2% to an astounding 20% or more. A particularly notorious example in law involved a 2023 case before the U.S. District Court for the Southern District of New York, in which the plaintiff’s attorneys, in opposing a motion to dismiss, used ChatGPT (for the first time, according to the attorneys) to assist in drafting the brief. ChatGPT, however, cited six decisions that could not be found by the judge or the defendants in any database, and its descriptions of those fake cases contained citations to further nonexistent cases. With party names like Varghese v. China Southern Airlines, however, they sounded real enough, and the attorneys claimed that ChatGPT assured them the cases were real when asked. While the fault in the above example should mostly be attributed to carelessness and perhaps overconfidence, given the relative ease of verifying whether a case cited by AI actually exists, the insidious nature of some hallucinations, such as inserting random, made-up statistics into an otherwise factually correct response in a way that may escape all but the closest scrutiny, makes this one of the big “boogeyman” issues for generative AI.

The current theory is that since an LLM’s decoder generates output by predicting, word by word, which word should come next based on what it has learned about that word from its training data, an unexpected gap in its learning can cause the chain of predictions to go awry and veer off into something completely different; due to the black-box nature of the transformer, there is currently no way to see or predict exactly how this happens. For example, if after “A”, “B” and “C” the LLM only has an idea that something related to “four” or “fourth” should come next, it may go with “IV” instead of “D”, and likely follow that with “V”, producing something that looks logically consistent or plausible only on its own terms. This also explains the apparent “doubling down”, since the LLM finds everything consistent with what it has learned. Under this theory, however, hallucination is a consequence of how the decoder works and is therefore, for all intents and purposes, unavoidable.

As a result, efforts today focus on minimizing the occurrence of hallucinations rather than trying to prevent them entirely. Current methods of mitigating AI hallucinations generally involve a robust fact-checking system alongside high-quality training data. Besides fact-checking solutions not based on machine learning, RAG, as mentioned, can be used for fact-checking, and there is significant ongoing research into the optimal point in time at which retrieval should occur (i.e., before, during or after generation, or some combination thereof). It has also been observed, however, that while RAG can excel at verifying the truthfulness of a piece of information when all it takes is retrieving a document – such as whether Varghese v. China Southern Airlines was actually a real case – RAG is less capable of verifying persuasion-based text, which requires comparison and synthesis of multiple sources. Other techniques include using feedback to improve the draft response over multiple iterations, and there are also projects that tackle the issue at the decoder level itself.
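
As a simple illustration of the document-lookup type of fact-checking described above, the sketch below flags case names in a draft response that cannot be found in a reference set. The extract_citations() pattern and the reference set are hypothetical placeholders; a real system would use a proper citation parser and an authoritative case database.

    import re

    def extract_citations(text):
        # Very naive "Party v. Party" pattern; a real system would use a citation parser.
        return re.findall(r"[A-Z][\w.'-]*(?: [A-Z][\w.'-]*)* v\. [A-Z][\w.'-]*(?: [A-Z][\w.'-]*)*", text)

    def flag_unverified(response, case_database):
        # Return every cited case name that is absent from the reference database.
        return [c for c in extract_citations(response) if c not in case_database]

    known_cases = {"Marbury v. Madison"}  # stand-in for an authoritative case database
    draft_answer = "As held in Varghese v. China Southern Airlines, the claim is tolled..."
    print("Verify before use:", flag_unverified(draft_answer, known_cases))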

Due to the seriousness of AI hallucinations as an issue, research is extremely active in this area, with dozens of academic papers on promising new methods published over the past year. For a law firm currently thinking about building a private AI, anti-hallucination measures are a must and need to be decided on early.

(ii) Overfitting
This is a relatively easy-to-understand AI phenomenon because it also manifests in humans in a similar fashion: an LLM that performed very well on a given training dataset (i.e., it generates output that very closely matches expectations) may underperform when faced with real-life data because it learned the training dataset “too well”, leaving it unable to generalize what it has learned to the new data. The human analogy for overfitting would be that intense study of exam preparation materials may lead to a top essay-writing score on the bar exam yet not translate well to drafting a memorandum in actual practice, even on the same subject, because the legal knowledge and test-taking tips in the exam preparation materials are structured solely to enable the reader to maximize scoring under the relevant criteria, such as mastering a little-known rule that for some reason appears regularly on the exam to annoy exam takers but has nearly no relevance in actual practice.

Overfitting is thus likely to occur when the LLM is:

(1) Trained for too long on the training dataset;
(2) Too complex (i.e., has too many parameters) relative to the training dataset;
(3) Trained on data that contains a lot of irrelevant information; or
(4) Trained on too little data.

In each of the above circumstances, overfitting occurs because the LLM has the capacity to pick up noise, such as outlier results or irrelevant background information, and treat it as being just as important and/or expected as the actual salient data; when that noise is missing from real-world data, the LLM is likely to be confused and render inaccurate results.

The most practical way to determine whether overfitting is present is to compare the output on the training dataset with the output on other datasets; this is a key reason why it is considered best practice to split a single dataset into a training set, a validation set and a test set during AI training. Good accuracy on only the training set indicates overfitting. If overfitting is detected, the adjustments to be made generally respond directly to one or more of the aforementioned causes: the complexity of the LLM can be temporarily reduced through a number of techniques that in effect disincentivize the LLM from picking up noise, noise can be cleaned from the training dataset in advance, the dataset can be artificially enlarged through a process called data augmentation, the training time can be reduced, and so on. As the optimal measures for correcting improper data fitting will obviously vary on a case-by-case basis, the process will likely involve some trial and error.
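
A minimal sketch of the split-and-compare check described above follows. The examples list, the evaluate() function, the model object and the 10-point gap threshold are all hypothetical placeholders, since the appropriate metric (loss, accuracy, ROUGE, etc.) and threshold depend on the task.

    import random

    def split(dataset, train=0.8, val=0.1):
        # Shuffle, then carve out training, validation and test portions.
        random.shuffle(dataset)
        n = len(dataset)
        a, b = int(n * train), int(n * (train + val))
        return dataset[:a], dataset[a:b], dataset[b:]

    train_set, val_set, test_set = split(examples)   # "examples" is a hypothetical dataset
    train_score = evaluate(model, train_set)         # evaluate() is a hypothetical metric
    val_score = evaluate(model, val_set)
    if train_score - val_score > 10:                 # a large gap suggests overfitting
        print("Likely overfitting: consider more data, regularization or earlier stopping.")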

(iii) Catastrophic interference
Also called “catastrophic forgetting”, this refers to the observation that AI systems can unexpectedly become unable to recall context learned from past training data while undergoing fine-tuning or additional training on different datasets. The equivalent analogy here is a pianist suddenly and completely forgetting how to play the piano while learning a different instrument. Once again, this is currently believed to be a natural consequence of sequential learning by neural networks: the parameters of an LLM start out as random values that are adjusted during training to enable the LLM to properly predict the next appropriate word when generating output; they can be thought of as representing how the LLM understands language and context. However, when new information causes a new round of changes to the parameters, the LLM may be unable to recall the old understanding and context because the relevant parameters no longer hold the same values. As a result, the LLM can no longer generate output concerning the dataset it was originally trained on with the same level of accuracy.

This phenomenon has serious implications for any AI system that is planned to be constantly updated or to continually learn new tasks over time. Current techniques for mitigating catastrophic interference include ways to avoid the aforementioned overwriting of parameters by new information. One recent example is elastic weight consolidation (EWC), which detects and tags certain parameters as important for the previously learned task in order to reduce the likelihood of their being changed by new information. Other methods find ways to preserve the old context, such as progressive neural networks, which change the architecture of the LLM so that new information does not alter the parameters of an existing network but instead leads to the creation of a separate network, thereby retaining the old parameter values. In addition, since the human brain does not exhibit this kind of forgetting when learning, there is ongoing research on mimicking memory consolidation in humans, which is currently thought to occur through sleep and dreaming.
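
For illustration, the EWC idea can be sketched as an extra penalty term added during the new round of training, as below in PyTorch. The fisher and old_params dictionaries are assumed to have been computed after training on the old task, and the lambda weight is an arbitrary placeholder.

    import torch

    def ewc_penalty(model, fisher, old_params, lam=0.4):
        # Penalize movement of parameters that the Fisher information marks as
        # important to the previously learned task.
        loss = torch.zeros(())
        for name, param in model.named_parameters():
            loss = loss + (fisher[name] * (param - old_params[name]) ** 2).sum()
        return (lam / 2) * loss

    # During fine-tuning on the new dataset:
    #   total_loss = new_task_loss + ewc_penalty(model, fisher, old_params)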

Whether catastrophic interference is a significant concern depends on the firm’s fine-tuning and continued-learning strategy for its private AI system, but the firm should still be aware of this phenomenon when conducting training and consider the measures needed to mitigate its impact, which can be as simple as maintaining backups of the AI system to minimize downtime should something go wrong.

To summarize, the technical quirks of AI systems are numerous and in many cases not yet well understood, so their corresponding solutions can sometimes feel improvised and inelegant. On the other hand, the incredible growth of research efforts and end-user involvement in working with AI has made a great deal of documentation available for guidance, which in principle should make testing and troubleshooting less of a daunting task.

Post-deployment security

Once the private AI system has finally satisfactorily passed the relevant tests, it needs a properly established cybersecurity setup. In addition to many of the same practices applied to ordinary secured computer systems, such as making backups, staying updated regarding security vulnerabilities and patches, user training, etc., specific measures are needed to protect the model and the data, due to the valuable data involved and the potential for misuse and sabotage by bad actors:

(i) Access control: As the first line of defense, it is essential to establish control over who has access to the AI system, both as users and as system administrators who may interact with the LLM itself, to minimize the risk of unauthorized access that could expose the firm to serious liability (a minimal illustration follows this list).

(ii) Active monitoring: In addition to user access control, system logging, network activity monitoring and other anomaly detection tools are needed to warn against both external intrusion and internal problems like those described below.

(iii) Data protection: Since any unauthorized alteration to the dataset can lead to serious unintended results from the LLM, the datasets used for training and/or fine-tuning must be kept secure pursuant to appropriate encryption standards. Moreover, if client information is involved, such as from a chatbot, the firm would need to account for both client confidentiality and personal data protection compliance requirements.

(iv) LLM integrity check and protection: As described above, the LLM cannot be expected to maintain its performance on the same tasks indefinitely without reasonable maintenance. The firm needs to institute regular checks for hallucinations, overfitting, catastrophic interference and other AI-related phenomena that impact its performance.
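
By way of illustration, points (i) and (ii) can be combined in a thin wrapper around the private LLM endpoint that enforces a role check and writes an audit log, as sketched below. The authorized-role set and the llm_generate() backend are placeholders.

    import logging
    from datetime import datetime, timezone

    logging.basicConfig(filename="ai_audit.log", level=logging.INFO)
    AUTHORIZED_ROLES = {"partner", "associate", "paralegal"}  # assumed firm roles

    def ask_private_ai(user, role, prompt):
        timestamp = datetime.now(timezone.utc).isoformat()
        if role not in AUTHORIZED_ROLES:
            logging.warning("%s DENIED user=%s", timestamp, user)
            raise PermissionError("User is not authorized to query the AI system.")
        logging.info("%s QUERY user=%s prompt=%.80s", timestamp, user, prompt)
        return llm_generate(prompt)  # hypothetical call to the firm's private LLM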

Conclusion

Private AI systems are becoming more and more popular due to the security they may offer without compromising overall efficacy. While the technical and legal challenges of building a private AI system are now far from insurmountable, a law firm is still strongly advised to carefully plan each step of the building process beforehand based on the resources available to it, verify the applicable AI regulatory rules with the relevant competent authorities, and protect its investment post-deployment by establishing a robust security system.




[1] If the private AI system will directly interact with a natural person, e.g., a chatbot.
[2] Article 3(3): “…A natural or legal person…that develops an AI system or a general-purpose AI model or that has an AI system or a general-purpose AI model developed and places it on the market or puts the AI system into service under its own name or trademark…”
[3] Article 3(4): “…A natural or legal person…using an AI system under its authority except where the AI system is used in the course of a personal non-professional activity;”
[4] Per Article 50, providers must ensure that whoever interacts with the AI must be notified that they are interacting with an AI, that the contents generated by an AI are clearly marked as such, and their technical solutions are effective and robust, while deployers are only responsible for disclosure if the AI involved is an emotion recognition system or biometric categorization system, or if the AI generates “deep fakes”.
[5] Article 3(63): “An AI model, including where such an AI model is trained with a large amount of data…that displays significant generality and is capable of competently performing a wide range of distinct tasks...” There is currently no clarification for terms like “significant generality” or “wide range of distinct tasks”.
[6] Article 51(1): “A [GPAI] model shall be classified as a [GPAI] model with systemic risk if it meets any of the following conditions: (a) …high impact capabilities evaluated on the basis of appropriate technical tools and methodologies… ; (b) based on a decision of the Commission…” See also Article 51(2): “A [GPAI] model shall be presumed to have high impact capabilities pursuant to paragraph 1, point (a), when the cumulative amount of computation used for its training measured in floating point operations is greater than 10^25.”
[7] Article 99: Up to 7% of total worldwide annual turnover for noncompliance with the prohibition on certain AI practices, or up to 1% of total worldwide annual turnover for improper information reporting. Compare with “…not exceed 10% of its total turnover in the preceding business year” for fines in antitrust cases (Article 23 of Council Regulation 1/2003).




The contents of all materials (Content) available on the website belong to and remain with Lee, Tsai & Partners.  All rights are reserved by Lee, Tsai & Partners, and the Content may not be reproduced, downloaded, disseminated, published, or transferred in any form or by any means, except with the prior permission of Lee, Tsai & Partners. 

The Content is for informational purposes only and is not offered as legal or professional advice on any particular issue or case.  The Content may not reflect the most current legal and regulatory developments.  Lee, Tsai & Partners and the editors do not guarantee the accuracy of the Content and expressly disclaim any and all liability to any person in respect of the consequences of anything done or permitted to be done or omitted to be done wholly or partly in reliance upon the whole or any part of the Content. The contributing authors' opinions do not represent the position of Lee, Tsai & Partners. If the reader has any suggestions or questions, please do not hesitate to contact Lee, Tsai & Partners.


Author

Lee, Tsai & Partners