PALE Large Language Models, instead of “Open Source”
TL;DR: The term “Open Source” should not be used to refer to LLMs, since these models are not open in a way that is consistent with the established meaning of open source software. I propose the term PALE (Publicly Available, Locally Executable) LLMs instead. It arguably communicates the intended meaning better, and it is a nice-sounding acronym.
The term “open source”, as applied to software, was coined in February 1998 by Christine Peterson during a series of meetings held to discuss Netscape’s strategy of releasing its browser under a free software-style license. Open source is, to a large extent, a newer, catchier name for Free Software, a nowadays arguably less well-known term that appeared in 1983. The free software movement laid the underpinnings that open source builds upon. It aims to promote freedom, as in free speech, for software, and more specifically for programs (images and sound files are also software, but different principles apply).
A program can be considered free software if its users are free to (0) execute it for any purpose, (1) study how the program works and modify it as they wish, (2) copy and distribute it, and (3) distribute copies of modified versions. Availability of source code is a precondition for items (1) and (3). The term open source was coined to make the ideas of free software more palatable to corporations, given the ambiguity of the word “free” (“free as in freedom, not free beer”). Conceptually, however, even though each term emphasizes different aspects, they mean basically the same thing. Even their definitions (free software, open source) are semantically very similar, although stated in different terms. Also related is the notion of Open Source Hardware.
Having said all this, a quick Google search for the terms “free software” and “open source software” (on March 20, 2024) reveals that the former is less popular (42.7M results) than the latter (82.1M). Furthermore, many of the results for the former pertain to gratis software, which is not the intended meaning of the term. Names matter. This takes us to the discussion about “open source LLMs”. IBM has the following to say about this:
The term “open source” refers to the LLM code and underlying architecture being accessible to the public, meaning developers and researchers are free to use, improve or otherwise modify the model.
Leveraging the definition of free software, which is basically a less bureaucratic definition of open source software, we can say that so-called open source LLMs can satisfy items (0) and (2). One could argue that (3) also applies, as fine-tuned versions of the models can be distributed. The problem lies in item (1), and that in turn affects (3). As mentioned before, availability of source code is a precondition for these items, as it is required both to understand and to modify the functioning of the program. The reasoning is that, although we could theoretically recover a form of source code by disassembling the binary, the result would be unreasonably hard for human beings to understand. With LLMs, we have that problem at two levels, extrinsic and intrinsic. It is extrinsic because many of these models are trained on data that is not publicly disclosed, so we cannot anticipate the situations where they are likely to perform well or poorly. It is intrinsic because, even if the training data and procedure are disclosed, the model itself is represented as a set of weights that is arguably even less understandable than binary code, as the “instruction set” on which those weights operate is not known.
One could argue that the term is not “open source software” but “open source LLM”, and that this entails other properties. That’s a fair point, considering that “open source” is a popular term that arguably succeeded where “free software” before it stumbled. Leveraging the popularity of this term to help popularize a new technology makes sense. At the same time, it is a misnomer. The model is not open, in the sense that it cannot be studied and understood. For example, we cannot examine such a model to discover whether it can produce, e.g., racist, sexist, or fascist text, beyond simply querying it exhaustively. Its internals are, for all practical intents and purposes, closed. Furthermore, there is no source associated with it. All we have is an organized group of weights, floating point numbers, whose meaning can only be ascertained in a limited way, again, by exhaustive querying. So, if the model is not really open and there is no source we can refer to, why call it “open source”?
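To make the point about weights concrete, here is a minimal sketch. It assumes that a toy two-layer network with random (untrained) weights can stand in for an LLM, which is a drastic simplification; the names make_toy_model and run are hypothetical and not taken from any library. The sketch first prints the weights, then queries the model, which is the only way to learn what it does with a given input.

```python
import random


def make_toy_model(seed, n_inputs=3, n_hidden=4):
    """Build a toy two-layer network as plain lists of floats.

    Assumption: random values stand in for trained LLM weights.
    """
    rng = random.Random(seed)
    hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    return hidden, output


def run(model, inputs):
    """Query the model: one ReLU hidden layer followed by a weighted sum."""
    hidden, output = model
    activations = [max(0.0, sum(w * x for w, x in zip(row, inputs)))
                   for row in hidden]
    return sum(w * a for w, a in zip(output, activations))


model = make_toy_model(seed=0)

# "Opening" the model: all we see is rows of floating point numbers.
for row in model[0]:
    print([round(w, 3) for w in row])

# Learning what it does: we have to query it, one input at a time.
for probe in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    print(probe, "->", round(run(model, probe), 3))
```

Even with only sixteen weights, reading the numbers says little about the outputs. A real LLM has billions of weights and an input space far too large to enumerate.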
Finally, one can argue that, in the context of LLMs, “open source” comes from open source intelligence (OSINT). That is perhaps more appropriate. The term stems from the gathering of intelligence from sources that make information publicly available, e.g., news broadcasters and online communities. In that sense, I agree that the term could apply to LLMs. According to the NATO definition of the term, “open” in this case means two things: publicly available sources and unclassified information. If we consider the LLM itself as information (which it is, since it is software), this definition clearly applies. However, from a more pragmatic perspective, LLMs are actually tools: what they do is important, as is what we are able to do with them in terms of understanding and modifying them. The impossibility of looking inside an LLM in any meaningful way (remember: explainability of AI is still an active area of research) challenges the use of this term. Furthermore, “source” in the case of OSINT is about where the information comes from, whereas in the context of software it is about source code. Microsoft Office is publicly available (for a fee), but it is not reasonable to call it “open source”.
Having argued against the use of the established term, I propose a new term for these LLMs, based on the main properties that can be associated with them. Taking IBM’s explanation as a starting point, we can identify three relevant properties. These LLMs are (i) publicly available, (ii) free to use, and (iii) modifiable, by means of fine-tuning. In particular, these are models that one can run on one’s own local machine, if it has enough power. Some of them even have fairly modest hardware requirements. With this in mind, I propose we refer to these models as Publicly Available, Locally Executable, using the acronym PALE. This is a short term that is explicitly not a misnomer and covers some of the most obvious properties of the things it refers to. It is not a perfect term; it does not cover the modifiability of the models, nor their free-ness (in this case, free as in both freedom and free beer). However, “open source” is also not just about availability of source code. No name is perfect, but some are more imperfect than others.