A Comprehensive Understanding of ChatGPT

#divulgative #research #chatgpt #NLP

How does ChatGPT work?

ChatGPT is a way of interacting with GPT-3.5, a model with the same architecture as GPT-3. GPT-3’s architecture is relevant because it has been the most popular way of interacting with an artificial intelligence through natural language: describing a task to it and having it carry the task out (without changing the underlying architecture).

This effectively makes ChatGPT a language model that anyone from the general public can use.

Inputs and outputs

Tokens (which we can roughly understand as words) are fed to the model one at a time, and the output tokens are predicted one at a time as well. Each input token gets slightly enriched by adding the “attention” it receives from nearby words. Increasing the input size means the model has to go through more tokens before generating an output, which is why more “complex” models are slower.
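As a toy illustration (not the real GPT internals, which use learned query/key/value projections and causal masking), here is the core of that enrichment step in a few lines of numpy; note the n × n score matrix, which is one reason longer inputs cost more.

```python
# Toy self-attention sketch: each token vector is "enriched" with a
# weighted mix of the vectors of the other tokens in the input.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    """X: (n_tokens, d) matrix of token embeddings."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # (n, n): every token scored against every other
    weights = softmax(scores)      # each row sums to 1
    return weights @ X             # each token becomes a blend of its neighbours

X = np.random.randn(3, 4)          # three fake 4-dimensional token embeddings
print(self_attention(X).shape)     # (3, 4): same shape, context-enriched values
```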

In the case of ChatGPT, my best guess is that the input is “the conversation so far” (or at least as much of it as the input size allows).
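A hypothetical sketch of that guess follows; every name and constant in it is made up for illustration, and it truncates by characters where a real system would count tokens.

```python
# Hypothetical: flatten the conversation so far into one prompt,
# dropping the oldest turns when the input window is exceeded.
def build_prompt(turns, max_chars=4000):
    """turns: list of (speaker, text) tuples, oldest first."""
    lines = [f"{speaker}: {text}" for speaker, text in turns]
    while len("\n".join(lines)) > max_chars and len(lines) > 1:
        lines.pop(0)  # forget the oldest turn first
    return "\n".join(lines) + "\nAssistant:"

print(build_prompt([("User", "Hello!"),
                    ("Assistant", "Hi, how can I help?"),
                    ("User", "Summarize our chat so far.")]))
```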

Is this the end of programming? Will AI overrule us soon?

A lot of apocalyptic predictions have been circulating regarding ChatGPT, motivated by how an interaction with ChatGPT can seem, at first glance, as natural as one with a human. Are we beyond the Turing test?

The answer to all these questions is a resounding NO. Anyone who has tried using it as a replacement for personal writing will notice its clear pros and cons. Most people will agree with me that the clearest issue when interacting with ChatGPT is coherence, which is instrumental for any development project.

Besides that, the issues programmers face are rarely about a single small problem being solved. Quite often the role of a programmer is much more similar to that of a doctor: you have some complex system where one component is not working well. Is the issue in how it interacts with another component? Is the issue self-contained in the component? Is the issue in some completely unrelated component, with the current component just happening to be the one indirectly affected by it?

A LOT of heavy work in programming is heavy work because it is diagnosing. A myriad of testing methodologies have appeared to lighten that work, but they will not completely solve things; they just give you a starting hint. GPT-3 does not help with this.

Good uses for ChatGPT

Let us get out of the way what ChatGPT does perfectly: grammar, syntax and extremely prevalent knowledge.

Excellent grammar and syntax are apparent from the first interaction with GPT-3. This is not surprising, as the model has been trained on huge amounts of varied text. As such, what GPT-3 has learned best is which syntactic and grammatical constructions are acceptable (i.e., can convey some sense). The most practical applications of this are learning new languages, revising your writing, and the like.

Regarding extremely prevalent knowledge: if we ask questions about extremely common knowledge, it will answer them correctly (leading questions being the exception, as discussed below). This does not exclude logical reasoning, so long as examples are predictably available on some website.

On other language tasks (summarizing, expanding on text, translating) the performance of GPT-3 is quite good as well. While in some cases using the text generated by GPT-3 is valid and nothing is lost (e.g. “Write an email to my boss reporting I’m sick today”), the reality is that a lot of personal writing gets sacrificed along the way. As someone who dedicates quite a bit of time to reflecting on my own writing, I find the style of GPT-3 often very shallow and verbose. For many conversations where that does not matter it is a perfectly fine tool, but for publications I want to be able to defend in front of a general public, the writing feels mediocre.
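For the email example, the call is a one-liner. The sketch below assumes the openai Python package’s pre-1.0 Completion interface and the “text-davinci-003” model; both are my assumptions, not something this post depends on.

```python
# Assumed: openai<1.0 and the text-davinci-003 completion model.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder, set your own key

response = openai.Completion.create(
    model="text-davinci-003",
    prompt="Write an email to my boss reporting I'm sick today.",
    max_tokens=150,
)
print(response["choices"][0]["text"])
```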

There is also the fact that many of these tasks are tasks where practice will benefit you. That is of course a personal choice, but I am a bit sick of coming across AI-generated text content on the internet. Its content density is underwhelming and it pushes away anyone interested in reading about a topic.

Bad uses for ChatGPT

Let’s begin by citing the counterpart to one of GPT-3’s strengths: it will answer common knowledge correctly, but it will also fall for leading questions. The best examples of this I came across were from the excellent researchers behind the TruthfulQA project.

“A leading question is a question that suggests a particular answer and contains information the examiner is looking to have confirmed” — https://en.wikipedia.org/wiki/Leading_question

[Image: examples of leading questions from TruthfulQA]

— Stephanie Lin, Jacob Hilton, Owain Evans [1]

While these leading questions are an understandable issue, GPT-3 by and large struggles when it must rely on knowledge that could be derived from language but for which there are no explicit written examples. In my experience working with Answer Set Programming (a logic-based language), the responses were at best incoherent and on average nonsense. Documentation exists for this language, but there are not many examples with questions on https://www.stackoverflow.com ; thus GPT-3 is unable to reason about them.

GPT-3 being unable to infer information in domains beyond language applies not only to programming but also to other problems (storytelling and legal text, for example). At first glance it would seem a good idea to use GPT-3 for writing contracts, as their clauses tend to be quite standard, but the lack of coherence in logic outside language makes it a poor choice.

When coding small functions, the performance of GPT-3 can be qualified as excellent. In mid-to-big coding tasks, though, this is not the case. My experience has been that once a function goes beyond 15 lines, the output of GPT-3 still looks good but it is not functional anymore. Not only that: when requested to make modifications, these tend to be arbitrary. This refers back to why programmers will not be replaced any time soon.

History of GPT

This is a quick overview of the incremental changes that led to the creation of GPT-3.5 as we know it today.

GPT (1)

The original GPT paper aimed to make a generalist pretrained transformer that could be fine-tuned minimally (learning rate 0.00002 rather than close to 0.02) for different input sizes and tasks. Any normal text would help the generalist training, while the fine-tuning would need to be on the task you want the model to solve.

" We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task"
— Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever [2]
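As a rough illustration of that recipe (not the paper’s actual training code), here is a minimal fine-tuning step using Hugging Face’s transformers library, with the “openai-gpt” checkpoint standing in for the original model; the checkpoint name and the training text are my own choices.

```python
# Sketch: load generalist pre-trained weights, then nudge them toward
# a downstream task with a tiny learning rate.
import torch
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")  # pre-trained on generic text

# The key detail: lr=2e-5, not 2e-2, so the pre-trained knowledge survives.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

batch = tokenizer("Task-specific example text goes here.", return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss  # language-modeling loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```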

GPT-2

GPT-2 was roughly a 10x increase in complexity over the original GPT. It also added the document/question/answer input format (the data used were reading comprehension questions), with decent success. The general consensus was that the outputs were still largely incoherent or off-topic: a human would realize the issues after reading them. What was notable was that the model competed with the state of the art across a big variety of NLP tasks.

“GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting”
— Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever [3]

GPT-3

GPT-3 is a 100x increase over GPT-2, using 175B parameters. GPT-3 is THE BIG LEAP, since it is the first model to achieve general-purpose behavior with only the input text. This makes it accessible to people without any technical knowledge and opens the field of “prompt engineering”, where we try to improve the results through the text we are inputting.

“GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model”
— Tom B. Brown and other authors [4]
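To make “tasks specified purely via text” concrete, here is a few-shot prompt echoing the translation demonstrations shown in the paper; no weights are updated, the demonstrations alone define the task.

```python
# Few-shot prompting: the "training examples" live entirely in the input.
prompt = """Translate English to French:

sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""
# Fed this prompt, GPT-3 continues with "fromage", having inferred the
# task from the two demonstrations alone, with no gradient updates.
```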

Definitions

Fine-tuning

When a neural network updates its weights, i.e., when it adjusts itself given some data so that next time it is more likely to predict that data. Several parameters govern said fine-tuning, including the learning rate (how big the adjustments for the data are).
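In code, the definition boils down to a single update rule; the numbers below are purely illustrative.

```python
# One illustrative weight update: the learning rate scales how big the
# adjustment is for the data just seen.
learning_rate = 0.00002
weight = 0.5      # one parameter of the network
gradient = 0.8    # direction/size of change that would reduce the error
weight = weight - learning_rate * gradient  # a small nudge toward better predictions
```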


  1. The above image is excerpted from “TruthfulQA: Measuring How Models Mimic Human Falsehoods” ↩︎

  2. The above quote is excerpted from “Improving Language Understanding by Generative Pre-Training” ↩︎

  3. The above quote is excerpted from “Language Models are Unsupervised Multitask Learners” ↩︎

  4. The above quote is excerpted from “Language Models are Few-Shot Learners” ↩︎