ChatGPT Key Concepts

In this article, we review three key ChatGPT concepts you should know before you start using OpenAI’s chatbot.

GPT Models

OpenAI’s GPT (generative pre-trained transformer) models are trained to understand both natural language and code. Given an input, commonly called a “prompt”, they produce text as output. “Programming” a GPT model effectively means writing a suitable prompt, usually by giving instructions or examples that show how a task should be carried out.
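To make the idea of "programming by prompt" concrete, here is a minimal sketch that assembles a prompt from an instruction and a few worked examples. The helper name build_prompt and the "Input:/Output:" layout are assumptions chosen for illustration, not a format required by OpenAI, and the sketch only builds the string without calling any model.

```python
def build_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: an instruction, worked examples, then the new input.

    The "Input:/Output:" layout is an illustrative convention, not an official format.
    """
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # End with an open "Output:" so the model is cued to complete the pattern.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_prompt(
    "Translate each English word into French.",
    [("cat", "chat"), ("dog", "chien")],
    "bird",
)
print(prompt)
```

The instruction tells the model what to do, the examples show how, and the trailing open "Output:" leaves the model to continue the pattern for the new input.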

GPT models can be used for a wide range of tasks, including content or code generation, summarization, conversation, creative writing, and more. Dive deeper with our basic GPT guide and our GPT best practices handbook.

Embedding Techniques

An embedding is a vector representation of a piece of data, such as a passage of text, that captures its content or meaning. Pieces of data with similar content or meaning usually have embeddings that lie close together. OpenAI offers models that take a text snippet as input and return its embedding vector. Embeddings are useful for tasks such as search, clustering, recommendation, anomaly detection, classification, and more. For a more detailed overview, see the comprehensive guide on embeddings.
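One common way to measure how "closely related" two embeddings are is cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration. Real embedding vectors come from an embedding model and have many more dimensions, and the words and values here are assumptions, not model output.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors invented for this example; they are NOT real model embeddings.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(embeddings["cat"], embeddings["kitten"])
sim_far = cosine_similarity(embeddings["cat"], embeddings["invoice"])

print(f"cat vs kitten:  {sim_close:.3f}")
print(f"cat vs invoice: {sim_far:.3f}")
```

With these toy values, the pair chosen to be similar scores much higher than the unrelated pair. Search and recommendation rely on this property: rank the candidates by similarity to a query embedding and keep the top results.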

Token Concept

Both GPT and embedding models process text in chunks called tokens. Tokens represent commonly occurring sequences of characters.

For example, the word “tokenization” is split into “token” and “ization”, whereas a short, common word such as “the” is a single token. Note that within a sentence, the first token of each word typically begins with a space character.

You can use the tokenization tool to try out different strings and see how they are split into tokens. As a rule of thumb, 1 token corresponds to roughly 4 characters or about 0.75 words of English text.
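The rule of thumb above can be turned into a quick estimator. This is a rough sketch only: the function names are made up for illustration, the ratios apply to typical English text, and an actual tokenizer will often give a different count.

```python
import math

# Ratios taken from the rule of thumb: about 4 characters or 0.75 words per token.
CHARS_PER_TOKEN = 4
WORDS_PER_TOKEN = 0.75


def estimate_tokens_from_chars(text):
    """Estimate the token count from the number of characters."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(text):
    """Estimate the token count from the number of whitespace-separated words."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


sample = "Tokens are commonly occurring sequences of characters."
print("estimate from characters:", estimate_tokens_from_chars(sample))
print("estimate from words:     ", estimate_tokens_from_words(sample))
```

The two estimates will not always agree, which is expected for a heuristic. When the exact count matters, such as near a context limit, count tokens with a real tokenizer instead.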

One limitation to keep in mind is that, for a GPT model, the prompt and the generated output combined must not exceed the model’s maximum context length. For embedding models, which do not output tokens, the input must be shorter than the model’s maximum context length. The maximum context length for each GPT and embedding model is listed in the model directory.
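For the GPT case, the constraint is simple arithmetic: prompt tokens plus output tokens must stay within the context length. The sketch below shows that check. The function names are hypothetical, and the limit of 4096 is an example value only; the real figure depends on the model.

```python
def fits_context(prompt_tokens, max_output_tokens, context_limit):
    """Return True if the prompt plus the requested output fit in the context window."""
    return prompt_tokens + max_output_tokens <= context_limit


def max_output_budget(prompt_tokens, context_limit):
    """Return how many output tokens remain once the prompt is accounted for."""
    return max(0, context_limit - prompt_tokens)


# Example value only; look up the real limit for the model you use.
EXAMPLE_CONTEXT_LIMIT = 4096

print(fits_context(3000, 1000, EXAMPLE_CONTEXT_LIMIT))
print(fits_context(3500, 1000, EXAMPLE_CONTEXT_LIMIT))
print(max_output_budget(3500, EXAMPLE_CONTEXT_LIMIT))
```

A longer prompt leaves less room for the answer, so trimming the prompt is often the easiest way to leave space for a long output.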