
Why does ChatGPT have a word limit?

3 minute read


ChatGPT has a limit of roughly 4,096 tokens, which works out to around 3,000 words. That’s a lot of words, but often not enough. The reason for this limit is its implementation: the model has a fixed context window of about 4,096 tokens, so it can only process and understand text input up to that limit. If you provide a longer text, it may get cut off, and the model can give you a nonsensical response or simply say “I don’t know”. That’s not very helpful, is it?
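
One practical consequence is that you may want to count tokens before sending a prompt. Here is a minimal sketch using OpenAI’s tiktoken library; the 4,096-token budget and the cl100k_base encoding are assumptions chosen for illustration:

```python
# A minimal sketch: count tokens before sending a prompt.
# Assumes the tiktoken package is installed (pip install tiktoken);
# the 4,096-token budget and the encoding choice are illustrative assumptions.
import tiktoken

TOKEN_BUDGET = 4096  # assumed context window

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Return the number of tokens `text` occupies under the given encoding."""
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(text))

prompt = "Why does ChatGPT have a word limit?"
n = count_tokens(prompt)
print(f"{n} tokens used, {TOKEN_BUDGET - n} tokens left in the budget")
```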

To overcome this limitation, one can use a technique called “batch processing”. This involves breaking the input text into smaller chunks and processing each chunk separately. For example, one can set the batch size to 250 words and give the AI 500 words of context (250 before and 250 after). This allows ChatGPT to work through a large amount of text by feeding it smaller bites at a time. It’s like eating a pizza slice by slice instead of trying to swallow the whole thing at once.
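
Here is a rough sketch of what that chunking could look like in code; the 250-word chunk size and the 250 words of context on each side come from the example above, and splitting on whitespace instead of tokens is a simplification:

```python
# A rough sketch of word-level chunking with surrounding context.
# Sizes mirror the example above: 250-word chunks, 250 words of context
# on each side. Splitting on whitespace is a simplification; a real
# implementation would count tokens rather than words.
def chunk_with_context(text, chunk_size=250, context=250):
    words = text.split()
    batches = []
    for start in range(0, len(words), chunk_size):
        end = start + chunk_size
        ctx_start = max(0, start - context)       # up to 250 words before
        ctx_end = min(len(words), end + context)  # up to 250 words after
        batches.append({
            "chunk": " ".join(words[start:end]),
            "with_context": " ".join(words[ctx_start:ctx_end]),
        })
    return batches

# Each batch would then be sent to the model as its own request.
long_document = "word " * 1000
for batch in chunk_with_context(long_document):
    print(len(batch["chunk"].split()), len(batch["with_context"].split()))
```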

ChatGPT’s fixed context length is a common limitation of Large Language Models (LLMs). Transformer-based LLMs such as ChatGPT compute a 2-D attention matrix that compares every token with every other token, so memory and computation grow quadratically as the context length increases. This makes it difficult to handle tasks that involve processing long documents or engaging in extended conversations.

In a Transformer, the attention mechanism calculates the relationships between every pair of tokens in the input sequence. It does this by computing a 2-D attention matrix, where each cell represents the attention weight between two tokens. Since the attention mechanism compares every token with every other token, the number of computations grows quadratically with the input length.
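
To make this concrete, here is a minimal sketch of scaled dot-product attention in NumPy; the sequence length and embedding size are arbitrary illustration values, and the thing to notice is that the attention matrix for n tokens has shape (n, n):

```python
# A minimal scaled dot-product attention sketch (NumPy).
# The sequence length and embedding size are arbitrary illustration values.
import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n) matrix: every token vs. every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

n, d = 10, 16                          # 10 tokens, 16-dimensional embeddings
Q = np.random.randn(n, d)
K = np.random.randn(n, d)
V = np.random.randn(n, d)

out, weights = attention(Q, K, V)
print(weights.shape)                   # (10, 10) -> n * n pairwise comparisons
```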

To visualize this, imagine you have a document with 10 tokens (words). To compute the attention matrix, you need to compare each token with every other token, which means you need to do 10 x 10 = 100 comparisons. Now imagine you have a document with 100 tokens. To compute the attention matrix, you need to do 100 x 100 = 10,000 comparisons. That’s a lot more work for the Transformer! And if you have a document with 1,000 tokens, you need to do 1,000 x 1,000 = 1,000,000 comparisons. That’s a million comparisons! You can see how this can quickly become impractical for long texts.
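
The memory cost grows in the same quadratic way, because that n x n matrix has to be stored. A quick back-of-the-envelope check, assuming 4-byte float entries, makes the growth concrete:

```python
# Back-of-the-envelope memory for a single n x n attention matrix,
# assuming 4-byte (float32) entries. Real models multiply this by the
# number of heads and layers, so the actual cost is far larger.
for n in (10, 100, 1_000, 10_000):
    comparisons = n * n
    megabytes = comparisons * 4 / 1e6
    print(f"{n:>6} tokens -> {comparisons:>12,} comparisons, ~{megabytes:,.2f} MB")
```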

Multi-head attention does not reduce the number of comparisons on its own, but it makes better use of them. The Transformer splits its query, key, and value parameters into multiple parts and passes each part independently through a separate attention head, which lets it attend to different aspects of the input sequence at different positions. For example, one head might focus on the subject of a sentence, another on the verb, and another on the object. By combining the outputs of the different heads, the Transformer gets a richer representation of the input sequence.
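
Here is a minimal sketch of the head-splitting idea, with the learned projection matrices omitted for brevity; the head count and dimensions are arbitrary illustration values:

```python
# A minimal multi-head attention sketch (NumPy). The learned per-head
# projections of a real Transformer are omitted; each head simply gets
# its own slice of the query/key/value dimensions.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, num_heads=4):
    n, d = Q.shape
    d_head = d // num_heads                           # assume d divides evenly
    outputs = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)      # this head's slice
        q, k, v = Q[:, sl], K[:, sl], V[:, sl]
        weights = softmax(q @ k.T / np.sqrt(d_head))  # still an (n, n) matrix per head
        outputs.append(weights @ v)
    return np.concatenate(outputs, axis=-1)           # combine the heads' outputs

n, d = 10, 16
Q, K, V = (np.random.randn(n, d) for _ in range(3))
print(multi_head_attention(Q, K, V).shape)            # (10, 16)
```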

An approach that directly tackles the quadratic cost is the Extended Transformer Construction (ETC). ETC uses a structured sparse attention mechanism called global-local attention. The input is split into two parts: a global input with unrestricted attention and a long input whose attention is limited to the global input and a local neighborhood. This results in attention that scales linearly with input length, allowing ETC to handle longer input sequences.
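
As a rough sketch of the global-local idea (not ETC’s actual implementation), one can picture an attention mask in which a few global tokens attend everywhere and every other token attends only to the global tokens plus a small local window:

```python
# A rough sketch of a global-local attention mask, inspired by the ETC
# idea rather than its actual implementation. True means "may attend".
import numpy as np

def global_local_mask(seq_len, num_global=2, window=2):
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    # global tokens attend to everything, and everything attends to them
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    # long-input tokens also attend to a local neighborhood around themselves
    for i in range(num_global, seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True
    return mask

m = global_local_mask(seq_len=10)
# Each long-input row allows only about (num_global + 2 * window + 1) positions,
# so the total number of attended pairs grows linearly with sequence length.
print(m.sum(axis=1))
```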

If you have any questions or feedback, please let me know.😊