Working with long context windows
A context window of 1 or 2 million tokens seems to be enough for almost any task we could imagine. With multimodal models, you can simply ask the model questions about one, two, or many PDFs, images, or even videos. To process multiple documents (for summarization or question answering), you can use what's known as the stuff approach. This approach is straightforward: use a prompt template to combine all inputs into a single prompt, then send this consolidated prompt to an LLM. It works well as long as the combined content fits within your model's context window. In the next chapter, we'll discuss further ways of using external data to improve models' responses.
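The stuff approach can be sketched in a few lines. The template text, separator, and helper names below are illustrative assumptions, not a specific library's API:

```python
# A minimal sketch of the "stuff" approach: join every document into
# one consolidated prompt using a template. The template wording and
# the stuff_documents helper are hypothetical, chosen for illustration.

PROMPT_TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question: {question}
Answer:"""


def stuff_documents(documents: list[str], question: str) -> str:
    """Combine all documents and the question into a single prompt."""
    # Separate documents so the model can tell where one ends.
    context = "\n\n---\n\n".join(documents)
    return PROMPT_TEMPLATE.format(context=context, question=question)


docs = [
    "Doc 1: The annual report covers revenue growth.",
    "Doc 2: The appendix lists regional offices.",
]
prompt = stuff_documents(docs, "What does the report cover?")
print(prompt)
```

In practice, you would send `prompt` to your model of choice; before doing so, it is worth estimating its token count (for example, with the model provider's tokenizer) to confirm it fits within the context window.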
Keep in mind that multimodal LLMs typically treat PDFs as images.
Compared to the 4,096-token context windows we were working with only two years ago, the current context window of 1 or 2 million tokens is...