Hello everyone,
I’m new to the XWiki community and I’m interested in LLM integration.
I didn’t quite understand where the connection to a local LLM is made if you don’t want to use OpenAI. I understand that when configuring the LLM Application extension you have to enter a URL, but I don’t understand how the connection is actually made.
What’s more, I don’t quite understand the embedding models. Are they there to increase the relevance of our prompt in the UI when using the LLM chosen in the chat? Like a shadow LLM that ensures there’s a good understanding between the user and the model?
Finally, my last question is whether we can use a model that doesn’t follow the OpenAI chat/completions API if we configure a Chat Request Filter.
Thanks in advance for your answers!
You can configure any LLM provider that offers an OpenAI-compatible API. There are two steps:
- Configuring the server. For chat models, you need to provide a URL; we currently don’t support running chat models in XWiki itself. This URL can point to your installation of LocalAI or Ollama. In our experience, Ollama is the easiest option if you want to host LLMs yourself (see the sketch after this list). The extension ships with an example configuration for OpenAI; you can override or delete it if you don’t want to use it.
- Configuring the model. Every model that should be available to users needs to be added to the model list. Again, we provide some examples there; you can remove or change them if you don’t want to use them.
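To illustrate what “OpenAI-compatible” means in practice, here is a minimal sketch (in Python, outside XWiki) of the kind of chat completion request that goes to the configured server URL. The base URL and model name are assumptions for a local Ollama install; use whatever you set in your own server and model configuration.

```python
import requests

# Assumed local Ollama endpoint and model name; adjust to your own setup.
BASE_URL = "http://localhost:11434/v1"
MODEL = "llama3"

response = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": MODEL,
        "messages": [
            {"role": "user", "content": "Summarize what XWiki is in one sentence."}
        ],
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```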
Now regarding embedding models, they are used for providing relevant context to the LLM from your wiki when you also install the Index for the LLM application, add some collections and enable them in the model configuration. This uses a technique called Retrieval Augmented Generation (RAG). We currently don’t use embedding models for other purposes, but it is possible that we’ll implement other features using embedding models in the future.
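As a rough sketch of the RAG flow: the user’s question is embedded, the most similar chunks from the indexed collections are retrieved, and those chunks are added to the prompt before it is sent to the chat model. The function names below are purely illustrative, not the actual implementation of the Index for the LLM application.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Call the configured embedding model (stub for illustration)."""
    raise NotImplementedError

def retrieve(question: str, chunks: list[str], chunk_vectors: np.ndarray, k: int = 3) -> list[str]:
    """Return the k indexed chunks whose embeddings are most similar to the question."""
    q = embed(question)
    # Cosine similarity between the question vector and every chunk vector.
    sims = chunk_vectors @ q / (np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-sims)[:k]]

def build_prompt(question: str, context: list[str]) -> str:
    """Prepend the retrieved wiki content so the chat model can ground its answer."""
    return "Answer using the following wiki content:\n\n" + "\n\n".join(context) + f"\n\nQuestion: {question}"
```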
The internal inference server currently only supports embedding models. Apart from a simple lack of time to implement support for chat models, there are two reasons for this:
- Chat models are much more expensive to run, and running them on the CPU where XWiki runs won’t provide a good experience: you’ll probably have to wait at least a minute for a response unless the LLM is tiny, and then the quality might not be good.
- Ollama currently doesn’t provide an OpenAI-compatible API for embedding models, and we noticed problems with some embedding models on LocalAI where we got errors for certain content. Having an internal embedding model thus makes it significantly easier to set up a fully local LLM integration.
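For completeness, this is roughly what an OpenAI-style embeddings request looks like, so you can see what an external server would need to support; the base URL and model name below are placeholders, not a recommendation for a specific server.

```python
import requests

# Placeholder base URL and model name for an OpenAI-style embeddings endpoint.
BASE_URL = "http://localhost:8080/v1"

response = requests.post(
    f"{BASE_URL}/embeddings",
    json={
        "model": "some-embedding-model",
        "input": ["How do I create a new page in XWiki?"],
    },
    timeout=60,
)
response.raise_for_status()
vector = response.json()["data"][0]["embedding"]
print(len(vector))  # dimensionality of the returned embedding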
Yes, if you implement your own Chat Request Filter, this should indeed be possible already, as your filter can simply “intercept” the request and not forward it to the actual model. My idea was to eventually support different kinds of server APIs with a server type selection directly in the server configuration, but so far the need hasn’t been strong enough, as basically every LLM inference server provides an OpenAI-compatible chat completion API.
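Conceptually, such a filter sits between the chat UI and the configured server, so it can translate the request into whatever API your model expects and return the answer itself instead of forwarding the request. The sketch below is purely conceptual Python, not the actual Java extension point of the LLM Application, and the endpoint and fields it calls are hypothetical.

```python
import requests

def filter_chat_request(messages: list[dict]) -> str:
    """Conceptual filter: intercept the OpenAI-style request, call a
    non-OpenAI API instead, and return the answer directly so the request
    is never forwarded to the configured model."""
    # Flatten the chat history for an API that only accepts plain text
    # (the endpoint and payload below are hypothetical).
    prompt = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    response = requests.post(
        "http://localhost:9000/generate",  # hypothetical non-OpenAI endpoint
        json={"text": prompt},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["output"]  # hypothetical response field
```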
I hope this makes things a bit clearer; if not, feel free to ask for clarification.