We are excited to announce our latest addition to the Elasticsearch Open Inference API: the customizable integration! Any model that can be addressed by a REST API can be integrated with the new Custom Inference Service. Whether that model is hosted locally or running in the cloud, you can now configure a new inference service with just a URL and a few lines of JSON to define the API format. The Custom service supports the sparse and dense text embedding task types as well as the rerank task type.
Creating the Customizable Inference Endpoint
To create a Custom Service Inference Endpoint, we’ll need to identify a few key components before we can issue the PUT request. Once we’ve identified those components, we’ll use Kibana’s Console to execute the commands in Elasticsearch without needing to set up an IDE. The request below shows the high-level format of the creation request.
PUT _inference/text_embedding/inference_service_name
{
  "service": "custom",
  "service_settings": {
    "secret_parameters": {
      <secrets>
    },
    "url": <url>,
    "headers": {
      <headers>
    },
    "request": <body definition>,
    "response": {
      "json_parser": {
        "text_embeddings": <response path>
      }
    },
    "input_type": {
      "translation": {
        <translation mapping>
      },
      "default": "query"
    }
  }
}
Here’s a brief explanation of each field:
- secret_parameters are stored securely and should include any sensitive information, like API keys.
- url defines the path for connecting to the external service.
- headers defines any HTTP headers to include in the subsequent inference requests.
- request defines the template for the request body to send.
- response defines a JSONPath-like string dictating how to extract the embeddings from the response.
- input_type defines a mapping from the Elasticsearch input types to the values required by the third-party service. This is important when generating text embeddings to ensure that the correct context is used.
The full documentation for the Custom Service can be found here: https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-custom
Understanding the request fields
In this blog, we’ll use the Custom Service to connect to NVIDIA NIM, a new inference service from NVIDIA for GPU-accelerated inference.
To determine the values for Custom Service creation request fields, we’ll need to understand the request and response schema for the NVIDIA API.
NVIDIA supports an OpenAI-compatible schema for generating text embeddings. The API reference is explained here: https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/reference.html.
We’ll use this example request from NVIDIA (https://build.nvidia.com/nvidia/nv-embedqa-e5-v5?snippet_tab=Shell) to define our Custom Service Inference Endpoint:
curl -X POST https://integrate.api.nvidia.com/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC" \
  -d '{
    "input": ["The quick brown fox jumps over the lazy dog"],
    "model": "nvidia/nv-embedqa-e5-v5",
    "input_type": "query",
    "encoding_format": "float",
    "truncate": "NONE"
  }'
Below, we’ll discuss the parts we need from the request and what they’ll look like in the Custom Service creation request.
URL
The URL we’ll use to generate text embeddings is https://integrate.api.nvidia.com/v1/embeddings. To set the URL in the Custom service we’ll use the url field.
"url": "https://integrate.api.nvidia.com/v1/embeddings"
Headers
The example NVIDIA request requires a few headers to define the content type and authentication. To include this information in the Custom Service, we’ll leverage the headers and secret_parameters fields like so:
"secret_parameters": {
"api_key": "<your api key>"
},
"headers": {
"Authorization": "Bearer ${api_key}",
"Content-Type": "application/json"
}
Make sure to replace <your api key> with your actual API key, surrounded by double quotes. We’ll discuss what the ${api_key} template means a little later.
Body
We’ll use most of the fields from the body of the example request to build a string that is used to construct the actual request body the Inference API sends to NVIDIA.
"request": "{\"input\": ${input}, \"model\": \"nvidia/nv-embedqa-e5-v5\", \"input_type\": ${input_type}, \"encoding_format\": \"float\", \"truncate\": \"NONE\"}"
We don’t include the input text in this request because the Custom service creation request only configures how the Inference API will structure the requests it sends to NVIDIA; we’ll supply the input text later, when we generate the embeddings with a separate request. We also replaced "input_type": "query" with the ${input_type} template so that the value can be populated dynamically depending on the request context (if we know it will always be query-related, we could hardcode it to "query" instead). We’ll go over how the Custom Service input_type field works later in the blog.
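For illustration, here is roughly what the body sent to NVIDIA would look like once the templates are resolved at inference time for a search-context request containing the example sentence from NVIDIA’s snippet (templates are filled in when an inference request is made, not when the endpoint is created):
{
  "input": ["The quick brown fox jumps over the lazy dog"],
  "model": "nvidia/nv-embedqa-e5-v5",
  "input_type": "query",
  "encoding_format": "float",
  "truncate": "NONE"
}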
Understanding the response fields
To complete the Custom service definition, we’ll need to identify the path to the embeddings within the response from the NVIDIA API.
Here’s an example response (https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/reference.html#generate-embeddings):
{
  "object": "list",
  "data": [
    {
      "index": 0,
      "embedding": [
        0.0010356903076171875, -0.017669677734375,
        // ...
        -0.0178985595703125
      ],
      "object": "embedding"
    }
  ],
  "model": "nvidia/nv-embedqa-e5-v5",
  "usage": {
    "prompt_tokens": 0,
    "total_tokens": 0
  }
}
The Custom Service needs a JSONPath-like (https://en.wikipedia.org/wiki/JSONPath) string to instruct it on how to extract the embeddings. For this response format, the string will be $.data[*].embedding[*].
- $ indicates the root of the object.
- A . (dot/period) indicates looking for a nested field.
- data[*] indicates to look for a field called data; the [*] specifies that the field should be treated as an array. embedding[*] is treated the same way as data[*]. A short illustration follows this list.
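To make that concrete, here is roughly what $.data[*].embedding[*] selects when applied to the example NVIDIA response above (this is just an illustration of the path, not actual Elasticsearch output):
[
  0.0010356903076171875, -0.017669677734375,
  // ...
  -0.0178985595703125
]
The Inference API then returns these values as the embedding for the corresponding input string, as we’ll see at the end of the blog.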
Putting it all together
Now that we understand the NVIDIA request and response schema, we can use those pieces to construct the creation request. The Custom service PUT request will be:
PUT _inference/text_embedding/inference_service_name
{
  "service": "custom",
  "service_settings": {
    "secret_parameters": {
      "api_key": "<your api key>"
    },
    "url": "https://integrate.api.nvidia.com/v1/embeddings",
    "headers": {
      "Authorization": "Bearer ${api_key}",
      "Content-Type": "application/json"
    },
    "request": "{\"input\": ${input}, \"model\": \"nvidia/nv-embedqa-e5-v5\", \"input_type\": ${input_type}, \"encoding_format\": \"float\", \"truncate\": \"NONE\"}",
    "response": {
      "json_parser": {
        "text_embeddings": "$.data[*].embedding[*]"
      }
    },
    "input_type": {
      "translation": {
        "search": "query",
        "ingest": "passage"
      },
      "default": "query"
    }
  }
}
Templates
Templates provide a way to defer defining a value until a request is sent to the external service. A template is a string of the form ${some_name}: it starts with a dollar sign and an opening curly bracket ${ and ends with a closing curly bracket }. When the Custom Service builds the request to send to NVIDIA, it replaces each template with a value specified in the secret_parameters and task_settings objects. A template is matched to its value by looking up the name between the curly brackets in secret_parameters and task_settings. For example, the ${api_key} template will be replaced with the API key defined in secret_parameters; if the API key were the value abc, the Authorization header’s value would be Bearer abc.
There are a few built-in templates.
- ${input} refers to the array of input strings that comes from the input field of the subsequent inference requests.
- ${input_type} refers to the input type translation values. NVIDIA supports query for search requests and passage for ingest requests. If Elasticsearch attempts to send a key (like classification) that is not defined in the translation mapping, the default value (query in this case) will be used instead.
- ${query} refers to the input query required for the rerank task type (https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-rerank).
You can find more information on the built-in templates here: https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-custom
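As a quick illustration of how the input_type translation in our endpoint definition resolves (this simply restates the mapping above; it is not additional configuration):
search         -> "query"     (from the translation mapping)
ingest         -> "passage"   (from the translation mapping)
classification -> "query"     (not in the mapping, so the default is used)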
When we execute the Custom service Inference Endpoint creation request, we should get a response like the following:
{
  "inference_id": "inference_service_name",
  "task_type": "text_embedding",
  "service": "custom",
  "service_settings": {
    "similarity": "dot_product",
    "dimensions": 1024,
    "url": "https://integrate.api.nvidia.com/v1/embeddings",
    "headers": {
      "Authorization": "Bearer ${api_key}",
      "Content-Type": "application/json"
    },
    "request": "{\"input\": ${input}, \"model\": \"nvidia/nv-embedqa-e5-v5\", \"input_type\": ${input_type}, \"encoding_format\": \"float\", \"truncate\": \"NONE\"}",
    "response": {
      "json_parser": {
        "text_embeddings": "$.data[*].embedding[*]",
        "embedding_type": "float"
      }
    },
    "input_type": {
      "translation": {
        "ingest": "passage",
        "search": "query"
      },
      "default": "query"
    },
    "rate_limit": {
      "requests_per_minute": 10000
    },
    "batch_size": 10
  },
  "chunking_settings": {
    "strategy": "word",
    "max_chunk_size": 250,
    "overlap": 100
  }
}
Now that we’ve created the Custom Service Inference Endpoint, let’s generate some text embeddings using it.
POST _inference/text_embedding/inference_service_name
{
  "input": ["The quick brown fox jumps over the lazy dog"]
}
The response contains the float embedding for our input.
{
  "text_embedding": [
    {
      "embedding": [
        -0.033294678,
        -0.010848999,
        ...
      ]
    }
  ]
}
Wrapping up
Connecting to new Inference providers or bespoke services is now easy and seamless with the Elastic Open Inference API. For more examples of how to leverage the Custom Service Integration, take a look at: https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-custom. We continue to bring state-of-the-art AI tools and providers to Elasticsearch, and we hope you are as excited as we are about our Custom service Integration! Head over to Elasticsearch Labs to explore building search applications for generative AI with Elastic.
Test Elastic's leading-edge, out-of-the-box capabilities. Dive into our sample notebooks, start a free cloud trial, or try Elastic on your local machine now.