Chat Completions
The Chat Completions API is the main endpoint for generating AI responses with Oblix's intelligent orchestration.
Endpoint: /v1/chat/completions
This endpoint is compatible with OpenAI's Chat Completions API and is used to generate responses based on conversation history.
Method: POST
Request Body:
{
  "model": "auto",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are the three laws of robotics?"}
  ],
  "temperature": 0.7,
  "top_p": 1.0,
  "n": 1,
  "stream": false,
  "max_tokens": null,
  "presence_penalty": 0.0,
  "frequency_penalty": 0.0,
  "logit_bias": null,
  "user": null
}
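For reference, the endpoint can also be called directly over HTTP. The following is a minimal sketch using Python's requests library, assuming an Oblix server listening on localhost:62549 as in the SDK examples below:

import requests

# Minimal sketch of a raw POST to the endpoint; assumes an Oblix server
# on localhost:62549 (the address used in the examples below)
response = requests.post(
    "http://localhost:62549/v1/chat/completions",
    json={
        "model": "auto",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What are the three laws of robotics?"}
        ],
        "temperature": 0.7
    }
)
print(response.json()["choices"][0]["message"]["content"])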
Parameters
Parameter | Type | Default | Description
---|---|---|---
model | string | - | Always use "auto" so Oblix's intelligent orchestration selects the best model based on connectivity, system resources, and the specific request
messages | array | - | An array of message objects representing the conversation history
temperature | float | 0.7 | Controls randomness: lower values produce more focused, deterministic completions. Range: 0.0 to 2.0
top_p | float | 1.0 | Controls diversity via nucleus sampling: 0.5 means half of all likelihood-weighted options are considered
n | integer | 1 | Number of chat completion choices to generate for each input message
stream | boolean | false | If true, partial message deltas are sent as they are generated
max_tokens | integer | null | The maximum number of tokens to generate in the chat completion
presence_penalty | float | 0.0 | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far
frequency_penalty | float | 0.0 | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far
logit_bias | object | null | Modifies the likelihood of specified tokens appearing in the completion
user | string | null | A unique identifier representing your end user, which can help Oblix monitor and detect abuse
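To show how several of these parameters combine in a single request, here is an illustrative sketch (the parameter values are arbitrary examples, not tuned recommendations):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:62549/v1",
    api_key="placeholder"  # Required by OpenAI client but not used by Oblix
)

# Illustrative values only; see the table above for ranges and defaults
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Suggest names for a coffee shop."}],
    n=3,                    # Generate three alternative completions
    presence_penalty=0.5,   # Discourage revisiting topics already in the text
    frequency_penalty=0.5,  # Discourage repeating frequent tokens
    max_tokens=64           # Cap the length of each completion
)
for choice in response.choices:
    print(choice.index, choice.message.content)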
Response Structure
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1677858242,
  "model": "auto (selected: ollama:llama2)",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The Three Laws of Robotics as formulated by Isaac Asimov are:\n\n1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.\n\n2. A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.\n\n3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 30,
    "completion_tokens": 97,
    "total_tokens": 127
  }
}
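Because the response follows the OpenAI schema, the standard client exposes these fields as attributes. A short sketch, where response is the return value of client.chat.completions.create(...) as in the examples below:

# Sketch of reading fields from a completed (non-streaming) response
print(response.model)                     # e.g. "auto (selected: ollama:llama2)"
print(response.choices[0].finish_reason)  # e.g. "stop"
print(response.usage.total_tokens)        # prompt_tokens + completion_tokens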
Streaming Mode
Oblix supports streaming responses by setting the stream parameter to true; this is now the default behavior in the Python SDK. In streaming mode, the API returns a stream of events as the response is generated.
Example Python Code with Streaming:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:62549/v1",
    api_key="placeholder"  # Required by OpenAI client but not used by Oblix
)

stream = client.chat.completions.create(
    model="auto",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is quantum computing?"}
    ],
    stream=True,
    temperature=0.7
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
Using the Oblix Python SDK with Streaming (Default):
from oblix import OblixClient

client = OblixClient()
# Configure models...

# Stream is enabled by default (stream=True)
response = await client.execute(
    prompt="What is quantum computing?",
    temperature=0.7
)
# Response is streamed to the console automatically
Example: Basic Usage
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:62549/v1",
    api_key="placeholder"  # Required by OpenAI client but not used by Oblix
)

response = client.chat.completions.create(
    model="auto",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.7
)

print(response.choices[0].message.content)
Using the Oblix Python SDK:
from oblix import OblixClient

client = OblixClient()
# Configure models...

# Non-streaming execution with the response as the return value
response = await client.execute(
    prompt="What is the capital of France?",
    temperature=0.7,
    stream=False  # Disable streaming to get the complete response at once
)

print(response["response"])
Example: Setting Different Temperatures
# For more deterministic responses
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Write a poem about autumn."}],
    temperature=0.2  # Lower temperature for more focused, deterministic output
)

# For more creative responses
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Write a poem about autumn."}],
    temperature=1.2  # Higher temperature for more diverse, creative output
)
Using the Oblix Python SDK:
# For more deterministic responses
response = await client.execute(
    prompt="Write a poem about autumn.",
    temperature=0.2  # Lower temperature for more focused output
)

# For more creative responses
response = await client.execute(
    prompt="Write a poem about autumn.",
    temperature=1.2  # Higher temperature for more diverse output
)
Example: Limiting Response Length
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Explain the history of computing."}],
    max_tokens=100  # Limit the response to approximately 100 tokens
)
Using the Oblix Python SDK:
response = await client.execute(
    prompt="Explain the history of computing.",
    max_tokens=100  # Limit the response to approximately 100 tokens
)
Example: Chat Mode with Sessions
The Oblix Python SDK supports an interactive chat mode that helps maintain conversation context:
# Start an interactive chat session
result = await client.execute(
    prompt="Let's discuss quantum computing.",
    chat=True,   # Enable interactive chat mode
    stream=True  # Stream responses (default)
)

# The chat session ID is returned after the chat ends
session_id = result["session_id"]

# You can also resume an existing session
result = await client.execute(
    prompt="I have more questions about quantum computing.",
    session_id=existing_session_id,
    chat=True  # Start an interactive chat using this session
)
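By contrast, the OpenAI-compatible endpoint has no session object: the client maintains context by resending the accumulated conversation in the messages array. A minimal sketch, using the client configured as in the earlier examples:

# Multi-turn context with the OpenAI-compatible endpoint: the client
# appends each exchange and resends the full history on every request
messages = [{"role": "user", "content": "Let's discuss quantum computing."}]
first = client.chat.completions.create(model="auto", messages=messages)
messages.append({"role": "assistant", "content": first.choices[0].message.content})

messages.append({"role": "user", "content": "I have more questions about quantum computing."})
second = client.chat.completions.create(model="auto", messages=messages)
print(second.choices[0].message.content)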
Example: Adjusting Creativity with Top-p
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Generate a story idea."}],
    top_p=0.5  # Only consider the top 50% of probability mass for each token
)
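Using the Oblix Python SDK (a sketch only: this assumes execute() forwards top_p to the selected model the same way it forwards temperature, which is not shown elsewhere in this document):

# Assumption: execute() accepts top_p like it accepts temperature
response = await client.execute(
    prompt="Generate a story idea.",
    top_p=0.5  # Only consider the top 50% of probability mass for each token
)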