Hey all! I tried to track the number of tokens spent via `ChatResult.cost` across consecutive executions of my code. On the first run, when the cache gets created, the costs for usage including and excluding cached inference are identical. Running the code again (now that the cache exists), the cost including cached inference is the same as before, while the cost excluding cached inference is 0. My code:

```python
import autogen
import os
from dotenv import load_dotenv

load_dotenv()

config_list = [
    {
        "model": os.getenv("AZURE_OPENAI_API_DEPLOYMENT_NAME"),
        "api_type": "azure",
        "api_key": os.getenv("AZURE_OPENAI_API_KEY"),
        "base_url": os.getenv("AZURE_OPENAI_ENDPOINT"),
        "api_version": os.getenv("AZURE_OPENAI_API_VERSION"),
    }
]

llm_config = {
    "seed": 43,  # enables response caching ("cache_seed" in newer 0.2 releases)
    "config_list": config_list,
    "temperature": 0,
    "timeout": 120,
}

user_proxy = autogen.UserProxyAgent(
    name="User_Proxy",
    code_execution_config={
        "last_n_messages": 2,
        "work_dir": "coding",
        "use_docker": False,
    },
    max_consecutive_auto_reply=10,
    is_termination_msg=lambda x: x.get("content", "").rstrip().endswith("TERMINATE"),
    human_input_mode="NEVER",
    llm_config=llm_config,
)

entra_agent = autogen.AssistantAgent(
    name="entra_agent",
    llm_config=llm_config,
    system_message="""You are an expert Python developer specialized in working with Microsoft Entra. Your primary skills include:
...""",
)

departments = ["CS Technical", "CS Account", "CS Billing", "CS Other"]
tenant_id = os.getenv("MICROSOFT_ENTRA_TENANT_ID")
client_id = os.getenv("MICROSOFT_ENTRA_CLIENT_ID")
client_secret = os.getenv("MICROSOFT_ENTRA_CLIENT_SECRET")

task_message = f"""........"""

chat_result = user_proxy.initiate_chat(entra_agent, message=task_message)

if chat_result:
    cost_info = chat_result.cost
    if cost_info:
        print(cost_info)
        yes_cache = cost_info.get("usage_including_cached_inference", 0)
        no_cache = cost_info.get("usage_excluding_cached_inference", 0)
        print(f"\n\nTotal Cost with cached inference: {yes_cache}")
        print(f"Total Cost without cached inference: {no_cache}")
    else:
        print("Token usage information is not available.")
else:
    print("No ChatResult object returned.")
```

Output during the first execution:
Output during the second execution:
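For reference, `chat_result.cost` in 0.2 is a nested dict, roughly shaped as below (as I understand the layout; every number here is an illustrative placeholder, not my actual output):

```python
# Rough shape of chat_result.cost in AutoGen 0.2 (all numbers are placeholders).
cost_info = {
    "usage_including_cached_inference": {  # everything, cached or not
        "total_cost": 0.0123,
        "gpt-4": {  # one entry per model used
            "cost": 0.0123,
            "prompt_tokens": 350,
            "completion_tokens": 120,
            "total_tokens": 470,
        },
    },
    "usage_excluding_cached_inference": {  # only tokens actually sent to the API
        "total_cost": 0.0,  # drops to 0 once every response comes from the cache
    },
}

# Pulling the headline numbers instead of printing the whole sub-dicts:
with_cache = cost_info["usage_including_cached_inference"]["total_cost"]
without_cache = cost_info["usage_excluding_cached_inference"]["total_cost"]
print(f"Total cost incl. cached inference: {with_cache}")
print(f"Total cost excl. cached inference: {without_cache}")
```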
Does this mean the cache cuts down the token cost? If so, does it halve the cost or remove it entirely? Cheers! Thanks in advance!
Replies: 1 comment 1 reply
We stopped updating cost tracking, so the cost it reports is likely way off and unreliable now. (To your underlying question: a cache hit returns the stored response without calling the API, which is why `usage_excluding_cached_inference` drops to 0 on your second run.) Please use the latest version, which focuses on tracking token usage in the model clients.
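In the new API, per-request token usage comes back on the result object and the client keeps running totals. A minimal sketch, assuming the `autogen-ext` OpenAI client (for Azure you would use `AzureOpenAIChatCompletionClient` with your deployment details; the model name below is just an example):

```python
# Minimal sketch of token-usage tracking with an AutoGen 0.4 model client.
# Assumes autogen-core / autogen-ext are installed and OPENAI_API_KEY is set.
import asyncio

from autogen_core.models import UserMessage
from autogen_ext.models.openai import OpenAIChatCompletionClient


async def main() -> None:
    client = OpenAIChatCompletionClient(model="gpt-4o")  # example model name

    result = await client.create([UserMessage(content="Hello!", source="user")])
    # Per-request usage travels on the result:
    print(result.usage)  # RequestUsage(prompt_tokens=..., completion_tokens=...)

    # The client also keeps running counters across calls; see the docs for
    # the exact semantics of each:
    print(client.total_usage())
    print(client.actual_usage())

    await client.close()


asyncio.run(main())
```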