
add audio utils to handle model audio input #7850

Open · wants to merge 5 commits into main
Conversation


@pretbc pretbc commented Feb 25, 2025

[Feature] DSPy Audio/Video Support Tracking #7847

Followed the Gemini implementation from LiteLLM.

@isaacbmiller
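For readers unfamiliar with the LiteLLM Gemini convention this follows: inline audio rides in an `image_url` content part as a base64 data URL, with the mime type distinguishing it from an actual image. A minimal sketch of the idea (the helper name `audio_content_part` is hypothetical, for illustration only):

```python
import base64


def audio_content_part(wav_bytes: bytes) -> dict:
    """Build an OpenAI-style content part carrying WAV audio.

    The audio is base64-encoded into a data URL and sent under the
    image_url content type; the data URL's mime type (audio/wav) is
    what signals to the backend that this is audio, not an image.
    """
    b64 = base64.b64encode(wav_bytes).decode("utf-8")
    return {"type": "image_url", "image_url": f"data:audio/wav;base64,{b64}"}


# Placeholder bytes stand in for a real WAV file here.
part = audio_content_part(b"RIFF....WAVEdata")
```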

@pretbc
Author

pretbc commented Feb 25, 2025

Let me check the pytest results.

@pretbc
Author

pretbc commented Feb 27, 2025

Ran locally: ruff check . --fix-only

@isaacbmiller
Collaborator

Will try to review this weekend. Would it be possible for you to make a mini tutorial or demo just to show that it works?

@pretbc
Author

pretbc commented Feb 27, 2025

I'd suggest the snippet below as a demo:

import base64
import os
import pathlib
from typing import Literal

import dspy

lm = dspy.LM(
    "gemini-2.0-flash-exp", api_key=os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
)
dspy.configure(lm=lm)

audio_path = "path/to/file.wav"

# Read the audio file and base64-encode it for inline transport.
audio_data = pathlib.Path(audio_path).read_bytes()
audio_data_base64 = base64.b64encode(audio_data).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": (
                    "Analyze audio. Choose sentiment from: 'excited', 'neutral', "
                    "'confused', 'frustrated', 'happy', 'sad', 'angry'"
                ),
            },
            {
                # Audio travels through the image_url content type as a data URL.
                "type": "image_url",
                "image_url": f"data:audio/wav;base64,{audio_data_base64}",
            },
        ],
    }
]


class Classify(dspy.Signature):
    """Classify sentiment of a given audio clip."""

    audio: dspy.Audio = dspy.InputField()
    text: str = dspy.InputField(desc="Task description")
    sentiment: Literal[
        "excited", "neutral", "confused", "frustrated", "happy", "sad", "angry"
    ] = dspy.OutputField()


classify = dspy.Predict(Classify)

print(f"Call directly LLM: {lm(messages=messages)}")
print(f"Call DSPy signature: {classify(audio=dspy.Audio.from_file(audio_path), text='Analyze audio')}")

Output:

Call directly LLM: ['Confused.']
Call DSPy signature: Prediction(
    sentiment='confused'
)

Under the hood, the system sends the following for the signature call:

[{"role": "system", "content": "Your input fields are:\n1. `audio` (Audio)\n2. `text` (str): Task description\n\nYour output fields are:\n1. `sentiment` (Literal['excited', 'neutral', 'confused', 'frustrated', 'happy', 'sad', 'angry'])\n\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\n[[ ## audio ## ]]\n{audio}\n\n[[ ## text ## ]]\n{text}\n\n[[ ## sentiment ## ]]\n{sentiment}        # note: the value you produce must exactly match (no extra characters) one of: excited; neutral; confused; frustrated; happy; sad; angry\n\n[[ ## completed ## ]]\n\nIn adhering to this structure, your objective is: \n        Classify sentiment of a given sentence."}, {"role": "user", "content": [{"type": "text", "text": "[[ ## audio ## ]]"}, {"type": "image_url", "image_url": {"url": "data:audio/wav;base64,UklGRpLICABXQVZFZDOXj5XDk="}}, {"type": "text", "text": "[[ ## text ## ]]\nAnalyze audio\n\nRespond with the corresponding output fields, starting with the field `[[ ## sentiment ## ]]` (must be formatted as a valid Python Literal['excited', 'neutral', 'confused', 'frustrated', 'happy', 'sad', 'angry']), and then ending with the marker for `[[ ## completed ## ]]`."}]}]
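The user-message structure in that trace can be sketched as follows (a simplified reconstruction for illustration only; the real message is assembled by DSPy's adapter from the signature, and the helper name `build_user_content` is hypothetical):

```python
import base64


def build_user_content(audio_b64: str, task_text: str) -> list[dict]:
    """Approximate the user content shown in the trace above: each input
    field is introduced by a [[ ## name ## ]] marker, and the audio field
    becomes an image_url part wrapping a base64 data URL."""
    return [
        {"type": "text", "text": "[[ ## audio ## ]]"},
        {
            "type": "image_url",
            "image_url": {"url": f"data:audio/wav;base64,{audio_b64}"},
        },
        {"type": "text", "text": f"[[ ## text ## ]]\n{task_text}"},
    ]


content = build_user_content(base64.b64encode(b"...").decode(), "Analyze audio")
```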

@pretbc
Author

pretbc commented Mar 3, 2025

I'll check for any conflicts after the review.

@pretbc force-pushed the feature/audio_utils branch from 37a8a61 to 0567ab7 on March 3, 2025 at 12:31