release truth vectors? #3

Open · dribnet opened this issue Jan 20, 2025 · 3 comments

dribnet commented Jan 20, 2025

Thanks for providing code to replicate the experiments!

Could you also provide the (optimal) truth vectors for the supported models?

dribnet (Author) commented Jan 24, 2025

Here are the t_g and t_p truth vectors for meta-llama/Meta-Llama-3-8B-Instruct (aka llama-3-8b-chat) I extracted by running the truth_directions code locally; these are meant to be used on layer 12 of the residual stream.

truth_vectors.npz.zip

I've done some initial tests on these and seem to be getting sensible results. For example, when I look at t_g cosine similarity on a dataset of captions, the lowest scores go to "low probability" (or perhaps just highly incongruous) descriptions like "monks playing rock music", "mona lisa smoking a cigar", and "an astronaut dog on martian terrain".
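
In case anyone wants to sanity-check the numbers, below is roughly how I'm computing the scores. This is only a minimal sketch, not the repo's evaluation code; I'm assuming the npz stores the vectors under keys `"t_g"` / `"t_p"`, and I pool at the last token, which may differ from how truth_directions aggregates activations.

```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
LAYER = 12  # the vectors above were fit on layer 12 of the residual stream

# load the attached archive (after unzipping truth_vectors.npz.zip);
# the key name "t_g" is an assumption about how the file is laid out
vecs = np.load("truth_vectors.npz")
t_g = torch.tensor(vecs["t_g"], dtype=torch.float32)

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def tg_score(text: str) -> float:
    """Cosine similarity between t_g and the layer-12 residual at the final token."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so index LAYER is the
    # residual stream after transformer block LAYER
    h = out.hidden_states[LAYER][0, -1, :].float()
    return torch.nn.functional.cosine_similarity(h, t_g, dim=0).item()

for caption in ["a dog playing in the park",
                "monks playing rock music",
                "mona lisa smoking a cigar"]:
    print(f"{caption!r}: {tg_score(caption):+.3f}")
```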

I can post other results on this thread, including truth vectors for other models, and/or consolidate these into a pull request if there is interest.

dribnet (Author) commented Jan 25, 2025

Out of curiosity I generated truth vectors for the new DeepSeek-R1-Distill-Llama-8B model (again for layer 12).

truth_vectors.npz.zip

Most of the stats seemed reasonable; the only surprise was that the separation scores were lower than for llama-3-8b-chat. But the shape of the curve across layers was right, and it still peaked around layer 12.

My plan is to examine data that scores high (or low) in cosine similarity to t_g in this model but not in the llama-3-8b-chat model to get a sense of what the 'diff' between these truth vectors might be.
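
Roughly what I have in mind for that comparison (purely a hypothetical sketch: `tg_score_llama` / `tg_score_r1` stand for the helper above instantiated with each model's weights and its own t_g, and `captions` is whatever caption set is being scored):

```python
# hypothetical sketch: rank captions by how differently the two t_g
# directions score them, to surface where the models disagree
scores_llama = {c: tg_score_llama(c) for c in captions}
scores_r1 = {c: tg_score_r1(c) for c in captions}

# sort by the gap; the two extremes are the interesting 'diff' cases
ranked = sorted(captions, key=lambda c: scores_r1[c] - scores_llama[c])
for c in ranked[:10] + ranked[-10:]:
    print(f"llama {scores_llama[c]:+.3f}  r1-distill {scores_r1[c]:+.3f}  {c}")
```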

LennartBuerger (Collaborator) commented

Hey Tom, thanks a lot for doing these very interesting experiments. Great to see that the DeepSeek Distill Llama also has this internal truthfulness representation. I am very sorry that I did not respond until now. I went on vacation and then forgot. I will monitor issues that have been opened in this repo more closely now :)
