release truth vectors? #3

Open · dribnet opened this issue Jan 20, 2025 · 3 comments

dribnet commented Jan 20, 2025

Thanks for providing code to replicate the experiments!

Could you also provide the (optimal) truth vectors for the supported models?

dribnet (Author) commented Jan 24, 2025

Here are the t_g and t_p truth vectors for meta-llama/Meta-Llama-3-8B-Instruct (aka llama-3-8b-chat) I extracted by running the truth_directions code locally; these are meant to be used on layer 12 of the residual stream.

truth_vectors.npz.zip

I've done some initial tests on these and seem to be getting sensible results. For example, when I look at t_g cosine similarity on a dataset of captions, the lowest scores go to "low probability" (or perhaps just highly incongruous) descriptions like "monks playing rock music", "mona lisa smoking a cigar", and "an astronaut dog on martian terrain".
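
In case anyone wants to sanity-check the numbers, below is roughly how I'm computing the scores. This is only a minimal sketch, not the repo's evaluation code; I'm assuming the npz stores the vectors under keys `"t_g"` / `"t_p"`, and I pool at the last token, which may differ from how truth_directions aggregates activations.

```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
LAYER = 12  # the vectors above were fit on layer 12 of the residual stream

# load the attached archive (after unzipping truth_vectors.npz.zip);
# the key name "t_g" is an assumption about how the file is laid out
vecs = np.load("truth_vectors.npz")
t_g = torch.tensor(vecs["t_g"], dtype=torch.float32)

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def tg_score(text: str) -> float:
    """Cosine similarity between t_g and the layer-12 residual at the final token."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so index LAYER is the
    # residual stream after transformer block LAYER
    h = out.hidden_states[LAYER][0, -1, :].float()
    return torch.nn.functional.cosine_similarity(h, t_g, dim=0).item()

for caption in ["a dog playing in the park",
                "monks playing rock music",
                "mona lisa smoking a cigar"]:
    print(f"{caption!r}: {tg_score(caption):+.3f}")
```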

I can post other results on this thread, including truth vectors for other models, and/or consolidate these into a pull request if there is interest.

dribnet (Author) commented Jan 25, 2025

Out of curiosity I generated truth vectors for the new DeepSeek-R1-Distill-Llama-8B model (again for layer 12).

truth_vectors.npz.zip

Most of the stats seemed reasonable; the only surprise was that the separation scores were lower than for llama-3-8b-chat. But the shape of the curve across layers was right, and it still peaked around layer 12.

My plan is to examine data that scores high (or low) in cosine similarity to t_g in this model but not in the llama-3-8b-chat model to get a sense of what the 'diff' between these truth vectors might be.
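
Roughly what I have in mind for that comparison (purely a hypothetical sketch: `tg_score_llama` / `tg_score_r1` stand for the helper above instantiated with each model's weights and its own t_g, and `captions` is whatever caption set is being scored):

```python
# hypothetical sketch: rank captions by how differently the two t_g
# directions score them, to surface where the models disagree
scores_llama = {c: tg_score_llama(c) for c in captions}
scores_r1 = {c: tg_score_r1(c) for c in captions}

# sort by the gap; the two extremes are the interesting 'diff' cases
ranked = sorted(captions, key=lambda c: scores_r1[c] - scores_llama[c])
for c in ranked[:10] + ranked[-10:]:
    print(f"llama {scores_llama[c]:+.3f}  r1-distill {scores_r1[c]:+.3f}  {c}")
```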

LennartBuerger (Collaborator) commented

Hey Tom, thanks a lot for doing these very interesting experiments. Great to see that the DeepSeek Distill Llama also has this internal truthfulness representation. I am very sorry that I did not respond until now. I went on vacation and then forgot. I will monitor issues that have been opened in this repo more closely now :)
