Set up feature Validation #80

Open
markwhiting opened this issue Jul 23, 2024 · 2 comments

Comments

@markwhiting
Member

Fundamentally, we only know if a feature is good if we can compare our result with some other (presumably more trustworthy) result, e.g., a human ground-truth rating.

We need to build in a system for checking and reporting quality so that users can quickly know what to trust and think about how to improve it.

Caveat: we may be able to aggregate answers across sources in some cases to validate columns.

@markwhiting
Member Author

markwhiting commented Jul 23, 2024

Proposal: a Validate action that lets me create human ratings on papers for a given column without seeing the model's rating, then uses these to check how well the model is doing and shows the results in context.

You are given papers that have not previously been validated; we store these ratings as ground truth and use them in downstream performance adjustments (e.g., DSL).

Highlight poorly performing columns with some kind of coloring, and on hover show details about the performance metric (F1, R², etc.) and its score.
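A minimal sketch of what that per-column check could look like, assuming Python with pandas and scikit-learn; the "<feature> TRUTH" column naming, the 0.7 flagging cutoff, and the DataFrame layout are all hypothetical, not the project's actual implementation:

```python
# Sketch: score one feature column against human ground-truth ratings and
# flag it for highlighting when agreement falls below a threshold.
import pandas as pd
from sklearn.metrics import f1_score, r2_score


def score_column(df: pd.DataFrame, feature_col: str, truth_col: str) -> dict:
    """Compare model ratings to ground truth on rows where truth exists."""
    rated = df.dropna(subset=[truth_col, feature_col])
    truth, pred = rated[truth_col], rated[feature_col]

    if pd.api.types.is_numeric_dtype(truth):
        metric, score = "R^2", r2_score(truth, pred)
    else:
        # Macro-averaged F1 weights every class equally, one reasonable
        # reading of the F1 mentioned above.
        metric, score = "F1", f1_score(truth, pred, average="macro")

    return {"column": feature_col, "metric": metric, "score": round(score, 3),
            "flag": score < 0.7}  # hypothetical cutoff for coloring the column


# Example usage on a downloaded view (file and column names are illustrative):
# papers = pd.read_csv("view.csv")
# print(score_column(papers, "participant_source", "participant_source TRUTH"))
```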

markwhiting mentioned this issue Aug 28, 2024
@markwhiting
Member Author

markwhiting commented Sep 24, 2024

Here are some more details on how the types of truth and validation might work...

  1. Let's use only two types of data: truth and measurement. Truth comes from a researcher and is considered 100% valid. A measurement comes from a feature provider, e.g., GPT, and is what is validated against truth. In this way, validation effectively reports measurement error.
  2. For items with truth, we want to use the appropriate metric to check how good the measurement is. If the item is numerical, R². If the item is categorical, we will use unbiased multiclass F1, and if the item is verbal, we will use a GPT comparison, e.g., "Do these things seem similar: yes or no?"
  3. Validation looks like: 1) download a view as CSV; 2) create a new validation column, e.g., participant_source TRUTH, for each validated column, and fill in scores for the non-blank validated items (see the merge sketch after this list). New truth values overwrite old ones, but by default we keep stored truth values and assume they remain true even if the feature version or provider changes.
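A minimal sketch of the merge step in point 3, assuming Python with pandas, the "<feature> TRUTH" naming from the participant_source example, and that both frames are indexed by the same paper identifier; all names here are illustrative rather than the actual codebase:

```python
# Sketch: merge a validated CSV back into stored ground truth. New truth
# values overwrite old ones; blanks fall back to previously stored truth.
import pandas as pd

TRUTH_SUFFIX = " TRUTH"


def merge_truth(stored: pd.DataFrame, uploaded: pd.DataFrame) -> pd.DataFrame:
    """Overlay uploaded truth columns onto stored truth, keeping old values for blanks."""
    merged = stored.copy()
    for col in (c for c in uploaded.columns if c.endswith(TRUTH_SUFFIX)):
        if col in merged.columns:
            # combine_first keeps the uploaded value where present,
            # otherwise the previously stored truth value.
            merged[col] = uploaded[col].combine_first(merged[col])
        else:
            merged[col] = uploaded[col]
    return merged


# Example usage (hypothetical file and index names):
# stored = pd.read_csv("ground_truth.csv", index_col="paper_id")
# uploaded = pd.read_csv("validated_view.csv", index_col="paper_id")
# merge_truth(stored, uploaded).to_csv("ground_truth.csv")
```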
