Set up feature Validation #80

Open
markwhiting opened this issue Jul 23, 2024 · 2 comments

Comments

@markwhiting
Member

Fundamentally, we only know if a feature is good if we can compare our result with some other (presumably more trustworthy) result, e.g., a human ground-truth rating.

We need to build in a system for checking and reporting quality so that users can quickly know what to trust and think about how to improve it.

Caveat: we may be able to aggregate answers across sources in some cases to validate columns.

@markwhiting
Member Author

markwhiting commented Jul 23, 2024

Proposal: a Validate action that lets me create human ratings on papers for a given column without seeing the model's rating, then uses these to check how well the model is doing and shows the results in context.

You are given papers that have not previously been validated; we store these ratings as ground truth and use them in downstream performance adjustments (e.g., DSL).

Highlight poorly performing columns with some kind of coloring, and on hover show details about the performance metric (F1, R², etc.) and its score.
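A minimal sketch of what that per-column check could look like, assuming Python with pandas and scikit-learn; the "<feature> TRUTH" column naming, the 0.7 flagging cutoff, and the DataFrame layout are all hypothetical, not the project's actual implementation:

```python
# Sketch: score one feature column against human ground-truth ratings and
# flag it for highlighting when agreement falls below a threshold.
import pandas as pd
from sklearn.metrics import f1_score, r2_score


def score_column(df: pd.DataFrame, feature_col: str, truth_col: str) -> dict:
    """Compare model ratings to ground truth on rows where truth exists."""
    rated = df.dropna(subset=[truth_col, feature_col])
    truth, pred = rated[truth_col], rated[feature_col]

    if pd.api.types.is_numeric_dtype(truth):
        metric, score = "R^2", r2_score(truth, pred)
    else:
        # Macro-averaged F1 weights every class equally, one reasonable
        # reading of the F1 mentioned above.
        metric, score = "F1", f1_score(truth, pred, average="macro")

    return {"column": feature_col, "metric": metric, "score": round(score, 3),
            "flag": score < 0.7}  # hypothetical cutoff for coloring the column


# Example usage on a downloaded view (file and column names are illustrative):
# papers = pd.read_csv("view.csv")
# print(score_column(papers, "participant_source", "participant_source TRUTH"))
```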

markwhiting mentioned this issue Aug 28, 2024
@markwhiting
Member Author

markwhiting commented Sep 24, 2024

Here are some more details on how the types of truth and validation might work...

  1. Let's use only two types of data: truth and measurement. Truth comes from a researcher and is considered 100% valid. A measurement comes from a feature provider, e.g., GPT, and is what is validated against truth. In this way, validation effectively reports measurement error.
  2. For items with truth, we want to use the appropriate metric to check how good the measurement is. If the item is numerical, R². If the item is categorical, we will use unbiased multiclass F1, and if the item is verbal, we will use a GPT comparison, e.g., "Do these things seem similar: yes or no?"
  3. Validation looks like: 1) download a view as CSV; 2) create a new validation column, e.g., participant_source TRUTH, for each validated column, and fill in scores for the non-blank validated items (see the merge sketch after this list). New truth values overwrite old ones, but by default we keep stored truth values and assume they remain true even if the feature version or provider changes.
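A minimal sketch of the merge step in point 3, assuming Python with pandas, the "<feature> TRUTH" naming from the participant_source example, and that both frames are indexed by the same paper identifier; all names here are illustrative rather than the actual codebase:

```python
# Sketch: merge a validated CSV back into stored ground truth. New truth
# values overwrite old ones; blanks fall back to previously stored truth.
import pandas as pd

TRUTH_SUFFIX = " TRUTH"


def merge_truth(stored: pd.DataFrame, uploaded: pd.DataFrame) -> pd.DataFrame:
    """Overlay uploaded truth columns onto stored truth, keeping old values for blanks."""
    merged = stored.copy()
    for col in (c for c in uploaded.columns if c.endswith(TRUTH_SUFFIX)):
        if col in merged.columns:
            # combine_first keeps the uploaded value where present,
            # otherwise the previously stored truth value.
            merged[col] = uploaded[col].combine_first(merged[col])
        else:
            merged[col] = uploaded[col]
    return merged


# Example usage (hypothetical file and index names):
# stored = pd.read_csv("ground_truth.csv", index_col="paper_id")
# uploaded = pd.read_csv("validated_view.csv", index_col="paper_id")
# merge_truth(stored, uploaded).to_csv("ground_truth.csv")
```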
