Remarks from Mark #105

amirrr opened this issue Jan 8, 2025 · 3 comments

Comments


amirrr commented Jan 8, 2025

  • Remove grayed out (disabled) menu items.
  • Features could not be removed from a project or excluded from it
  • Drag and drop did not work for papers in a project
  • The login page still requires logging in every time
  1. We need to accept comparisons, uploads, and infrastructure for
  2. Display ground truth (a way to document it so we can share our accuracy; see the sketch below)
  3. Show quality to the user.
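A rough sketch of what documenting ground truth and scoring model answers against it could look like. Everything here (GroundTruth, accuracy_by_feature, model_answers) is made up for illustration and is not existing code in the project:

```python
# Hypothetical sketch: store ground truth per paper/feature and compute
# per-feature accuracy against model answers. Names are illustrative only.
from dataclasses import dataclass


@dataclass
class GroundTruth:
    paper_id: str
    feature: str
    value: str  # the human-verified answer


def accuracy_by_feature(ground_truth: list[GroundTruth],
                        model_answers: dict[tuple[str, str], str]) -> dict[str, float]:
    """Fraction of papers where the model answer matches ground truth, per feature."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for gt in ground_truth:
        key = (gt.paper_id, gt.feature)
        if key not in model_answers:
            continue  # skip papers the model has not been run on yet
        total[gt.feature] = total.get(gt.feature, 0) + 1
        if model_answers[key] == gt.value:
            correct[gt.feature] = correct.get(gt.feature, 0) + 1
    return {f: correct.get(f, 0) / n for f, n in total.items()}
```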

amirrr commented Mar 14, 2025

TODOs with the new truth interface:

  • Save accuracy by project (later to be expanded platform-wide)
  • Accuracy should not be bucketed; the raw value should determine the color
  • Display a meter for the grading (0 = red, 1 = green)
  • Turn the calculation into a long-running process
  • Fix F1 macro/micro so it follows the formula in the Observable notebook (see the sketch after this list)
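A minimal sketch of macro- and micro-averaged F1 from the standard per-class definitions. The Observable notebook's exact formula is not reproduced in this issue, so treat the averaging choices here as an assumption to check against it:

```python
# Sketch of macro/micro F1 from standard definitions; the Observable notebook's
# exact formula is not shown in this issue, so this is an assumption.
from collections import Counter


def f1_scores(y_true: list[str], y_pred: list[str]) -> tuple[float, float]:
    """Return (macro_f1, micro_f1) over all classes seen in y_true or y_pred."""
    classes = set(y_true) | set(y_pred)
    if not classes:
        return 0.0, 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    def f1(t: int, p: int, n: int) -> float:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        return 2 * t / (2 * t + p + n) if (2 * t + p + n) else 0.0

    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro
```

One sanity check: for single-label classification, micro F1 equals plain accuracy, so the two should agree when each paper gets exactly one predicted and one true label per feature.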


amirrr commented Apr 4, 2025

  • Try to look for a certain feature at a certain depth, for example the second condition in the first experiment, and have the model answer 5 questions from that condition. Then ask it the same thing on 5 different questions from the same level and compare.
  • Run the same paper maybe 10 times to see the variance in the answers it gives (see the sketch after this list).
  • Run the same papers (a set we benchmark against) on all OpenAI models; compare price and accuracy.
  • For whichever model does best in the last step, run it ten times. Look at how much we learn about failures (1 disagreement out of 10 means 90% accuracy) -> are there features stuck at 50% (chance level or worse)?
  • Do the first task with categorical features -> they give a better idea of whether it is getting things right or wrong
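A hedged sketch of the repeated-run check, assuming the extraction call is passed in as a function; the real call to the model and the model names are not part of this issue:

```python
# Run the same paper N times and report, per feature, the share of runs that
# gave the most common answer. `extract` is a placeholder for whatever function
# actually queries the model and returns {feature: answer}.
from collections import Counter, defaultdict
from typing import Callable


def agreement_per_feature(extract: Callable[[str, str], dict[str, str]],
                          paper_id: str, model: str, runs: int = 10) -> dict[str, float]:
    answers: dict[str, list[str]] = defaultdict(list)
    for _ in range(runs):
        for feature, answer in extract(paper_id, model).items():
            answers[feature].append(answer)
    # 1.0 means fully stable answers; values near 0.5 on a binary feature are
    # no better than chance.
    return {
        feature: Counter(values).most_common(1)[0][1] / len(values)
        for feature, values in answers.items()
    }
```

The same helper can be reused for the model comparison step by calling it once per model on the benchmark set and comparing the resulting agreement (and cost) side by side.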


amirrr commented Apr 11, 2025

For adding new features, we want a separate place where users can define and test their new features (see the sketch below).

experiment name into -> categorical name (how does this new feature do and
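One way that "separate place to define and test a feature" could look, as a rough sketch only; FeatureDefinition and its fields are purely illustrative and do not exist in the project yet:

```python
# Purely illustrative: a user-defined feature plus a dry-run test of trial
# answers against the user's expected values on a few test papers.
from dataclasses import dataclass, field


@dataclass
class FeatureDefinition:
    name: str                       # e.g. "experiment design"
    kind: str                       # "categorical", "numeric", ...
    prompt: str                     # question the model is asked
    options: list[str] = field(default_factory=list)  # allowed answers if categorical


def test_feature(feature: FeatureDefinition,
                 answers: dict[str, str],
                 expected: dict[str, str]) -> float:
    """Share of test papers where the trial answer matches the expected value."""
    shared = set(answers) & set(expected)
    if not shared:
        return 0.0
    return sum(answers[p] == expected[p] for p in shared) / len(shared)
```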
