Mixed Data ML #668


Draft: wants to merge 17 commits into develop

Conversation

@luisheb (Contributor) commented on Apr 17, 2025

Describe the proposed changes

Metrics, representation and machine learning functionality related to Mixed Data
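For context, a rough sketch of the kind of mixed-data distance this is about: a weighted combination of one metric per DataFrame column, e.g. an L2 distance for a functional column and an absolute difference for a scalar column. The layout, names, and combination rule below are only illustrative assumptions, not the API added here.

import numpy as np
import pandas as pd
from skfda import FDataGrid
from skfda.misc.metrics import l2_distance

# Two observations, each with one functional feature and one scalar feature.
grid = np.linspace(0, 1, 50)
curves = FDataGrid(np.sin(np.outer([1.0, 2.0], np.pi * grid)), grid)
df = pd.DataFrame({"curve": [curves[0], curves[1]], "age": [30.0, 45.0]})

# One distance per column, then a weighted sum across columns.
functional_part = l2_distance(df["curve"].iloc[0], df["curve"].iloc[1])[0]
scalar_part = abs(df["age"].iloc[0] - df["age"].iloc[1])
weights = np.array([1.0, 0.5])
mixed_distance = weights[0] * functional_part + weights[1] * scalar_part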

Checklist before requesting a review

  • I have performed a self-review of my code
  • The code conforms to the style used in this package
  • The code is fully documented and typed (type-checked with Mypy)
  • I have added thorough tests for the new/changed functionality

@luisheb changed the title from "Mixed data ml" to "Mixed Data ML" on Apr 17, 2025
@vnmabus (Member) left a comment:

Just a quick review, without looking deeply into the code.

Important:

  • This PR mixes different features (style fixes, some different distances, means and stds, centering and standardization). Please try to separate each of them into its own PR if possible, so that I can review and merge the parts that are ready while you improve the others in the meantime.
  • Add docstrings (with formulas) to clarify things.



def duoble_mean(X: FData) -> FData:
"""Compute the double mean of a FData object.
vnmabus (Member):

Move that to the next line.

Suggested change:
-    """Compute the double mean of a FData object.
+    """
+    Compute the double mean of a FData object.

@@ -361,3 +366,91 @@ def trim_mean(
trimmed_curves = X[indices_descending_depth[:n_samples_to_keep]]

return trimmed_curves.mean()


def duoble_mean(X: FData) -> FData:
vnmabus (Member):

Typo.

Suggested change:
-def duoble_mean(X: FData) -> FData:
+def double_mean(X: FData) -> FData:

vnmabus (Member):

Do we want to offer the double mean at all? I have only seen it used in the context of distance covariance/correlation, so I do not know how useful would that be.

individual_observation_means = average_function_value(X)
if isinstance(X, FDataBasis):
mean_function = X.mean()

vnmabus (Member):

I would say something is missing here, otherwise it will raise an exception because double_fdata has no value.
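For reference, the quantities in the quoted snippet (the per-observation domain averages and the pointwise mean function) are the ones that appear in the double-centering step of distance covariance/correlation: X_i(t) - mean function x(t) - per-observation average m_i + grand mean. A minimal grid-data sketch under that reading; the name and semantics here are assumptions, not necessarily what this PR implements.

import numpy as np
from skfda import FDataGrid


def double_center_sketch(X: FDataGrid) -> FDataGrid:
    """Doubly center grid data over samples and grid points.

    Computes X_i(t) - x(t) - m_i + grand mean, using plain averages over
    grid points as a stand-in for proper domain integration.
    """
    values = X.data_matrix  # shape: (n_samples, n_points, dim_codomain)
    per_sample = values.mean(axis=1, keepdims=True)  # m_i, one per observation
    per_point = values.mean(axis=0, keepdims=True)   # pointwise mean function
    grand = values.mean()                            # grand (double) mean
    return X.copy(data_matrix=values - per_sample - per_point + grand)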

return double_fdata

def individual_observation_mean(X: FData) -> NDArrayFloat:
"""Compute the grand mean of a FData object.
vnmabus (Member):

Wrong docstring?

return res # type: ignore[no-any-return]


def weighted_lp_norm(
vnmabus (Member):

Functional version not needed.

Callable[[GridPointsLike], NDArrayFloat] | float | None
) = None,
) -> None:

vnmabus (Member):

Missing docstring.

) = None,
) -> None:

# Checks that the lp normed is well defined
vnmabus (Member):

I would love if the code of Lp and this one could be merged together or simplified in some way.
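One possible shape for that merge, assuming the weight can simply be absorbed into the data (the plain Lp norm of w(t)^(1/p) * x(t) equals the weighted Lp norm of x(t)); the function name and parameters below are illustrative, not the PR's API:

from __future__ import annotations

import numpy as np
from skfda import FDataGrid
from skfda.misc.metrics import lp_norm


def weighted_lp_norm_sketch(
    X: FDataGrid,
    *,
    p: float = 2,
    weight: np.ndarray | float | None = None,
) -> np.ndarray:
    """Weighted Lp norm, (integral of w(t) |x(t)|^p dt)^(1/p), for grid data."""
    if weight is None:
        # No weight given: fall back to the ordinary Lp norm.
        return lp_norm(X, p=p)
    w = np.asarray(weight, dtype=float)
    if w.ndim == 1:
        w = w[:, np.newaxis]  # align weights with the codomain axis
    # Absorb the (non-negative) weight into the data and reuse lp_norm:
    # integral of w |x|^p dt == integral of |w^(1/p) x|^p dt.
    scaled = X.copy(data_matrix=X.data_matrix * w ** (1 / p))
    return lp_norm(scaled, p=p)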

return res[0] if len(res) == 1 else res


def same_structure_and_data(df1: pd.DataFrame, df2: pd.DataFrame) -> bool:
vnmabus (Member):

This should be documented and probably moved to the validation module.
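A documented, validation-style sketch of what it could look like after the move; the exact notion of "same structure" checked here (columns, index and dtypes) is an assumption:

import pandas as pd


def check_same_structure_and_data(df1: pd.DataFrame, df2: pd.DataFrame) -> bool:
    """Check whether two DataFrames share structure and contents.

    Args:
        df1: First DataFrame.
        df2: Second DataFrame.

    Returns:
        True if both DataFrames have the same columns, index, dtypes and
        element values, False otherwise.
    """
    return (
        df1.columns.equals(df2.columns)
        and df1.index.equals(df2.index)
        and bool((df1.dtypes == df2.dtypes).all())
        and df1.equals(df2)
    )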

metrics = metric.metrics if metric.metrics else [l2_distance] * n_cols
weights = metric.weights if metric.weights else 1.0

if isinstance(metrics, Metric):
vnmabus (Member):

Maybe all these checks are cleaner with a match sentence?
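A rough sketch of how those checks might read with structural pattern matching (Python >= 3.10); the attribute handling mirrors the quoted lines, but the normalisation rules and the helper name are assumptions:

from __future__ import annotations

import numpy as np
from skfda.misc.metrics import l2_distance


def _normalise_metrics_and_weights(metrics, weights, n_cols: int):
    """Return one metric per column and one weight per column."""
    match metrics:
        case None:
            metrics = [l2_distance] * n_cols
        case list() if len(metrics) != n_cols:
            raise ValueError("Expected one metric per DataFrame column.")
        case list():
            pass
        case _:
            # A single metric given: apply it to every column.
            metrics = [metrics] * n_cols

    match weights:
        case None:
            weights = np.ones(n_cols)
        case float() | int():
            weights = np.full(n_cols, float(weights))
        case _:
            weights = np.asarray(weights, dtype=float)

    return metrics, weights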
