"""
Implementing and running a custom Experiment on a dataset
==========================================================

This notebook demonstrates how to implement a custom experiment that searches for hyperparameters on the `lisn` dataset and saves the scores of the experiment for each set of parameters.

We will derive a new `Experiment` class from the `cartodata.model_selection.experiment.BaseExperiment` class.

First we will define the necessary global variables.
"""

from pathlib import Path # noqa

ROOT_DIR = Path.cwd().parent

SOURCE = "authors"
NATURE = "articles"

# The directory where the artifacts of the experiment will be saved
TOP_DIR = ROOT_DIR / "experiment_custom"
# The directory where dataset.yaml files reside
CONF_DIR = ROOT_DIR / "conf"
# The directory where files necessary to load dataset columns reside
INPUT_DIR = ROOT_DIR / "datas"

TOP_DIR

###############################################################################
# Initialize Parameter Iterator
# -----------------------------
#
# We will initialize a parameter iterator to iterate through our parameters. We have two options: `GridIterator` and `RandomIterator`.

from cartodata.model_selection.iterator import GridIterator, RandomIterator # noqa

help(GridIterator)

""
help(RandomIterator)

###############################################################################
# We define the set of parameters that we want to test.

from cartodata.phases import PhaseProjectionND, PhaseProjection2D, PhaseClustering

params = {
    "robustseed": [0],
    "authors__filter_min_score": [4],
    "filter_min_score": [6],
    PhaseProjectionND.NAME: [
        {"key": ["lsa"], "num_dims": [50, 100], "extra_param": [True, False]},
        {"key": ["bert"], "family": ["all-MiniLM-L6-v2"]}
    ],
    PhaseProjection2D.NAME: [
        {"key": ["umap"], "n_neighbors": [10, 20, 50],
         "min_dist": [0.1, 0.25, 0.5], "metric": ["euclidean"]}
    ]
}


###############################################################################
# The `params` dictionary contains the parameters that will be used for generating matrices, projections, etc. For this experiment we are going to use the `cartodata.pipeline.datasets` module to create a dataset instance, the `cartodata.pipeline.projectionnd` and `cartodata.pipeline.projection2d` modules for projections, and the `cartodata.pipeline.clustering` module to generate clusters. The listed parameters are required by the constructors of the classes that we are going to use. The projection and clustering classes in these modules extend the `cartodata.pipeline.base.BaseEntity` class, which provides a `params` property returning the `key` and certain parameter values of the entity. This property can be used to generate a hierarchical directory structure corresponding to the entity/estimator used.
#
# In the `params` dictionary above, all fields correspond to parameters of the classes in the modules listed above, except `extra_param`. We will use this parameter for the n-dimensional projection to demonstrate how a custom parameter can affect the directory structure.

param_iterator = GridIterator(params_dict=params)
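
""
# Note: if we preferred random sampling to an exhaustive grid search, the other iterator could presumably be constructed the same way (an assumption based on the constructor signature shown by `help` above):
#
# param_iterator = RandomIterator(params_dict=params)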

""
param_iterator.params_frame.shape
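
""
# A quick sanity check on the grid size (a plain-Python sketch; it assumes GridIterator expands the Cartesian product of all listed values):
n_nD = 1 * 2 * 2 + 1  # lsa: key x num_dims x extra_param, plus one bert combination
n_2D = 3 * 3 * 1      # umap: n_neighbors x min_dist x metric
n_nD * n_2D           # expected number of rows in params_frame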

###############################################################################
# Implement custom Scoring
# --------------------------------------
#
# All available scoring classes are in the `cartodata.model_selection.scoring` module. We will run each of them, so we import them all.

from cartodata.model_selection.scoring import (
    NeighborsND, Neighbors2D, Comparative, TrustworthinessSklearn, 
    TrustworthinessUmap, Clustering, FinalScore
)

###############################################################################
# Besides the built-in scores, we can define a new custom score. When defining a new score, there are some points that we have to pay attention to:
#
# - The new scoring should inherit from the `cartodata.model_selection.scoring.ScoringBase` class.
# - It should define `KEY` as a class variable.
# - It should implement an `evaluate` function as a `classmethod`.
# - The `evaluate` function should return a `cartodata.model_selection.utils.Result` instance.
# - If the parameters of the scoring are to be listed in the final results, they should be appended to the result.
#

from cartodata.model_selection.scoring import ScoringBase
from cartodata.model_selection.utils import Result

class CustomScore(ScoringBase):
    KEY = "custom"
    DEFAULTS = {
        "factor": 20
    }

    @classmethod
    def evaluate(cls, key_nD, factor=20):
        result = Result()
        # we assume that key_nD is necessary to calculate this score;
        # factor is a scoring parameter that changes the value of the score
        # and is specified as a keyword argument
        result.add_score(f"custom_score_{key_nD}", factor)

        print(f"{cls.KEY} scores:\n{result.print()}")
        return result
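

###############################################################################
# Let's try the new scoring on its own before plugging it into the experiment. The key and the factor below are hypothetical values, chosen just to see the shape of the output.

CustomScore.evaluate("lsa", factor=10)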
        


###############################################################################
# Implement custom Experiment class
# --------------------------------------
#
#
# The `cartodata.model_selection.experiment.BaseExperiment` class implements the functions necessary to run an experiment. However, it does not implement the `run_steps` function that executes the steps of each iteration: the generation of matrices for each phase and the scoring of the results.
#
# `cartodata.pipeline.experiment.PipelineExperiment` is one example class that inherits from `BaseExperiment` and implements the `run_steps` function using the Pipeline API.
#
# Now we will implement a custom experiment. First let's have a look at the documentation for `BaseExperiment`.

from cartodata.model_selection.experiment import BaseExperiment  # noqa

help(BaseExperiment)

###############################################################################
# We extend the `BaseExperiment` class and implement the `run_steps` function.

from cartodata.operations import (
    dump_matrices, dump_scores, load_matrices_from_dumps, load_scores,
    dump_labels, load_labels
)
from cartodata.phases import (
    PhaseProjectionND, PhaseProjection2D, PhaseClustering, PhasePost
)

class Experiment(BaseExperiment):

    # assume we want to add custom score to nD phase
    # we can override add_nD_scores to add the new scoring to this phase
    def add_nD_scores(self, key_nD=None, dir_mat=None, dir_nD=None,
                      run_params={}):
        # call super class definition so that the previously defined scorings
        # for this phase run
        phase_result = super().add_nD_scores(key_nD, dir_mat, dir_nD, run_params)

        phase_result.print()
        # run our custom scoring
        if self._score_exists_in_list(CustomScore):
            custom_score_params = CustomScore.get_kwargs(self.score_params)
            result = CustomScore.evaluate(
                key_nD, **custom_score_params
            )
            result.append_run_params(run_params)
            result.append_score_params(custom_score_params)

            self.persist_result(result, PhaseProjectionND, phase_result,
                                dir_nD)
        return phase_result


    def _entity_matrices_exist(self, natures, matrix_type, working_dir=None):
        if natures is None or len(natures) == 0:
            print("No entity natures specified!")
            return False

        # if the dump for any nature is missing, the matrices
        # have to be (re)created
        for nature in natures:
            entity_files = working_dir.rglob(f"{nature}_{matrix_type}.*")
            if len(list(entity_files)) == 0:
                return False

        print(
            f"Matrices of type {matrix_type} for {', '.join(natures)}"
            " already exist. Skipping creation."
        )
        return True

    def run_steps(self, next_params):
        print(f"\nRunning experiment for parameters:\n{next_params}")
        #-------------------------------------
        # load dataset
        dataset = self._load_dataset(next_params)
        dataset.input_dir = self.input_dir
        dataset.update_top_dir(self.top_dir)

        # set top working dir for next_params
        robustseed = next_params["robustseed"]
        dir_mat = dataset.working_dir / str(robustseed) / dataset.params
        dir_mat.mkdir(parents=True, exist_ok=True)
            
        #-------------------------------------
        # create and save entity matrices
        if self._entity_matrices_exist(dataset.natures, "mat", dir_mat):
            matrices = load_matrices_from_dumps(dataset.natures, "mat", dir_mat)
            scores = load_scores(dataset.natures, dir_mat)
        else:
            matrices, scores = dataset.create_matrices_and_scores()
            dump_scores(dataset.natures, scores, dir_mat)
            dump_matrices(dataset.natures, matrices, "mat", dir_mat)


        #-------------------------------------
        # do n-dimensional projection
        projection_nD = PhaseProjectionND.get_executor(next_params)
        key_nD = projection_nD.key
        # First get the value of extra_param
        dict_projection_nD = next_params.get(PhaseProjectionND.NAME)
        extra_param = dict_projection_nD.get("extra_param", None)
        if extra_param is not None:
            # if it is required to use this parameter for nD projection,
            # the relevant class should also be modified to set self.extra_param
            # set this value to params of projection_nD instance
            projection_nD._add_to_params(["extra_param"])
            projection_nD.extra_param = extra_param
        dir_nD = dir_mat / projection_nD.params
        # The value of the extra_param will appear in the name of the 
        # directory generated for nD projection.
        dir_nD.mkdir(parents=True, exist_ok=True)

        if self._entity_matrices_exist(dataset.natures, key_nD, dir_nD):
            matrices_nD = load_matrices_from_dumps(dataset.natures, key_nD, dir_nD)
        else:
            matrices_nD = projection_nD.execute(matrices, dataset, dir_nD)
            print(f'{key_nD} matrices generated.')
            dump_matrices(dataset.natures, matrices_nD, key_nD, dir_nD)

        # add n-dimensional projection scores
        phase_result = self.add_nD_scores(
            key_nD, dir_mat, dir_nD, run_params=projection_nD.params_values
        )
        self.result.append(phase_result)
        self.result.persist_scores(dir_nD)
        print(
            f"{PhaseProjectionND.long_name()} scores:\n{phase_result.print()}"
        )

        #-------------------------------------
        # do 2-dimensional projection
        projection_2D = PhaseProjection2D.get_executor(next_params)
        key_2D = projection_2D.key
        dir_2D = dir_nD / projection_2D.params
        dir_2D.mkdir(parents=True, exist_ok=True)

        if self._entity_matrices_exist(dataset.natures, key_2D, dir_2D):
            matrices_2D = load_matrices_from_dumps(dataset.natures, key_2D, dir_2D)
        else:
            matrices_2D = projection_2D.execute(matrices_nD, dir_2D)
            print(f'{key_2D} matrices generated.')
            dump_matrices(dataset.natures, matrices_2D, key_2D, dir_2D)

        # save 2D plots
        title_parts = [dataset.name,
                       dataset.version,
                       projection_nD.params,
                       projection_2D.params]
        fig = self.save_plots(dataset.natures, matrices_2D, dir_2D, title_parts)
        
        # add 2-dimensional projection scores
        phase_result = self.add_2D_scores(
            dataset.natures, dir_mat, key_nD, key_2D, dir_nD,
            dir_2D, plots=[fig], run_params=projection_2D.params_values
        )
        self.result.append(phase_result)
        self.result.persist_scores(dir_2D)
        print(
            f"{PhaseProjection2D.long_name()} scores:\n{phase_result.print()}"
        )

        #-------------------------------------
        # do clustering
        clustering = PhaseClustering.get_executor(next_params)
        cluster_natures = clustering.natures
        key_clus = clustering.key
        dir_clus = dir_2D / clustering.params
        dir_clus.mkdir(parents=True, exist_ok=True)

        if (self._entity_matrices_exist(cluster_natures, key_2D, dir_clus) and
                self._entity_matrices_exist(cluster_natures, key_nD, dir_clus)):
            clus_scores = load_scores(cluster_natures, dir_clus)
            clus_nD = load_matrices_from_dumps(cluster_natures, key_nD, dir_clus)
            clus_2D = load_matrices_from_dumps(cluster_natures, key_2D, dir_clus)

            clus_eval_pos = load_scores(cluster_natures, dir_clus,
                                        suffix="eval_pos")
            clus_eval_neg = load_scores(cluster_natures, dir_clus,
                                        suffix="eval_neg")
            clus_labels = load_labels(cluster_natures, dir_clus)
        else:
            (clus_nD, clus_2D, clus_scores,
             clus_labels, clus_eval_pos,
             clus_eval_neg) = clustering.create_clusters(
                matrices, matrices_2D, matrices_nD, scores, dataset.corpus_index
            )

            dump_scores(cluster_natures, clus_scores, dir_clus)
            dump_matrices(cluster_natures, clus_nD, key_nD, dir_clus)
            dump_matrices(cluster_natures, clus_2D, key_2D, dir_clus)
            dump_scores(cluster_natures, clus_eval_pos, dir_clus, suffix="eval_pos")
            dump_scores(cluster_natures, clus_eval_neg, dir_clus, suffix="eval_neg")
            dump_labels(cluster_natures, clus_labels, dir_clus)

        # save clustering plots
        figs = []
        for i, nature in enumerate(cluster_natures):
            clus_scores_i = clus_scores[i]
            clus_mat_i = clus_2D[i]
            title_parts = [dataset.name,
                           dataset.version,
                           projection_nD.params,
                           projection_2D.params,
                           clustering.key,
                           nature]
            fig = self.save_plots(
                dataset.natures, matrices_2D, dir_clus, title_parts, 
                annotations=clus_scores_i.index, annotation_mat=clus_mat_i
            )
            figs.append(fig)
        
        # add clustering scores
        phase_result = self.add_clustering_scores(
            dataset.natures, cluster_natures, key_nD, key_2D,
            dir_mat, dir_nD, dir_2D, dir_clus, dataset.corpus_index,
            plots=figs, run_params=clustering.params_values
        )
        self.result.append(phase_result)
        self.result.persist_scores(dir_clus)
        print(
            f"{PhaseClustering.long_name()} scores:\n{phase_result.print()}"
        )
        
        #-------------------------------------
        phase_result = self.add_post_scores(dir_clus)
        self.result.append(phase_result)
        self.result.persist_scores(dir_clus)
        print(
            f"{PhasePost.long_name()} scores:\n{phase_result.print()}"
        )

        self.finish_iteration(next_params, dir_clus)
        
        print("Finished running step!")



###############################################################################
# Now we can initialize an `Experiment` instance to run our scoring.
#
# We specify which scores the experiment should calculate using the `score_list` parameter. If we do not specify it, the experiment will evaluate scores for all available scoring classes.
#
# We will do it explicitly and specify all classes defined in the `cartodata.model_selection.scoring` module, plus the `CustomScore` class that we defined above.
#
# It is possible to specify parameters for score classes as keyword arguments.
#
# For example, if `cartodata.model_selection.scoring.FinalScore` is specified in the `score_list`, the experiment calculates an aggregated score at the end of each run by taking the average of all scores. If we want only a subset of scores to be included in the average, we can specify it using `final_score__name_list`. In general, for each scoring class the parameter should be named in the format `KEY__parameter`.
#
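
###############################################################################
# As a small illustration of this naming convention, here is a plain-Python sketch (not the library's actual implementation): the part before `__` selects the scoring class by its `KEY`, and the part after it is the parameter name passed to the scoring.

kwargs = {"custom__factor": 10, "neighbors__min_score": 30}
custom_kwargs = {k.split("__", 1)[1]: v for k, v in kwargs.items()
                 if k.startswith(f"{CustomScore.KEY}__")}
custom_kwargs  # expected: {'factor': 10}

""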

from cartodata.phases import PhaseProjectionND, PhaseProjection2D, PhaseClustering

experiment = Experiment(
    "lisn", "2022.11.15.1", TOP_DIR, CONF_DIR, INPUT_DIR,
    NATURE, SOURCE, param_iterator,
    score_list=[CustomScore,
                NeighborsND,
                Neighbors2D,
                Comparative,
#                TrustworthinessSklearn,
#                TrustworthinessUmap,
                Clustering,
                FinalScore],
    final_score__name_list=[
        PhaseProjectionND.prefix("neighbors_articles_authors"), 
        PhaseProjection2D.prefix("neighbors_articles_authors"), 
        PhaseClustering.prefix("clu_score")],   
    neighbors__recompute=True,
    neighbors__min_score=30,
#    trustworthiness_sklearn__n_neighbors=10,
    custom__factor=10
)

###############################################################################
# We have initialized the experiment and specified 
#
# - ```
#   name_list=[
#         PhaseProjectionND.prefix("neighbors_articles_authors"), 
#         PhaseProjection2D.prefix("neighbors_articles_authors"), 
#         PhaseClustering.prefix("clu_score")
#     ]
#   ```
#   for `FinalScore` scoring
# - `recompute=True`, `min_score=30` for `NeighborsND` and `Neighbors2D` scoring
# - `n_neighbors=10` for `TrustworthinessSklearn` (left commented out above, since the trustworthiness scorings are skipped in this run)
# - `factor=10` for `CustomScore`
#
# Now we will run the experiment for 3 different sets of parameters.

results = experiment.run(3)

###############################################################################
# When the experiment is run, the results of all runs are saved in `experiment.results`. We can access the values corresponding to each run with `experiment.results.runs_`.

experiment.results.runs_[0].scores

""
list(experiment.results.runs_[0].desc_scores.keys())

""
experiment.results.runs_[0].desc_scores

""
list(experiment.results.runs_[0].raw_scores.keys())

""
experiment.results.runs_[0].raw_scores

""
results.print_best(n=20)

###############################################################################
# Let's look at some of the results of the experiment on the file system.
#
# We will first check the contents of the `scores` directory.

# !ls $TOP_DIR/scores

###############################################################################
# The `6_pst__final_score.csv` file contains the final score for each set of parameters together with the parameter values. We had 3 runs, so there are 3 rows.

# !cat $TOP_DIR/scores/6_pst__final_score.csv

###############################################################################
# The `final_results.csv` file displays each score calculated during the experiment in a separate column, together with a `rank` and an aggregated score `agscore`. The `agscore` value is the same as the value in the `6_pst__final_score.csv` file.

# !cat $TOP_DIR/scores/final_results.csv

###############################################################################
# Each of the other files in the directory contains a single score calculated across all runs. For example, the `2_nD__neighbors_articles_authors.csv` file contains the `2_nD__neighbors_articles_authors` scores for the 3 runs.

# !cat $TOP_DIR/scores/2_nD__neighbors_articles_authors.csv

###############################################################################
# The files that contain `det` in their names contain the neighbors and their scores that were used to calculate the `2_nD__neighbors_articles_authors` score for each run.

# !cat $TOP_DIR/scores/2_nD__neighbors_articles_authors_det.csv

###############################################################################
# These files also reside in the hierarchical dataset directories generated during the run.
#
# For example, the `experiment_custom/lisn/2022.11.15.1/0/mat_articles__authors_4_teams_4_labs_4_words_10_0.05_None_None_5_4/lsa_50_True_True/scores_2_nD__neighbors_articles_authors_det.csv` file contains the same details, but only for the specific run whose hyperparameters and scoring parameters are encoded in the path.

# !cat $TOP_DIR/lisn/2022.11.15.1/0/mat_articles__authors_4_teams_4_labs_4_words_10_0.05_None_None_5_4/lsa_50_True_True/scores_2_nD__neighbors_articles_authors_det.csv

###############################################################################
# In fact, for each set of parameters the experiment generates a directory structure of the form:
#
# `top_dir / dataset / dataset_version / robustseed / dataset_column_parameters / projection_nd_key_dim / projection2D_key_n_neighbors_min_dist_metric_init_learning_rate_repulsion_strength / clustering_key_base_factor`.
#
# Each score calculated at a certain level in the directory structure is saved in that directory.
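#
# A quick way to inspect this hierarchy is to walk the tree under `TOP_DIR` (a sketch; the exact depth and the directory names depend on the parameter values of each run):

for path in sorted(p for p in (TOP_DIR / "lisn").rglob("*") if p.is_dir()):
    print(path.relative_to(TOP_DIR))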

###############################################################################
# Now, we will continue the experiment to run 3 more sets of parameters.

experiment.run(3)

###############################################################################
# We can see that we have run the first 6 parameter sets in the dataframe.

len(experiment.results.runs_)

###############################################################################
# Let's say we stop the experiment at this point. The parameter sets that have already been run are saved as a CSV file in `TOP_DIR`.

# !ls $TOP_DIR

""
import pandas as pd

df = pd.read_csv(TOP_DIR / "experiment.csv", index_col=0)

df

###############################################################################
# When we want to continue the experiment, we initialize a new parameter iterator from this file and run the experiment for the parameter sets that were not used during the previous runs.

experiment_file = TOP_DIR / "experiment.csv"

param_iterator = GridIterator(csv_filepath=experiment_file)

""
experiment = Experiment(
    "lisn", "2022.11.15.1", TOP_DIR, CONF_DIR, INPUT_DIR,
    NATURE, SOURCE, param_iterator,
    score_list=[CustomScore,
                NeighborsND,
                Neighbors2D,
                Comparative,
#                TrustworthinessSklearn,
#                TrustworthinessUmap,
                Clustering,
                FinalScore],
    final_score__name_list=[
        PhaseProjectionND.prefix("neighbors_articles_authors"), 
        PhaseProjection2D.prefix("neighbors_articles_authors"), 
        PhaseClustering.prefix("clu_score")],   
    neighbors__recompute=True, 
    neighbors__min_score=30,
#    trustworthiness_sklearn__n_neighbors=10,
    custom__factor=10
)

""
experiment.run(2)


###############################################################################
# We can verify that the experiment was run for 2 new sets of parameters:

df = pd.read_csv(TOP_DIR / "experiment.csv", index_col=0)

df

###############################################################################
# We can also view the results file:

df = pd.read_csv(TOP_DIR / "scores/final_results.csv", index_col=0)

df

""
df = pd.read_csv(TOP_DIR / "scores/2_nD__neighbors_articles_authors.csv")

df