Using the Pipeline API with a YAML file

This notebook demonstrates the usage of the Pipeline API by loading a pipeline from a pipeline.yaml file.

For creating a pipeline using the Pipeline constructor, please see the notebook [pipeline_lisn_lsa_kmeans.py](./pipeline_lisn_lsa_kmeans.py).

The YAML files that configure the Pipeline are located in the ../conf directory. For each dataset there is a directory under conf, and each dataset directory contains dataset.yaml and pipeline.yaml files.

We can list all the datasets where a pipeline is defined:

from pathlib import Path

ROOT_DIR = Path.cwd().parent

CONF_DIR = ROOT_DIR / "conf"

""
# !ls $CONF_DIR
''

We will work with the lisn dataset.

# !ls $CONF_DIR/lisn

Here we see both dataset.yaml and pipeline.yaml files. dataset.yaml is used by pipeline.yaml to initialize the lisn dataset. Let’s view the contents of dataset.yaml:

# !cat $CONF_DIR/lisn/dataset.yaml

Let’s view the contents of the pipeline.yaml file:

# !cat $CONF_DIR/lisn/pipeline.yaml

Load Pipeline from YAML file

We can load this pipeline using cartodata.pipeline.loader.load_pipeline.

We will run this in the project’s root directory (the parent of the examples directory), as the paths in the pipeline.yaml file are relative to the project’s root directory.

from cartodata.pipeline.loader import load_pipeline  # noqa
import os  # noqa
from pathlib import Path # noqa

# run in parent directory
os.chdir("../")

current_dir = Path(".").absolute()
conf_dir = current_dir / "conf"

pipeline = load_pipeline("lisn", conf_dir)

We can simply run the pipeline with the configuration specified in the pipeline.yaml file.

We do not want to use the dump files from a previous run, so we set force=True. To see the results of the run, we also set save_plots=True.

pipeline.run(save_plots=True, force=True)
Downloading data from https://zenodo.org/records/7323538/files/lisn_2000_2022.csv (6.3 MB)


file_sizes:   0%|                                   | 0.00/6.59M [00:00<?, ?B/s]
file_sizes:  45%|███████████▋              | 2.97M/6.59M [00:00<00:00, 28.9MB/s]
file_sizes: 100%|██████████████████████████| 6.59M/6.59M [00:00<00:00, 47.8MB/s]
Successfully downloaded file to dumps/lisn/2022.11.15.1/lisn_2000_2022.csv

This runs the following steps (each can also be run individually, as sketched after the list):

  • loads the dataset data from file

  • generates entity matrices for all natures and saves them

  • executes projection and saves the results

  • executes 2D projection and saves the results

  • creates clusters

  • finds neighbors

  • saves all data to export.feather file
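
pipeline.run is a convenience wrapper around these steps; each one corresponds to a pipeline method that we will also call individually later in this notebook. A minimal sketch of the equivalent step-by-step run:

# Step-by-step equivalent of pipeline.run(save_plots=True, force=True);
# a sketch using the methods demonstrated later in this notebook.
matrices, scores = pipeline.generate_entity_matrices(force=True)
matrices_nD = pipeline.do_projection_nD(force=True)
matrices_2D = pipeline.do_projection_2D(force=True)
clustering_results = pipeline.do_clustering(force=True)
pipeline.find_neighbors()
pipeline.export()
pipeline.save_plots()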

The results are saved under pipeline.working_dir.

for file in pipeline.working_dir.iterdir():
    print(file)
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_eval_negscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_labels.json
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_labels.json
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_eval_posscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_eval_negscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_labels.json
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_eval_posscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_eval_negscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_eval_negscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_ll_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_vll_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_eval_posscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_eval_posscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_eval_posscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_eval_posscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_labels.json
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_eval_posscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_ml_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lsa_components.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_eval_negscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_hl_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_eval_posscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_eval_negscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_eval_negscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_eval_negscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/export.feather
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_scores.csv.bz2

Let’s view the export.feather file:

import pandas as pd  # noqa

df = pd.read_feather(pipeline.working_dir / "export.feather")
df.head()
nature label score rank x y nn_articles nn_authors nn_teams nn_labs nn_words labs authors year url
0 articles Termination and Confluence of Higher-Order Rew... 1.0 0 2.062684 0.103818 0,30,662,13,17,15,432,4,1996,2049,2909,3270,20... 4290,4263,4323,4305,4329,4368,4273,4280,4543,4... 4956,4964,4963,4974,4982,4969,4960,4977,4966,4... 5134,5115,5384,5014,5081,5036,5109,5296,5515,5... 7442,8525,7440,9211,9948,8627,9793,7591,6012,9... 4992,4991,4990,4989, 2000 inria-00105556
1 articles Efficient Self-stabilization 1.0 1 2.134408 3.553994 1,60,358,22,878,233,409,1941,198,893,7,212,34,... 4262,4273,4271,4303,4385,4282,4268,4408,4538,4... 4956,4964,4963,4974,4982,4957,4969,4977,4960,4... 5024,5110,5103,5012,5031,5425,5185,5194,5174,5... 9560,9561,9897,6756,6757,9972,9144,8556,7655,6... 4992,4991,4990,4989, 4262 2000 tel-00124843
2 articles Resource-bounded relational reasoning: inducti... 1.0 2 2.999778 6.918508 2,134,196,11,2615,581,2198,289,286,1463,2001,2... 4263,4329,4345,4368,4273,4411,4543,4731,4310,4... 4956,4964,4963,4974,4982,4969,4960,4977,4966,4... 5070,5116,5056,5225,5349,5384,5081,5141,5458,5... 7926,9072,7601,8839,7591,8980,8412,7590,8573,7... 4992,4990,4989,4991,4994,4993, 4263 2000 hal-00111312
3 articles Reasoning about generalized intervals : Horn r... 1.0 3 0.200045 3.389739 3,98,2163,274,3016,1505,2995,3278,3064,2306,15... 4305,4323,4278,4273,4411,4340,4374,4322,4308,4... 4956,4964,4963,4974,4969,4977,4966,4960,4962,4... 5116,5003,5373,5117,5317,5004,5338,5292,5258,5... 7733,7321,5651,6461,6209,9868,6163,6426,8631,6... 4992,5005,5004,5003,5002,5001,5000,4999,4998,4... 2000 hal-03300321
4 articles Proof Nets and Explicit Substitutions 1.0 4 1.950340 0.059162 4,15,17,2909,2049,2545,30,1376,1545,3291,41,14... 4314,4334,4291,4278,4368,4480,4433,4371,4310,4... 4956,4964,4963,4974,4982,4969,4977,4960,4958,4... 5134,5384,5292,5317,5296,5115,5155,5006,5007,5... 6012,7807,9948,8627,9211,6376,9793,8280,9951,7... 4992,4990,4989,4991,5008,4994,5007,5006, 2000 hal-00384955
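
The export contains one row per entity point across all natures. As a quick check, we can count the exported points per nature:

# Number of exported points for each nature; a quick consistency check.
print(df["nature"].value_counts())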


As we have saved the plots, we can view them.

We have 4 cluster levels, as defined in the pipeline.yaml file (we verify the level names programmatically after the list), corresponding to

  • hl_clusters,

  • ml_clusters,

  • ll_clusters,

  • vll_clusters.
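
The level names correspond to the clustering natures on the pipeline, the same values we iterate over when plotting below:

# The cluster level natures as configured in pipeline.yaml; expected to
# contain the four levels listed above.
print(pipeline.clustering.natures)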

for file in pipeline.working_dir.glob("*.png"):
    print(file)

""
import matplotlib.pyplot as plt  # noqa

image_names = []
for nature in pipeline.clustering.natures:
    image_title_parts = pipeline.title_parts_clus(nature)
    image_names.append("_".join(image_title_parts) + ".png")

rows = 2
columns = 2

f, ax = plt.subplots(rows, columns, figsize=(30, 10*rows))

for i, image_name in enumerate(image_names):
    img = plt.imread(pipeline.working_dir / image_name)

    row_num = i // columns
    col_num = i % columns

    ax[row_num][col_num].imshow(img)
    ax[row_num][col_num].axis('off')

plt.tight_layout()
plt.show()
[Figure: pipeline yaml lisn]
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_ll_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_vll_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_ml_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_hl_clusters.png

We have run the pipeline with the configuration values in the pipeline.yaml file.

Make modifications to Pipeline loaded from YAML

We can modify the pipeline instance and rerun it using pipeline.run, or alternatively run each step individually.

In the rest of this notebook, we will run each step ourselves, changing some parameters.

The current pipeline has natures:

pipeline.natures
['articles', 'authors', 'teams', 'labs', 'words']

In dataset.yaml we can see that the words column uses CorpusColumn. Let’s verify:

dataset_columns = pipeline.dataset.columns
dataset_columns
[<cartodata.pipeline.columns.IdentityColumn object at 0x7efd55925430>, <cartodata.pipeline.columns.CSColumn object at 0x7efc85c55700>, <cartodata.pipeline.columns.CSColumn object at 0x7efc85c55730>, <cartodata.pipeline.columns.CSColumn object at 0x7efc85c559d0>, <cartodata.pipeline.columns.CorpusColumn object at 0x7efc85c55460>]
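
To make this easier to read, we can print each column’s class name together with its nature (assuming every column class exposes the nature attribute, as CorpusColumn, documented below, does):

# Class name and nature of each dataset column; assumes all column
# classes expose a `nature` attribute like CorpusColumn.
for col in dataset_columns:
    print(type(col).__name__, col.nature)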

The last column of the dataset is of type cartodata.pipeline.columns.CorpusColumn. We will view the documentation for CorpusColumn.

from cartodata.pipeline.columns import CorpusColumn  # noqa

help(CorpusColumn)
Help on class CorpusColumn in module cartodata.pipeline.columns:

class CorpusColumn(Column)
 |  CorpusColumn(nature, column_names, stopwords=None, english=True, nb_grams=4, min_df=1, max_df=1.0, vocab_sample=None, max_features=None, strip_accents='unicode', lowercase=True, min_word_length=5, normalize=True, vocabulary=None)
 |
 |  A column class that represents an entity specified by multiple text
 |  columns in the dataset.
 |
 |  For `CorpusColumn`, the entity matrix rows correspond to the rows of
 |  IdentityColumn. For each n-gram extracted from combined text of specified
 |  columns, a column is added to the matrix.
 |
 |  Attributes
 |  ----------
 |  nature: str
 |      The nature of the column entity
 |  column_names: list
 |      The list of column names in the dataset
 |  type: Columns
 |      The type of this column. Its value should be Columns.CORPUS
 |  stopwords: str
 |     The name of the stopwords file, or the URL for file that starts with
 |  http. It should be located under input_dir of the dataset.
 |  english: bool
 |      If True it will get union of the specified stopwords file with
 |  `sklearn.feature_extraction.text.ENGLISH_STOP_WORDS`.
 |  nb_grams: int
 |      Maximum n for n-grams.
 |  min_df: float in range [0.0, 1.0] or int
 |      When building the vocabulary ignore terms that have a document
 |  frequency strictly lower than the given threshold. This value is also
 |  called cut-off in the literature. If float, the parameter represents a
 |  proportion of documents, integer absolute counts. This parameter is
 |  ignored if vocabulary is not None.
 |  max_df: float in range [0.0, 1.0] or int
 |      When building the vocabulary ignore terms that have a document
 |  frequency strictly higher than the given threshold (corpus-specific
 |  stopwords). If float, the parameter represents a proportion of documents,
 |  integer absolute counts. This parameter is ignored if vocabulary is not
 |  None.
 |  vocab_sample: int
 |      Sample size from the corpus to train the vectorizer
 |  max_features:int
 |      If not None, build a vocabulary that only consider the top
 |  max_features ordered by term frequency across the corpus. Otherwise, all
 |  features are used.
 |  strip_accents: {‘ascii’, ‘unicode’} or callable
 |  lowercase: bool
 |      Convert all characters to lowercase before tokenizing
 |  min_word_length: int
 |      Minimum word length
 |  normalize: bool
 |      Normalizes the returned matrix if set to True
 |  vocabulary: str or set
 |      Either a set of terms or a file name that contains the terms one in
 |  each line, or the URL for file that starts with http.
 |
 |  Methods
 |  -------
 |  load()
 |      Loads and processes the column specified by `column_name` in the
 |  dataset and generates the entity and scores matrices.
 |  get_corpus(df)
 |      Creates a text column merging the values of the columns putting
 |  " . " between column values specified for this column.
 |
 |
 |  Corpus column uses custom
 |  `cartodata.loading.PunctuationCountVectorizer` which splits the
 |  document on "[.,;]" and builds n-grams from each segment separately.
 |
 |  Method resolution order:
 |      CorpusColumn
 |      Column
 |      cartodata.pipeline.base.BaseEntity
 |      builtins.object
 |
 |  Methods defined here:
 |
 |  __init__(self, nature, column_names, stopwords=None, english=True, nb_grams=4, min_df=1, max_df=1.0, vocab_sample=None, max_features=None, strip_accents='unicode', lowercase=True, min_word_length=5, normalize=True, vocabulary=None)
 |      Parameters
 |      ----------
 |      nature: str
 |          The nature of the column entity
 |      column_names: list
 |          The list of column names in the dataset
 |      type: Columns
 |          The type of this column. Its value should be Columns.CORPUS
 |      stopwords: str, default=None
 |         The name of the stopwords file, or the URL for file that starts with
 |      http. It should be located under input_dir of the dataset.
 |      english: bool, default=True
 |          If True it will get union of the specified stopwords file with
 |      `sklearn.feature_extraction.text.ENGLISH_STOP_WORDS`.
 |      nb_grams: int, default=4
 |          Maximum n for n-grams.
 |      min_df: float in range [0.0, 1.0] or int, default=1
 |          When building the vocabulary ignore terms that have a document
 |      frequency strictly lower than the given threshold. This value is also
 |      called cut-off in the literature. If float, the parameter represents a
 |      proportion of documents, integer absolute counts. This parameter is
 |      ignored if vocabulary is not None.
 |      max_df: float in range [0.0, 1.0] or int, default=1.0
 |          When building the vocabulary ignore terms that have a document
 |      frequency strictly higher than the given threshold (corpus-specific
 |      stopwords). If float, the parameter represents a proportion of
 |      documents, integer absolute counts. This parameter is ignored if
 |      vocabulary is not None.
 |      vocab_sample: int, default=None
 |          Sample size from the corpus to train the vectorizer
 |      max_features:int, default=None
 |          If not None, build a vocabulary that only consider the top
 |      max_features ordered by term frequency across the corpus. Otherwise,
 |      all features are used. If vocabulary is not None, this parameter is
 |      ignored.
 |      strip_accents: {‘ascii’, ‘unicode’} or callable, default=unicode
 |      lowercase: bool, default=True
 |          Convert all characters to lowercase before tokenizing
 |      min_word_length: int, default 5
 |          Minimum word length
 |      normalize: bool, default=True
 |          Normalizes the returned matrix if set to True
 |      vocabulary: str or set, default=None
 |          Either a set of terms or a file name that contains the terms one in
 |      each line, or the URL for file that starts with http.
 |
 |  get_corpus(self, df)
 |      Creates a text column merging the values of the columns putting
 |      " . " between column values specified for this column.
 |
 |      Returns
 |      -------
 |      pandas.Series
 |          a Series that contains merged text from the columns in CorpusColumn
 |
 |  load(self, dataset)
 |      Loads and processes the column specified by `column_name` in the
 |      dataset and generates the entity and scores matrices.
 |
 |      Parameters
 |      ----------
 |      dataset: cartodata.pipeline.datasets.Dataset
 |          The dataset from which the column will be loaded
 |
 |      Returns
 |      -------
 |          A sparse matrix and a pandas series
 |
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |
 |  yaml_tag = '!CorpusColumn'
 |
 |  ----------------------------------------------------------------------
 |  Readonly properties inherited from Column:
 |
 |  params
 |      Returns the parameter values of this estimator as string
 |      concatenated with ``_`` character.
 |
 |      Returns
 |      -------
 |      str
 |
 |  ----------------------------------------------------------------------
 |  Readonly properties inherited from cartodata.pipeline.base.BaseEntity:
 |
 |  params_values
 |      Returns all parameter name-parameter value pairs as a dictionary.
 |
 |      Returns
 |      -------
 |      dict
 |
 |  phase
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from cartodata.pipeline.base.BaseEntity:
 |
 |  __dict__
 |      dictionary for instance variables (if defined)
 |
 |  __weakref__
 |      list of weak references to the object (if defined)

CorpusColumn uses the custom PunctuationCountVectorizer to build the n-grams. Let’s say that instead of PunctuationCountVectorizer we would like to use sklearn.feature_extraction.text.TfidfVectorizer.

We will define a custom column class as follows:

import pandas as pd  # noqa
import numpy as np  # noqa
from sklearn.feature_extraction.text import TfidfVectorizer  # noqa

from cartodata.operations import normalize_tfidf  # noqa
from cartodata.pipeline.columns import CorpusColumn  # noqa


class TfidfCorpusColumn(CorpusColumn):

    def __init__(self, nature, column_names, stopwords, english=True,
                 nb_grams=4, min_df=10, max_df=0.05, vocab_sample=None,
                 max_features=None, strip_accents="unicode",
                 lowercase=True, min_word_length=5, normalize=True,
                 input='content', binary=True, encoding="utf-8"):

        super().__init__(nature, column_names, stopwords, english, nb_grams,
                         min_df, max_df, vocab_sample, max_features,
                         strip_accents, lowercase, min_word_length, normalize)
        self.binary = binary
        self.encoding = encoding
        self.input = input

    def load(self, dataset):

        df = dataset.df
        stopwords = self._load_stopwords(
            dataset.input_dir, self.stopwords, self.english, dataset.name
        )
        corpus = df[self.column_names].apply(
            lambda row: ' . '.join(row.values.astype(str)), axis=1)

        tf_vectorizer = TfidfVectorizer(
            input=self.input, ngram_range=(1, self.nb_grams),
            min_df=self.min_df, max_df=self.max_df, binary=self.binary,
            max_features=self.max_features, lowercase=self.lowercase,
            strip_accents=self.strip_accents, encoding=self.encoding,
            stop_words=stopwords)

        matrix = tf_vectorizer.fit_transform(corpus)

        scores = pd.Series(np.bincount(matrix.indices,
                                       minlength=matrix.shape[1]),
                           index=tf_vectorizer.get_feature_names_out())
        if self.normalize:
            matrix = normalize_tfidf(matrix)

        return matrix, scores

dataset_columns contains a shallow copy of the columns of the dataset. Now we will instantiate a new words_column to replace the existing one.

words_column = TfidfCorpusColumn(nature="words",
                                 column_names=["en_abstract_s", "en_title_s",
                                               "en_keyword_s", "en_domainAllCodeLabel_fs"],
                                 stopwords="stopwords.txt", nb_grams=4, min_df=10,
                                 max_df=0.05, min_word_length=5, normalize=True)

dataset_columns[4] = words_column
dataset_columns

""
pipeline.working_dir
PosixPath('/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1')

We see that the type of the words column has changed. Since dataset_columns is a shallow copy, we now have to set the columns back on the dataset, and then verify the pipeline’s natures and the dataset’s columns.

pipeline.dataset.set_columns(dataset_columns)

print(pipeline.natures)

""
print(pipeline.dataset.columns)
['articles', 'authors', 'teams', 'labs', 'words']
[<cartodata.pipeline.columns.IdentityColumn object at 0x7efd55925430>, <cartodata.pipeline.columns.CSColumn object at 0x7efc85c55700>, <cartodata.pipeline.columns.CSColumn object at 0x7efc85c55730>, <cartodata.pipeline.columns.CSColumn object at 0x7efc85c559d0>, <__main__.TfidfCorpusColumn object at 0x7efc78a47e80>]
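
As a quick programmatic check that the replacement took effect:

# The last column should now be our custom TfidfCorpusColumn.
assert isinstance(pipeline.dataset.columns[-1], TfidfCorpusColumn)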

In the pipeline.working_dir we have the dumps from the previous run.

for file in pipeline.working_dir.iterdir():
    print(file)
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_eval_negscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_labels.json
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_labels.json
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_eval_posscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_eval_negscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_labels.json
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_eval_posscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_eval_negscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_eval_negscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_ll_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_vll_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_eval_posscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_eval_posscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_eval_posscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_eval_posscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_labels.json
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_eval_posscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_ml_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lsa_components.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_eval_negscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_hl_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_eval_posscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_eval_negscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_eval_negscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_eval_negscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/export.feather
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_scores.csv.bz2

We do not want to use the dumps from the previous run. We can either remove the files or run pipeline.generate_entity_matrices with force=True.

We will generate the entity matrices with the new words column and save:

matrices, scores = pipeline.generate_entity_matrices(force=True)
/usr/local/lib/python3.9/site-packages/sklearn/feature_extraction/text.py:406: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ii', 'iii'] not in stop_words.
  warnings.warn(
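
matrices and scores hold one entry per nature; assuming they follow the order of pipeline.natures (consistent with how the dumps are named), we can inspect the new words matrix and its scores:

# Inspect the entity matrix produced by TfidfCorpusColumn; assumes the
# returned lists follow the order of pipeline.natures.
words_idx = pipeline.natures.index("words")
print(matrices[words_idx].shape)   # (number of articles, number of n-grams)
print(scores[words_idx].head())    # document counts per n-gram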

In the pipeline.yaml file, the pipeline is configured to use cartodata.pipeline.projectionnd.LSAProjection.

pipeline.projection_nd
<cartodata.pipeline.projectionnd.LSAProjection object at 0x7efc94e6ca00>

Instead of LSAProjection, this time we would like to use cartodata.pipeline.projectionnd.LDAProjection.

from cartodata.pipeline.projectionnd import LDAProjection  # noqa

projection_nd = LDAProjection(num_dims=80)
pipeline.set_projection_nd(projection_nd)

matrices_nD = pipeline.do_projection_nD(force=True)
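
Each projected matrix should now have 80 dimensions, matching num_dims. A quick check, assuming do_projection_nD returns one matrix per nature like generate_entity_matrices:

# Dimensions of the nD projections; each should include the 80 LDA dims.
for nature, mat in zip(pipeline.natures, matrices_nD):
    print(nature, mat.shape)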

""
for file in pipeline.working_dir.glob("*.lda"):
    print(file)

Then we will project to 2D. We should not forget to set force=True to make sure that the 2D projection is rerun and does not use the saved dumps.

matrices_2D = pipeline.do_projection_2D(force=True)

View the generated map:

from matplotlib.colors import TABLEAU_COLORS  # noqa
from matplotlib.lines import Line2D  # noqa

labels = tuple(pipeline.natures)
colors = list(TABLEAU_COLORS)[:len(matrices)]

pipeline.plot_map(matrices_2D, labels, colors)
[Figure: pipeline yaml lisn]
(<Figure size 960x640 with 1 Axes>, <Axes: >)

Now we have all necessary matrices to create clusters and neighbors. First we will create clusters:

(clus_nD, clus_2D, clus_scores, cluster_labels,
 cluster_eval_pos, cluster_eval_neg) = pipeline.do_clustering(force=True)
Warning: Less than 2 words in cluster 35 with (1) words.
Warning: Less than 2 words in cluster 43 with (0) words.
Warning: Less than 2 words in cluster 43 with (0) words.
Warning: Less than 2 words in cluster 52 with (1) words.
Warning: Less than 2 words in cluster 52 with (1) words.
Warning: Less than 2 words in cluster 58 with (1) words.
Warning: Less than 2 words in cluster 58 with (1) words.
Warning: Less than 2 words in cluster 70 with (0) words.
Warning: Less than 2 words in cluster 74 with (0) words.
Warning: Less than 2 words in cluster 74 with (0) words.
Warning: Less than 2 words in cluster 78 with (1) words.
Warning: Less than 2 words in cluster 78 with (1) words.
Warning: Less than 2 words in cluster 86 with (0) words.
Warning: Less than 2 words in cluster 86 with (0) words.
Warning: Less than 2 words in cluster 108 with (1) words.
Warning: Less than 2 words in cluster 108 with (1) words.
Warning: Less than 2 words in cluster 112 with (0) words.
Warning: Less than 2 words in cluster 112 with (0) words.
Warning: Less than 2 words in cluster 154 with (1) words.
Warning: Less than 2 words in cluster 154 with (1) words.
Warning: Less than 2 words in cluster 193 with (1) words.
Warning: Less than 2 words in cluster 193 with (1) words.
Warning: Less than 2 words in cluster 213 with (0) words.
Warning: Less than 2 words in cluster 213 with (0) words.
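
do_clustering returns, for each configured level, the nD and 2D cluster positions, scores, labels and evaluation scores. To see how many clusters each level produced (assuming clus_scores holds one pandas Series per level, as used below):

# Number of clusters per level; assumes one scores Series per level.
for nature, level_scores in zip(pipeline.clustering.natures, clus_scores):
    print(nature, len(level_scores))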

We will view only the medium level clusters:

ml_index = 1
clus_scores_ml = clus_scores[ml_index]
clus_mat_ml = clus_2D[ml_index]

fig_title = (
    f"{pipeline.dataset.name} {pipeline.dataset.version} "
    f"{pipeline.clustering.natures[ml_index]} {pipeline.projection_nd.key}"
)

fig_ml, ax_ml = pipeline.plot_map(matrices_2D, labels, colors,
                                  title=fig_title,
                                  annotations=clus_scores_ml.index,
                                  annotation_mat=clus_mat_ml)
[Figure: lisn 2022.11.15.1 ml_clusters lda]

And save the plot:

pipeline.save_plots()
[(<Figure size 960x640 with 1 Axes>, 'lisn_2022.11.15.1_15d3d1d060e5f28d_lda_80_True_0_20_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_hl_clusters.png'), (<Figure size 960x640 with 1 Axes>, 'lisn_2022.11.15.1_15d3d1d060e5f28d_lda_80_True_0_20_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_ml_clusters.png'), (<Figure size 960x640 with 1 Axes>, 'lisn_2022.11.15.1_15d3d1d060e5f28d_lda_80_True_0_20_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_ll_clusters.png'), (<Figure size 960x640 with 1 Axes>, 'lisn_2022.11.15.1_15d3d1d060e5f28d_lda_80_True_0_20_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_vll_clusters.png')]

Let’s display two cluster images generated by two runs of the pipeline side by side:

image_title_parts = pipeline.title_parts_clus("ml_clusters")
img1 = plt.imread(pipeline.working_dir / image_names[1])
img2 = plt.imread(pipeline.working_dir / ("_".join(image_title_parts) + ".png"))

f, ax = plt.subplots(1, 2, figsize=(20, 10))

ax[0].imshow(img1)
ax[1].imshow(img2)

ax[0].axis('off')
ax[1].axis('off')

plt.tight_layout()
plt.show()
[Figure: pipeline yaml lisn]

On the left we see the plot of medium level clusters using LSA and UMAP, and on the right we see the results using LDA and UMAP.

Then we will find neighbors:

pipeline.find_neighbors()
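
find_neighbors regenerates the nearest_*_for_* dumps listed in the directory listings above. To confirm:

# List the neighbor dumps produced by find_neighbors.
for f in sorted(pipeline.working_dir.glob("nearest_*")):
    print(f.name)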

Now we are ready to export data to export.feather file.

pipeline.export()

df = pd.read_feather(pipeline.working_dir / "export.feather")
df.head()
nature label score rank x y nn_articles nn_authors nn_teams nn_labs nn_words labs authors year url
0 articles Termination and Confluence of Higher-Order Rew... 1.0 0 4.364014 10.631992 0,30,662,13,17,15,432,4,1996,2049,2909,3270,20... 4290,4263,4323,4305,4329,4368,4273,4280,4543,4... 4956,4964,4963,4974,4982,4969,4960,4977,4966,4... 5134,5115,5384,5014,5081,5036,5109,5296,5515,5... 7442,8525,7440,9211,9948,8627,9793,7591,6012,9... 4992,4991,4990,4989, 2000 inria-00105556
1 articles Efficient Self-stabilization 1.0 1 5.481778 8.295186 1,60,358,22,878,233,409,1941,198,893,7,212,34,... 4262,4273,4271,4303,4385,4282,4268,4408,4538,4... 4956,4964,4963,4974,4982,4957,4969,4977,4960,4... 5024,5110,5103,5012,5031,5425,5185,5194,5174,5... 9560,9561,9897,6756,6757,9972,9144,8556,7655,6... 4992,4991,4990,4989, 4262 2000 tel-00124843
2 articles Resource-bounded relational reasoning: inducti... 1.0 2 5.520878 8.531384 2,134,196,11,2615,581,2198,289,286,1463,2001,2... 4263,4329,4345,4368,4273,4411,4543,4731,4310,4... 4956,4964,4963,4974,4982,4969,4960,4977,4966,4... 5070,5116,5056,5225,5349,5384,5081,5141,5458,5... 7926,9072,7601,8839,7591,8980,8412,7590,8573,7... 4992,4990,4989,4991,4994,4993, 4263 2000 hal-00111312
3 articles Reasoning about generalized intervals : Horn r... 1.0 3 2.178202 3.326003 3,98,2163,274,3016,1505,2995,3278,3064,2306,15... 4305,4323,4278,4273,4411,4340,4374,4322,4308,4... 4956,4964,4963,4974,4969,4977,4966,4960,4962,4... 5116,5003,5373,5117,5317,5004,5338,5292,5258,5... 7733,7321,5651,6461,6209,9868,6163,6426,8631,6... 4992,5005,5004,5003,5002,5001,5000,4999,4998,4... 2000 hal-03300321
4 articles Proof Nets and Explicit Substitutions 1.0 4 3.025347 11.557344 4,15,17,2909,2049,2545,30,1376,1545,3291,41,14... 4314,4334,4291,4278,4368,4480,4433,4371,4310,4... 4956,4964,4963,4974,4982,4969,4977,4960,4958,4... 5134,5384,5292,5317,5296,5115,5155,5006,5007,5... 6012,7807,9948,8627,9211,6376,9793,8280,9951,7... 4992,4990,4989,4991,5008,4994,5007,5006, 2000 hal-00384955


It is also possible to change the export configuration.

export_natures = pipeline.export_natures

The pipeline.yaml file defines export configuration for only articles and authors.

for nature in export_natures:
    print(nature.key)
articles
authors

Let’s say we would like to add year data to the labs nature.

The original dataset contains a column producedDateY_i which holds the year in which each article was published. We can add this data as metadata for each point, renaming the column to the clearer alternative year.

We can map year data in the file to labs using EntityMetadataMapColumn.

from cartodata.pipeline.exporting import EntityMetadataMapColumn # noqa

meta_year_lab = EntityMetadataMapColumn(entity="labs", column="producedDateY_i",
                                        as_column="year")

Then we will initialize ExportNature for labs using meta_year_lab.

from cartodata.pipeline.exporting import ExportNature # noqa

ex_lab = ExportNature(key="labs",
                         add_metadata=[meta_year_lab])

new_export_natures = export_natures + [ex_lab]

""
pipeline.export(new_export_natures)

""
df = pd.read_feather(pipeline.working_dir / "export.feather")

df[df["nature"] == "labs"].head(5)
nature label score rank x y nn_articles nn_authors nn_teams nn_labs nn_words labs authors year url
4989 labs LRI 4789.0 4989 4.909530 4.706197 655,1142,412,2066,2546,298,1816,3562,1278,864,... 4263,4290,4293,4341,4554,4305,4550,4323,4329,4... 4956,4964,4963,4974,4982,4969,4977,4960,4966,4... 4989,4990,4991,5008,4992,5062,5107,4993,5060,4... 9137,9157,6197,8866,9133,7894,6817,8438,6067,7... None None 2000,2001,2002,2003,2004,2005,2006,2007,2008,2... None
4990 labs UP11 6271.0 4990 4.696403 4.794846 655,1141,298,1142,2298,2387,1165,2098,1144,181... 4293,4263,4290,4341,4554,4305,4329,4550,4314,4... 4956,4964,4963,4974,4982,4969,4977,4960,4962,4... 4990,4989,4991,5008,4992,4993,5062,5030,5012,5... 9137,9157,6197,7894,8438,9133,7957,7542,6363,5... None None 2000,2001,2002,2003,2004,2005,2006,2007,2008,2... None
4991 labs CNRS 10217.0 4991 4.986214 4.539379 3179,2387,2082,1032,1816,551,3744,607,3073,107... 4263,4290,4293,4341,4550,4305,4706,4329,4323,4... 4956,4964,4963,4974,4982,4969,4977,4960,4962,4... 4991,4989,4990,4992,5008,5062,4994,4993,5107,5... 9157,8656,9137,6197,7953,8438,6661,7894,7531,5... None None 2000,2001,2002,2003,2004,2005,2006,2007,2008,2... None
4992 labs LISN 5203.0 4992 5.225499 3.964251 4154,3642,3973,3848,4142,4025,3447,4221,3649,3... 4290,4263,4293,4550,4554,4341,4706,4323,4305,4... 4956,4964,4963,4974,4982,4969,4977,4960,4966,4... 4992,4991,4989,5008,4990,5059,5005,5151,5062,4... 7531,8656,6067,6661,5592,9157,5593,7894,7913,5... None None 2000,2001,2002,2003,2004,2005,2006,2007,2008,2... None
4993 labs X 487.0 4993 1.629990 2.366460 1570,837,834,1307,45,2325,1398,1226,2434,2298,... 4292,4281,4263,4572,4314,4330,4661,4463,4424,4... 4963,4956,4964,4974,4982,4969,4977,4962,4960,4... 4993,5030,5170,5421,5382,5200,5424,5410,5165,5... 8897,8896,9318,9316,5934,9317,8296,8874,8295,7... None None 2000,2001,2002,2003,2004,2005,2006,2007,2008,2... None


Total running time of the script: (3 minutes 30.250 seconds)
