Using the Pipeline API with a YAML file

This notebook demonstrates the usage of the Pipeline API by loading a pipeline from a pipeline.yaml file.

For creating a pipeline using the Pipeline constructor, please see the notebook [pipeline_lisn_lsa_kmeans.py](./pipeline_lisn_lsa_kmeans.py).

The YAML files that configure the Pipeline are located in the ../conf directory. For each dataset there is a directory under conf, and each dataset directory contains dataset.yaml and pipeline.yaml files.

We can list all the datasets where a pipeline is defined:

from pathlib import Path

ROOT_DIR = Path.cwd().parent

CONF_DIR = ROOT_DIR / "conf"

""
# !ls $CONF_DIR
''

We will work with the lisn dataset.

# !ls $CONF_DIR/lisn

Here we see both dataset.yaml and pipeline.yaml files. dataset.yaml is used by pipeline.yaml to initialize the lisn dataset. Let’s view the contents of dataset.yaml:

# !cat $CONF_DIR/lisn/dataset.yaml

Let’s view the contents of the pipeline.yaml file:

# !cat $CONF_DIR/lisn/pipeline.yaml

Load Pipeline from YAML file

We can load this pipeline using cartodata.pipeline.loader.load_pipeline.

We will run this in the project’s root directory (the parent of the examples directory), as the paths in the pipeline.yaml file are relative to the project’s root directory.

from cartodata.pipeline.loader import load_pipeline  # noqa
import os  # noqa
from pathlib import Path # noqa

# run in parent directory
os.chdir("../")

current_dir = Path(".").absolute()
conf_dir = current_dir / "conf"

pipeline = load_pipeline("lisn", conf_dir)

We can simply run the pipeline with the configuration specified in the pipeline.yaml file.

We do not want to use the dump files from a previous run, so we set force=True. To see the results of the run, we also set save_plots=True.

pipeline.run(save_plots=True, force=True)
Downloading data from https://zenodo.org/records/7323538/files/lisn_2000_2022.csv (6.3 MB)


file_sizes:   0%|                                   | 0.00/6.59M [00:00<?, ?B/s]
file_sizes:  45%|███████████▋              | 2.97M/6.59M [00:00<00:00, 28.9MB/s]
file_sizes: 100%|██████████████████████████| 6.59M/6.59M [00:00<00:00, 47.8MB/s]
Successfully downloaded file to dumps/lisn/2022.11.15.1/lisn_2000_2022.csv

This runs the following steps (each can also be run individually, as sketched after the list):

  • loads the dataset data from file

  • generates entity matrices for all natures and saves them

  • executes projection and saves the results

  • executes 2D projection and saves the results

  • creates clusters

  • finds neighbors

  • saves all data to export.feather file
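
pipeline.run is a convenience wrapper around these steps; each one corresponds to a pipeline method that we will also call individually later in this notebook. A minimal sketch of the equivalent step-by-step run:

# Step-by-step equivalent of pipeline.run(save_plots=True, force=True);
# a sketch using the methods demonstrated later in this notebook.
matrices, scores = pipeline.generate_entity_matrices(force=True)
matrices_nD = pipeline.do_projection_nD(force=True)
matrices_2D = pipeline.do_projection_2D(force=True)
clustering_results = pipeline.do_clustering(force=True)
pipeline.find_neighbors()
pipeline.export()
pipeline.save_plots()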

The results are saved under pipeline.working_dir.

for file in pipeline.working_dir.iterdir():
    print(file)
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_eval_negscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_labels.json
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_labels.json
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_eval_posscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_eval_negscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_labels.json
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_eval_posscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_eval_negscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_eval_negscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_ll_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_vll_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_eval_posscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_eval_posscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_eval_posscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_eval_posscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_labels.json
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_eval_posscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_ml_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lsa_components.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_eval_negscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_hl_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_eval_posscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_eval_negscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_eval_negscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_eval_negscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/export.feather
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_scores.csv.bz2

Let’s view the export.feather file:

import pandas as pd  # noqa

df = pd.read_feather(pipeline.working_dir / "export.feather")
df.head()
nature label score rank x y nn_articles nn_authors nn_teams nn_labs nn_words labs authors year url
0 articles Termination and Confluence of Higher-Order Rew... 1.0 0 2.062684 0.103818 0,30,662,13,17,15,432,4,1996,2049,2909,3270,20... 4290,4263,4323,4305,4329,4368,4273,4280,4543,4... 4956,4964,4963,4974,4982,4969,4960,4977,4966,4... 5134,5115,5384,5014,5081,5036,5109,5296,5515,5... 7442,8525,7440,9211,9948,8627,9793,7591,6012,9... 4992,4991,4990,4989, 2000 inria-00105556
1 articles Efficient Self-stabilization 1.0 1 2.134408 3.553994 1,60,358,22,878,233,409,1941,198,893,7,212,34,... 4262,4273,4271,4303,4385,4282,4268,4408,4538,4... 4956,4964,4963,4974,4982,4957,4969,4977,4960,4... 5024,5110,5103,5012,5031,5425,5185,5194,5174,5... 9560,9561,9897,6756,6757,9972,9144,8556,7655,6... 4992,4991,4990,4989, 4262 2000 tel-00124843
2 articles Resource-bounded relational reasoning: inducti... 1.0 2 2.999778 6.918508 2,134,196,11,2615,581,2198,289,286,1463,2001,2... 4263,4329,4345,4368,4273,4411,4543,4731,4310,4... 4956,4964,4963,4974,4982,4969,4960,4977,4966,4... 5070,5116,5056,5225,5349,5384,5081,5141,5458,5... 7926,9072,7601,8839,7591,8980,8412,7590,8573,7... 4992,4990,4989,4991,4994,4993, 4263 2000 hal-00111312
3 articles Reasoning about generalized intervals : Horn r... 1.0 3 0.200045 3.389739 3,98,2163,274,3016,1505,2995,3278,3064,2306,15... 4305,4323,4278,4273,4411,4340,4374,4322,4308,4... 4956,4964,4963,4974,4969,4977,4966,4960,4962,4... 5116,5003,5373,5117,5317,5004,5338,5292,5258,5... 7733,7321,5651,6461,6209,9868,6163,6426,8631,6... 4992,5005,5004,5003,5002,5001,5000,4999,4998,4... 2000 hal-03300321
4 articles Proof Nets and Explicit Substitutions 1.0 4 1.950340 0.059162 4,15,17,2909,2049,2545,30,1376,1545,3291,41,14... 4314,4334,4291,4278,4368,4480,4433,4371,4310,4... 4956,4964,4963,4974,4982,4969,4977,4960,4958,4... 5134,5384,5292,5317,5296,5115,5155,5006,5007,5... 6012,7807,9948,8627,9211,6376,9793,8280,9951,7... 4992,4990,4989,4991,5008,4994,5007,5006, 2000 hal-00384955
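
The export contains one row per entity point across all natures. As a quick check, we can count the exported points per nature:

# Number of exported points for each nature; a quick consistency check.
print(df["nature"].value_counts())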


As we have saved the plots, we can view them.

We have 4 cluster levels, as defined in the pipeline.yaml file (we verify the level names programmatically after the list), corresponding to

  • hl_clusters,

  • ml_clusters,

  • ll_clusters,

  • vll_clusters.
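
The level names correspond to the clustering natures on the pipeline, the same values we iterate over when plotting below:

# The cluster level natures as configured in pipeline.yaml; expected to
# contain the four levels listed above.
print(pipeline.clustering.natures)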

for file in pipeline.working_dir.glob("*.png"):
    print(file)

""
import matplotlib.pyplot as plt  # noqa

image_names = []
for nature in pipeline.clustering.natures:
    image_title_parts = pipeline.title_parts_clus(nature)
    image_names.append("_".join(image_title_parts) + ".png")

rows = 2
columns = 2

f, ax = plt.subplots(rows, columns, figsize=(30, 10*rows))

for i, image_name in enumerate(image_names):
    img = plt.imread(pipeline.working_dir / image_name)

    row_num = i // columns
    col_num = i % columns

    ax[row_num][col_num].imshow(img)
    ax[row_num][col_num].axis('off')

plt.tight_layout()
plt.show()
[Figure: pipeline yaml lisn]
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_ll_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_vll_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_ml_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_hl_clusters.png

We have run the pipeline with the configuration values in the pipeline.yaml file.

Make modifications to Pipeline loaded from YAML

We can modify the pipeline instance and rerun it using pipeline.run, or alternatively run each step individually.

In the rest of this notebook, we will run each step ourselves, changing some parameters.

The current pipeline has natures:

pipeline.natures
['articles', 'authors', 'teams', 'labs', 'words']

In dataset.yaml we can see that the words column uses CorpusColumn. Let’s verify:

dataset_columns = pipeline.dataset.columns
dataset_columns
[<cartodata.pipeline.columns.IdentityColumn object at 0x7efd55925430>, <cartodata.pipeline.columns.CSColumn object at 0x7efc85c55700>, <cartodata.pipeline.columns.CSColumn object at 0x7efc85c55730>, <cartodata.pipeline.columns.CSColumn object at 0x7efc85c559d0>, <cartodata.pipeline.columns.CorpusColumn object at 0x7efc85c55460>]
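
To make this easier to read, we can print each column’s class name together with its nature (assuming every column class exposes the nature attribute, as CorpusColumn, documented below, does):

# Class name and nature of each dataset column; assumes all column
# classes expose a `nature` attribute like CorpusColumn.
for col in dataset_columns:
    print(type(col).__name__, col.nature)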

The last column of the dataset is of type cartodata.pipeline.columns.CorpusColumn. We will view the documentation for CorpusColumn.

from cartodata.pipeline.columns import CorpusColumn  # noqa

help(CorpusColumn)
Help on class CorpusColumn in module cartodata.pipeline.columns:

class CorpusColumn(Column)
 |  CorpusColumn(nature, column_names, stopwords=None, english=True, nb_grams=4, min_df=1, max_df=1.0, vocab_sample=None, max_features=None, strip_accents='unicode', lowercase=True, min_word_length=5, normalize=True, vocabulary=None)
 |
 |  A column class that represents an entity specified by multiple text
 |  columns in the dataset.
 |
 |  For `CorpusColumn`, the entity matrix rows correspond to the rows of
 |  IdentityColumn. For each n-gram extracted from combined text of specified
 |  columns, a column is added to the matrix.
 |
 |  Attributes
 |  ----------
 |  nature: str
 |      The nature of the column entity
 |  column_names: list
 |      The list of column names in the dataset
 |  type: Columns
 |      The type of this column. Its value should be Columns.CORPUS
 |  stopwords: str
 |     The name of the stopwords file, or the URL for file that starts with
 |  http. It should be located under input_dir of the dataset.
 |  english: bool
 |      If True it will get union of the specified stopwords file with
 |  `sklearn.feature_extraction.text.ENGLISH_STOP_WORDS`.
 |  nb_grams: int
 |      Maximum n for n-grams.
 |  min_df: float in range [0.0, 1.0] or int
 |      When building the vocabulary ignore terms that have a document
 |  frequency strictly lower than the given threshold. This value is also
 |  called cut-off in the literature. If float, the parameter represents a
 |  proportion of documents, integer absolute counts. This parameter is
 |  ignored if vocabulary is not None.
 |  max_df: float in range [0.0, 1.0] or int
 |      When building the vocabulary ignore terms that have a document
 |  frequency strictly higher than the given threshold (corpus-specific
 |  stopwords). If float, the parameter represents a proportion of documents,
 |  integer absolute counts. This parameter is ignored if vocabulary is not
 |  None.
 |  vocab_sample: int
 |      Sample size from the corpus to train the vectorizer
 |  max_features:int
 |      If not None, build a vocabulary that only consider the top
 |  max_features ordered by term frequency across the corpus. Otherwise, all
 |  features are used.
 |  strip_accents: {‘ascii’, ‘unicode’} or callable
 |  lowercase: bool
 |      Convert all characters to lowercase before tokenizing
 |  min_word_length: int
 |      Minimum word length
 |  normalize: bool
 |      Normalizes the returned matrix if set to True
 |  vocabulary: str or set
 |      Either a set of terms or a file name that contains the terms one in
 |  each line, or the URL for file that starts with http.
 |
 |  Methods
 |  -------
 |  load()
 |      Loads and processes the column specified by `column_name` in the
 |  dataset and generates the entity and scores matrices.
 |  get_corpus(df)
 |      Creates a text column merging the values of the columns putting
 |  " . " between column values specified for this column.
 |
 |
 |  Corpus column uses custom
 |  `cartodata.loading.PunctuationCountVectorizer` which splits the
 |  document on "[.,;]" and builds n-grams from each segment separately.
 |
 |  Method resolution order:
 |      CorpusColumn
 |      Column
 |      cartodata.pipeline.base.BaseEntity
 |      builtins.object
 |
 |  Methods defined here:
 |
 |  __init__(self, nature, column_names, stopwords=None, english=True, nb_grams=4, min_df=1, max_df=1.0, vocab_sample=None, max_features=None, strip_accents='unicode', lowercase=True, min_word_length=5, normalize=True, vocabulary=None)
 |      Parameters
 |      ----------
 |      nature: str
 |          The nature of the column entity
 |      column_names: list
 |          The list of column names in the dataset
 |      type: Columns
 |          The type of this column. Its value should be Columns.CORPUS
 |      stopwords: str, default=None
 |         The name of the stopwords file, or the URL for file that starts with
 |      http. It should be located under input_dir of the dataset.
 |      english: bool, default=True
 |          If True it will get union of the specified stopwords file with
 |      `sklearn.feature_extraction.text.ENGLISH_STOP_WORDS`.
 |      nb_grams: int, default=4
 |          Maximum n for n-grams.
 |      min_df: float in range [0.0, 1.0] or int, default=1
 |          When building the vocabulary ignore terms that have a document
 |      frequency strictly lower than the given threshold. This value is also
 |      called cut-off in the literature. If float, the parameter represents a
 |      proportion of documents, integer absolute counts. This parameter is
 |      ignored if vocabulary is not None.
 |      max_df: float in range [0.0, 1.0] or int, default=1.0
 |          When building the vocabulary ignore terms that have a document
 |      frequency strictly higher than the given threshold (corpus-specific
 |      stopwords). If float, the parameter represents a proportion of
 |      documents, integer absolute counts. This parameter is ignored if
 |      vocabulary is not None.
 |      vocab_sample: int, default=None
 |          Sample size from the corpus to train the vectorizer
 |      max_features:int, default=None
 |          If not None, build a vocabulary that only consider the top
 |      max_features ordered by term frequency across the corpus. Otherwise,
 |      all features are used. If vocabulary is not None, this parameter is
 |      ignored.
 |      strip_accents: {‘ascii’, ‘unicode’} or callable, default=unicode
 |      lowercase: bool, default=True
 |          Convert all characters to lowercase before tokenizing
 |      min_word_length: int, default 5
 |          Minimum word length
 |      normalize: bool, default=True
 |          Normalizes the returned matrix if set to True
 |      vocabulary: str or set, default=None
 |          Either a set of terms or a file name that contains the terms one in
 |      each line, or the URL for file that starts with http.
 |
 |  get_corpus(self, df)
 |      Creates a text column merging the values of the columns putting
 |      " . " between column values specified for this column.
 |
 |      Returns
 |      -------
 |      pandas.Series
 |          a Series that contains merged text from the columns in CorpusColumn
 |
 |  load(self, dataset)
 |      Loads and processes the column specified by `column_name` in the
 |      dataset and generates the entity and scores matrices.
 |
 |      Parameters
 |      ----------
 |      dataset: cartodata.pipeline.datasets.Dataset
 |          The dataset from which the column will be loaded
 |
 |      Returns
 |      -------
 |          A sparse matrix and a pandas series
 |
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |
 |  yaml_tag = '!CorpusColumn'
 |
 |  ----------------------------------------------------------------------
 |  Readonly properties inherited from Column:
 |
 |  params
 |      Returns the parameter values of this estimator as string
 |      concatenated with ``_`` character.
 |
 |      Returns
 |      -------
 |      str
 |
 |  ----------------------------------------------------------------------
 |  Readonly properties inherited from cartodata.pipeline.base.BaseEntity:
 |
 |  params_values
 |      Returns all parameter name-parameter value pairs as a dictionary.
 |
 |      Returns
 |      -------
 |      dict
 |
 |  phase
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from cartodata.pipeline.base.BaseEntity:
 |
 |  __dict__
 |      dictionary for instance variables (if defined)
 |
 |  __weakref__
 |      list of weak references to the object (if defined)

CorpusColumn uses the custom PunctuationCountVectorizer to build the n-grams. Let’s say that instead of PunctuationCountVectorizer we would like to use sklearn.feature_extraction.text.TfidfVectorizer.

We will define a custom column class as follows:

import pandas as pd  # noqa
import numpy as np  # noqa
from sklearn.feature_extraction.text import TfidfVectorizer  # noqa

from cartodata.operations import normalize_tfidf  # noqa
from cartodata.pipeline.columns import CorpusColumn  # noqa


class TfidfCorpusColumn(CorpusColumn):

    def __init__(self, nature, column_names, stopwords, english=True,
                 nb_grams=4, min_df=10, max_df=0.05, vocab_sample=None,
                 max_features=None, strip_accents="unicode",
                 lowercase=True, min_word_length=5, normalize=True,
                 input='content', binary=True, encoding="utf-8"):

        super().__init__(nature, column_names, stopwords, english, nb_grams,
                         min_df, max_df, vocab_sample, max_features,
                         strip_accents, lowercase, min_word_length, normalize)
        self.binary = binary
        self.encoding = encoding
        self.input = input

    def load(self, dataset):

        df = dataset.df
        stopwords = self._load_stopwords(
            dataset.input_dir, self.stopwords, self.english, dataset.name
        )
        corpus = df[self.column_names].apply(
            lambda row: ' . '.join(row.values.astype(str)), axis=1)

        tf_vectorizer = TfidfVectorizer(
            input=self.input, ngram_range=(1, self.nb_grams),
            min_df=self.min_df, max_df=self.max_df, binary=self.binary,
            max_features=self.max_features, lowercase=self.lowercase,
            strip_accents=self.strip_accents, encoding=self.encoding,
            stop_words=stopwords)

        matrix = tf_vectorizer.fit_transform(corpus)

        scores = pd.Series(np.bincount(matrix.indices,
                                       minlength=matrix.shape[1]),
                           index=tf_vectorizer.get_feature_names_out())
        if self.normalize:
            matrix = normalize_tfidf(matrix)

        return matrix, scores

dataset_columns contains a shallow copy of the columns of the dataset. Now we will instantiate a new words_column to replace the existing one.

words_column = TfidfCorpusColumn(nature="words",
                                 column_names=["en_abstract_s", "en_title_s",
                                               "en_keyword_s", "en_domainAllCodeLabel_fs"],
                                 stopwords="stopwords.txt", nb_grams=4, min_df=10,
                                 max_df=0.05, min_word_length=5, normalize=True)

dataset_columns[4] = words_column
dataset_columns

""
pipeline.working_dir
PosixPath('/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1')

We see that the type of the words column has changed. Since dataset_columns is a shallow copy, we now have to set the columns back on the dataset, and then verify the pipeline’s natures and the dataset’s columns.

pipeline.dataset.set_columns(dataset_columns)

print(pipeline.natures)

""
print(pipeline.dataset.columns)
['articles', 'authors', 'teams', 'labs', 'words']
[<cartodata.pipeline.columns.IdentityColumn object at 0x7efd55925430>, <cartodata.pipeline.columns.CSColumn object at 0x7efc85c55700>, <cartodata.pipeline.columns.CSColumn object at 0x7efc85c55730>, <cartodata.pipeline.columns.CSColumn object at 0x7efc85c559d0>, <__main__.TfidfCorpusColumn object at 0x7efc78a47e80>]
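
As a quick programmatic check that the replacement took effect:

# The last column should now be our custom TfidfCorpusColumn.
assert isinstance(pipeline.dataset.columns[-1], TfidfCorpusColumn)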

In the pipeline.working_dir we have the dumps from the previous run.

for file in pipeline.working_dir.iterdir():
    print(file)
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_eval_negscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_labels.json
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_labels.json
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_eval_posscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_eval_negscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_labels.json
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_eval_posscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_eval_negscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_eval_negscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_ll_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_vll_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_eval_posscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_eval_posscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_eval_posscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_eval_posscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_labels.json
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_eval_posscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_ml_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lsa_components.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_eval_negscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_hl_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_eval_posscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_eval_negscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_eval_negscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_eval_negscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/export.feather
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_scores.csv.bz2

We do not want to use the dumps from the previous run. We can either remove the files or run pipeline.generate_entity_matrices with force=True.

We will generate the entity matrices with the new words column and save:

matrices, scores = pipeline.generate_entity_matrices(force=True)
/usr/local/lib/python3.9/site-packages/sklearn/feature_extraction/text.py:406: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ii', 'iii'] not in stop_words.
  warnings.warn(
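
matrices and scores hold one entry per nature; assuming they follow the order of pipeline.natures (consistent with how the dumps are named), we can inspect the new words matrix and its scores:

# Inspect the entity matrix produced by TfidfCorpusColumn; assumes the
# returned lists follow the order of pipeline.natures.
words_idx = pipeline.natures.index("words")
print(matrices[words_idx].shape)   # (number of articles, number of n-grams)
print(scores[words_idx].head())    # document counts per n-gram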

In the pipeline.yaml file, the pipeline is configured to use cartodata.pipeline.projectionnd.LSAProjection.

pipeline.projection_nd
<cartodata.pipeline.projectionnd.LSAProjection object at 0x7efc94e6ca00>

Instead of LSAProjection, this time we would like to use cartodata.pipeline.projectionnd.LDAProjection.

from cartodata.pipeline.projectionnd import LDAProjection  # noqa

projection_nd = LDAProjection(num_dims=80)
pipeline.set_projection_nd(projection_nd)

matrices_nD = pipeline.do_projection_nD(force=True)
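
Each projected matrix should now have 80 dimensions, matching num_dims. A quick check, assuming do_projection_nD returns one matrix per nature like generate_entity_matrices:

# Dimensions of the nD projections; each should include the 80 LDA dims.
for nature, mat in zip(pipeline.natures, matrices_nD):
    print(nature, mat.shape)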

""
for file in pipeline.working_dir.glob("*.lda"):
    print(file)

Then we will project to 2D. We should not forget to set force=True to make sure that the 2D projection is rerun and does not use the saved dumps.

matrices_2D = pipeline.do_projection_2D(force=True)

View the generated map:

from matplotlib.colors import TABLEAU_COLORS  # noqa
from matplotlib.lines import Line2D  # noqa

labels = tuple(pipeline.natures)
colors = list(TABLEAU_COLORS)[:len(matrices)]

pipeline.plot_map(matrices_2D, labels, colors)
[Figure: pipeline yaml lisn]
(<Figure size 960x640 with 1 Axes>, <Axes: >)

Now we have all necessary matrices to create clusters and neighbors. First we will create clusters:

(clus_nD, clus_2D, clus_scores, cluster_labels,
 cluster_eval_pos, cluster_eval_neg) = pipeline.do_clustering(force=True)
Warning: Less than 2 words in cluster 35 with (1) words.
Warning: Less than 2 words in cluster 43 with (0) words.
Warning: Less than 2 words in cluster 43 with (0) words.
Warning: Less than 2 words in cluster 52 with (1) words.
Warning: Less than 2 words in cluster 52 with (1) words.
Warning: Less than 2 words in cluster 58 with (1) words.
Warning: Less than 2 words in cluster 58 with (1) words.
Warning: Less than 2 words in cluster 70 with (0) words.
Warning: Less than 2 words in cluster 74 with (0) words.
Warning: Less than 2 words in cluster 74 with (0) words.
Warning: Less than 2 words in cluster 78 with (1) words.
Warning: Less than 2 words in cluster 78 with (1) words.
Warning: Less than 2 words in cluster 86 with (0) words.
Warning: Less than 2 words in cluster 86 with (0) words.
Warning: Less than 2 words in cluster 108 with (1) words.
Warning: Less than 2 words in cluster 108 with (1) words.
Warning: Less than 2 words in cluster 112 with (0) words.
Warning: Less than 2 words in cluster 112 with (0) words.
Warning: Less than 2 words in cluster 154 with (1) words.
Warning: Less than 2 words in cluster 154 with (1) words.
Warning: Less than 2 words in cluster 193 with (1) words.
Warning: Less than 2 words in cluster 193 with (1) words.
Warning: Less than 2 words in cluster 213 with (0) words.
Warning: Less than 2 words in cluster 213 with (0) words.
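
do_clustering returns, for each configured level, the nD and 2D cluster positions, scores, labels and evaluation scores. To see how many clusters each level produced (assuming clus_scores holds one pandas Series per level, as used below):

# Number of clusters per level; assumes one scores Series per level.
for nature, level_scores in zip(pipeline.clustering.natures, clus_scores):
    print(nature, len(level_scores))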

We will view only the medium level clusters:

ml_index = 1
clus_scores_ml = clus_scores[ml_index]
clus_mat_ml = clus_2D[ml_index]

fig_title = (
    f"{pipeline.dataset.name} {pipeline.dataset.version} "
    f"{pipeline.clustering.natures[ml_index]} {pipeline.projection_nd.key}"
)

fig_ml, ax_ml = pipeline.plot_map(matrices_2D, labels, colors,
                                  title=fig_title,
                                  annotations=clus_scores_ml.index,
                                  annotation_mat=clus_mat_ml)
[Figure: lisn 2022.11.15.1 ml_clusters lda]

And save the plot:

pipeline.save_plots()
[(<Figure size 960x640 with 1 Axes>, 'lisn_2022.11.15.1_15d3d1d060e5f28d_lda_80_True_0_20_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_hl_clusters.png'), (<Figure size 960x640 with 1 Axes>, 'lisn_2022.11.15.1_15d3d1d060e5f28d_lda_80_True_0_20_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_ml_clusters.png'), (<Figure size 960x640 with 1 Axes>, 'lisn_2022.11.15.1_15d3d1d060e5f28d_lda_80_True_0_20_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_ll_clusters.png'), (<Figure size 960x640 with 1 Axes>, 'lisn_2022.11.15.1_15d3d1d060e5f28d_lda_80_True_0_20_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_vll_clusters.png')]

Let’s display two cluster images generated by two runs of the pipeline side by side:

image_title_parts = pipeline.title_parts_clus("ml_clusters")
img1 = plt.imread(pipeline.working_dir / image_names[1])
img2 = plt.imread(pipeline.working_dir / ("_".join(image_title_parts) + ".png"))

f, ax = plt.subplots(1, 2, figsize=(20, 10))

ax[0].imshow(img1)
ax[1].imshow(img2)

ax[0].axis('off')
ax[1].axis('off')

plt.tight_layout()
plt.show()
[Figure: pipeline yaml lisn]

On the left we see the plot of medium level clusters using LSA and UMAP, and on the right we see the results using LDA and UMAP.

Then we will find neighbors:

pipeline.find_neighbors()
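
find_neighbors regenerates the nearest_*_for_* dumps listed in the directory listings above. To confirm:

# List the neighbor dumps produced by find_neighbors.
for f in sorted(pipeline.working_dir.glob("nearest_*")):
    print(f.name)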

Now we are ready to export data to export.feather file.

pipeline.export()

df = pd.read_feather(pipeline.working_dir / "export.feather")
df.head()
nature label score rank x y nn_articles nn_authors nn_teams nn_labs nn_words labs authors year url
0 articles Termination and Confluence of Higher-Order Rew... 1.0 0 4.364014 10.631992 0,30,662,13,17,15,432,4,1996,2049,2909,3270,20... 4290,4263,4323,4305,4329,4368,4273,4280,4543,4... 4956,4964,4963,4974,4982,4969,4960,4977,4966,4... 5134,5115,5384,5014,5081,5036,5109,5296,5515,5... 7442,8525,7440,9211,9948,8627,9793,7591,6012,9... 4992,4991,4990,4989, 2000 inria-00105556
1 articles Efficient Self-stabilization 1.0 1 5.481778 8.295186 1,60,358,22,878,233,409,1941,198,893,7,212,34,... 4262,4273,4271,4303,4385,4282,4268,4408,4538,4... 4956,4964,4963,4974,4982,4957,4969,4977,4960,4... 5024,5110,5103,5012,5031,5425,5185,5194,5174,5... 9560,9561,9897,6756,6757,9972,9144,8556,7655,6... 4992,4991,4990,4989, 4262 2000 tel-00124843
2 articles Resource-bounded relational reasoning: inducti... 1.0 2 5.520878 8.531384 2,134,196,11,2615,581,2198,289,286,1463,2001,2... 4263,4329,4345,4368,4273,4411,4543,4731,4310,4... 4956,4964,4963,4974,4982,4969,4960,4977,4966,4... 5070,5116,5056,5225,5349,5384,5081,5141,5458,5... 7926,9072,7601,8839,7591,8980,8412,7590,8573,7... 4992,4990,4989,4991,4994,4993, 4263 2000 hal-00111312
3 articles Reasoning about generalized intervals : Horn r... 1.0 3 2.178202 3.326003 3,98,2163,274,3016,1505,2995,3278,3064,2306,15... 4305,4323,4278,4273,4411,4340,4374,4322,4308,4... 4956,4964,4963,4974,4969,4977,4966,4960,4962,4... 5116,5003,5373,5117,5317,5004,5338,5292,5258,5... 7733,7321,5651,6461,6209,9868,6163,6426,8631,6... 4992,5005,5004,5003,5002,5001,5000,4999,4998,4... 2000 hal-03300321
4 articles Proof Nets and Explicit Substitutions 1.0 4 3.025347 11.557344 4,15,17,2909,2049,2545,30,1376,1545,3291,41,14... 4314,4334,4291,4278,4368,4480,4433,4371,4310,4... 4956,4964,4963,4974,4982,4969,4977,4960,4958,4... 5134,5384,5292,5317,5296,5115,5155,5006,5007,5... 6012,7807,9948,8627,9211,6376,9793,8280,9951,7... 4992,4990,4989,4991,5008,4994,5007,5006, 2000 hal-00384955


It is also possible to change the export configuration.

export_natures = pipeline.export_natures

The pipeline.yaml file defines export configuration for only articles and authors.

for nature in export_natures:
    print(nature.key)
articles
authors

Let’s say we would like to add year data to the labs nature.

The original dataset contains a column producedDateY_i which holds the year in which each article was published. We can add this data as metadata for each point, renaming the column to the clearer alternative year.

We can map year data in the file to labs using EntityMetadataMapColumn.

from cartodata.pipeline.exporting import EntityMetadataMapColumn # noqa

meta_year_lab = EntityMetadataMapColumn(entity="labs", column="producedDateY_i",
                                        as_column="year")

Then we will initialize ExportNature for labs using meta_year_lab.

from cartodata.pipeline.exporting import ExportNature # noqa

ex_lab = ExportNature(key="labs",
                         add_metadata=[meta_year_lab])

new_export_natures = export_natures + [ex_lab]

""
pipeline.export(new_export_natures)

""
df = pd.read_feather(pipeline.working_dir / "export.feather")

df[df["nature"] == "labs"].head(5)
nature label score rank x y nn_articles nn_authors nn_teams nn_labs nn_words labs authors year url
4989 labs LRI 4789.0 4989 4.909530 4.706197 655,1142,412,2066,2546,298,1816,3562,1278,864,... 4263,4290,4293,4341,4554,4305,4550,4323,4329,4... 4956,4964,4963,4974,4982,4969,4977,4960,4966,4... 4989,4990,4991,5008,4992,5062,5107,4993,5060,4... 9137,9157,6197,8866,9133,7894,6817,8438,6067,7... None None 2000,2001,2002,2003,2004,2005,2006,2007,2008,2... None
4990 labs UP11 6271.0 4990 4.696403 4.794846 655,1141,298,1142,2298,2387,1165,2098,1144,181... 4293,4263,4290,4341,4554,4305,4329,4550,4314,4... 4956,4964,4963,4974,4982,4969,4977,4960,4962,4... 4990,4989,4991,5008,4992,4993,5062,5030,5012,5... 9137,9157,6197,7894,8438,9133,7957,7542,6363,5... None None 2000,2001,2002,2003,2004,2005,2006,2007,2008,2... None
4991 labs CNRS 10217.0 4991 4.986214 4.539379 3179,2387,2082,1032,1816,551,3744,607,3073,107... 4263,4290,4293,4341,4550,4305,4706,4329,4323,4... 4956,4964,4963,4974,4982,4969,4977,4960,4962,4... 4991,4989,4990,4992,5008,5062,4994,4993,5107,5... 9157,8656,9137,6197,7953,8438,6661,7894,7531,5... None None 2000,2001,2002,2003,2004,2005,2006,2007,2008,2... None
4992 labs LISN 5203.0 4992 5.225499 3.964251 4154,3642,3973,3848,4142,4025,3447,4221,3649,3... 4290,4263,4293,4550,4554,4341,4706,4323,4305,4... 4956,4964,4963,4974,4982,4969,4977,4960,4966,4... 4992,4991,4989,5008,4990,5059,5005,5151,5062,4... 7531,8656,6067,6661,5592,9157,5593,7894,7913,5... None None 2000,2001,2002,2003,2004,2005,2006,2007,2008,2... None
4993 labs X 487.0 4993 1.629990 2.366460 1570,837,834,1307,45,2325,1398,1226,2434,2298,... 4292,4281,4263,4572,4314,4330,4661,4463,4424,4... 4963,4956,4964,4974,4982,4969,4977,4962,4960,4... 4993,5030,5170,5421,5382,5200,5424,5410,5165,5... 8897,8896,9318,9316,5934,9317,8296,8874,8295,7... None None 2000,2001,2002,2003,2004,2005,2006,2007,2008,2... None


Total running time of the script: (3 minutes 30.250 seconds)
