Using Pipeline API with YAML file¶
This notebook demonstrates the usage of the Pipeline API by loading a pipeline from a pipeline.yaml file.
For creating a pipeline using the Pipeline constructor, please see the notebook [pipeline_lisn_lsa_kmeans.py](./pipeline_lisn_lsa_kmeans.py).
The YAML files that configure a Pipeline are located in the ../conf directory. For each dataset there is a directory under conf, and each dataset directory contains a dataset.yaml and a pipeline.yaml file.
We can list all the datasets where a pipeline is defined:
from pathlib import Path
ROOT_DIR = Path.cwd().parent
CONF_DIR = ROOT_DIR / "conf"
""
# !ls $CONF_DIR
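If shell commands are unavailable in your environment, the same listing can be obtained with pathlib alone (a minimal sketch; it assumes a dataset has a pipeline defined when its directory under conf contains a pipeline.yaml file):
for dataset_dir in sorted(CONF_DIR.iterdir()):
    # a dataset directory defines a pipeline if it holds a pipeline.yaml
    if (dataset_dir / "pipeline.yaml").exists():
        print(dataset_dir.name)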
We will work with lisn dataset file.
# !ls $CONF_DIR/lisn
Here we see both dataset.yaml and pipeline.yaml. dataset.yaml is used by pipeline.yaml to initialize the lisn dataset. Let’s see the contents of this file:
# !cat $CONF_DIR/lisn/dataset.yaml
Let’s view the contents of the pipeline.yaml file:
# !cat $CONF_DIR/lisn/pipeline.yaml
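Alternatively, the raw contents of both files can be printed directly from Python without shell magics:
print((CONF_DIR / "lisn" / "dataset.yaml").read_text())
print((CONF_DIR / "lisn" / "pipeline.yaml").read_text())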
Load Pipeline from YAML file¶
We can load this pipeline using cartodata.pipeline.loader.load_pipeline.
We will run this from the project’s root directory (the parent of the examples directory), as the paths in the pipeline.yaml file are relative to the project’s root.
from cartodata.pipeline.loader import load_pipeline # noqa
import os # noqa
from pathlib import Path # noqa
# run in parent directory
os.chdir("../")
current_dir = Path(".").absolute()
conf_dir = current_dir / "conf"
pipeline = load_pipeline("lisn", conf_dir)
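Before running the pipeline, we can inspect a couple of attributes of the loaded instance; both are used later in this notebook:
print(pipeline.natures)      # entity natures defined by the dataset columns
print(pipeline.working_dir)  # directory where dumps and plots are written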
We can simply run the pipeline with the configuration specified in the pipeline.yaml file.
We do not want to use the dump files from a previous run, so we set force=True; and as we would like to save the plots of the run, we set save_plots=True.
pipeline.run(save_plots=True, force=True)
Downloading data from https://zenodo.org/records/7323538/files/lisn_2000_2022.csv (6.3 MB)
file_sizes: 0%| | 0.00/6.59M [00:00<?, ?B/s]
file_sizes: 45%|███████████▋ | 2.97M/6.59M [00:00<00:00, 28.9MB/s]
file_sizes: 100%|██████████████████████████| 6.59M/6.59M [00:00<00:00, 47.8MB/s]
Successfully downloaded file to dumps/lisn/2022.11.15.1/lisn_2000_2022.csv
This runs the following steps, each corresponding to a pipeline method (see the sketch after this list):
loads the dataset data from file
generates entity matrices for all natures and saves them
executes projection and saves the results
executes 2D projection and saves the results
creates clusters
finds neighbors
saves all data to export.feather file
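For reference, the same sequence can be reproduced step by step with the pipeline methods used later in this notebook (a sketch based on the calls shown below; the clustering return values are abbreviated here):
matrices, scores = pipeline.generate_entity_matrices(force=True)
matrices_nD = pipeline.do_projection_nD(force=True)
matrices_2D = pipeline.do_projection_2D(force=True)
clustering_results = pipeline.do_clustering(force=True)
pipeline.find_neighbors()
pipeline.export()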
The results are saved under pipeline.working_dir.
for file in pipeline.working_dir.iterdir():
    print(file)
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_eval_negscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_labels.json
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_labels.json
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_eval_posscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_eval_negscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_labels.json
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_eval_posscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_eval_negscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_eval_negscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_ll_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_vll_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_authors_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_eval_posscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_eval_posscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_eval_posscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_eval_posscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_labels.json
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_eval_posscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_ml_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lsa_components.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_articles_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_eval_negscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_hl_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors.index
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/hl_clusters_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_eval_posscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_words.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/words_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/teams.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs_mat.npz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_eval_negscores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_scores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_eval_negscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors_scores.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_eval_negscores.csv.bz2
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/authors.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_articles.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_words_for_authors.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/labs.index.dat
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_teams_for_teams.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ll_clusters_lsa.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/nearest_labs_for_labs.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/articles_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/ml_clusters_umap.npy.gz
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/export.feather
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/vll_clusters_scores.csv.bz2
Let’s view the export.feather file:
import pandas as pd # noqa
df = pd.read_feather(pipeline.working_dir / "export.feather")
df.head()
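The exported dataframe has a nature column (we use it further below to filter the labs rows), so we can quickly check how many points each nature contributes:
print(df["nature"].value_counts())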
As we have saved the plots, we can view them.
We have 4 cluster levels, as defined in the pipeline.yaml file, corresponding to:
hl_clusters,
ml_clusters,
ll_clusters,
vll_clusters.
for file in pipeline.working_dir.glob("*.png"):
    print(file)
import matplotlib.pyplot as plt  # noqa

image_names = []
for nature in pipeline.clustering.natures:
    image_title_parts = pipeline.title_parts_clus(nature)
    image_names.append("_".join(image_title_parts) + ".png")

rows = 2
columns = 2
f, ax = plt.subplots(rows, columns, figsize=(30, 10*rows))

for i, image_name in enumerate(image_names):
    img = plt.imread(pipeline.working_dir / image_name)
    # place the images row by row on the rows x columns grid
    row_num = i // columns
    col_num = i % columns
    ax[row_num][col_num].imshow(img)
    ax[row_num][col_num].axis('off')

plt.tight_layout()
plt.show()

/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_ll_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_vll_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_ml_clusters.png
/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1/lisn_2022.11.15.1_15d3d1d060e5f28d_lsa_200_True_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_hl_clusters.png
We have run the pipeline with the configuration values in the pipeline.yaml file.
Make modifications to the Pipeline loaded from YAML¶
We can make modifications on the pipeline instance and rerun it using pipeline.run, or alternatively run each step individually.
In the rest of this notebook, we will run each step ourselves, changing some parameters along the way.
The current pipeline has natures:
pipeline.natures
['articles', 'authors', 'teams', 'labs', 'words']
In dataset.yaml we can see that the words column uses a CorpusColumn. Let’s verify:
dataset_columns = pipeline.dataset.columns
dataset_columns
[<cartodata.pipeline.columns.IdentityColumn object at 0x7efd55925430>, <cartodata.pipeline.columns.CSColumn object at 0x7efc85c55700>, <cartodata.pipeline.columns.CSColumn object at 0x7efc85c55730>, <cartodata.pipeline.columns.CSColumn object at 0x7efc85c559d0>, <cartodata.pipeline.columns.CorpusColumn object at 0x7efc85c55460>]
The fifth and last column of the dataset is of type cartodata.pipeline.columns.CorpusColumn. We will view the documentation for CorpusColumn.
from cartodata.pipeline.columns import CorpusColumn # noqa
help(CorpusColumn)
Help on class CorpusColumn in module cartodata.pipeline.columns:
class CorpusColumn(Column)
| CorpusColumn(nature, column_names, stopwords=None, english=True, nb_grams=4, min_df=1, max_df=1.0, vocab_sample=None, max_features=None, strip_accents='unicode', lowercase=True, min_word_length=5, normalize=True, vocabulary=None)
|
| A column class that represents an entity specified by multiple text
| columns in the dataset.
|
| For `CorpusColumn`, the entity matrix rows correspond to the rows of
| IdentityColumn. For each n-gram extracted from combined text of specified
| columns, a column is added to the matrix.
|
| Attributes
| ----------
| nature: str
| The nature of the column entity
| column_names: list
| The list of column names in the dataset
| type: Columns
| The type of this column. Its value should be Columns.CORPUS
| stopwords: str
| The name of the stopwords file, or the URL for file that starts with
| http. It should be located under input_dir of the dataset.
| english: bool
| If True it will get union of the specified stopwords file with
| `sklearn.feature_extraction.text.ENGLISH_STOP_WORDS`.
| nb_grams: int
| Maximum n for n-grams.
| min_df: float in range [0.0, 1.0] or int
| When building the vocabulary ignore terms that have a document
| frequency strictly lower than the given threshold. This value is also
| called cut-off in the literature. If float, the parameter represents a
| proportion of documents, integer absolute counts. This parameter is
| ignored if vocabulary is not None.
| max_df: float in range [0.0, 1.0] or int
| When building the vocabulary ignore terms that have a document
| frequency strictly higher than the given threshold (corpus-specific
| stopwords). If float, the parameter represents a proportion of documents,
| integer absolute counts. This parameter is ignored if vocabulary is not
| None.
| vocab_sample: int
| Sample size from the corpus to train the vectorizer
| max_features:int
| If not None, build a vocabulary that only consider the top
| max_features ordered by term frequency across the corpus. Otherwise, all
| features are used.
| strip_accents: {‘ascii’, ‘unicode’} or callable
| lowercase: bool
| Convert all characters to lowercase before tokenizing
| min_word_length: int
| Minimum word length
| normalize: bool
| Normalizes the returned matrix if set to True
| vocabulary: str or set
| Either a set of terms or a file name that contains the terms one in
| each line, or the URL for file that starts with http.
|
| Methods
| -------
| load()
| Loads and processes the column specified by `column_name` in the
| dataset and generates the entity and scores matrices.
| get_corpus(df)
| Creates a text column merging the values of the columns putting
| " . " between column values specified for this column.
|
|
| Corpus column uses custom
| `cartodata.loading.PunctuationCountVectorizer` which splits the
| document on "[.,;]" and builds n-grams from each segment separately.
|
| Method resolution order:
| CorpusColumn
| Column
| cartodata.pipeline.base.BaseEntity
| builtins.object
|
| Methods defined here:
|
| __init__(self, nature, column_names, stopwords=None, english=True, nb_grams=4, min_df=1, max_df=1.0, vocab_sample=None, max_features=None, strip_accents='unicode', lowercase=True, min_word_length=5, normalize=True, vocabulary=None)
| Parameters
| ----------
| nature: str
| The nature of the column entity
| column_names: list
| The list of column names in the dataset
| type: Columns
| The type of this column. Its value should be Columns.CORPUS
| stopwords: str, default=None
| The name of the stopwords file, or the URL for file that starts with
| http. It should be located under input_dir of the dataset.
| english: bool, default=True
| If True it will get union of the specified stopwords file with
| `sklearn.feature_extraction.text.ENGLISH_STOP_WORDS`.
| nb_grams: int, default=4
| Maximum n for n-grams.
| min_df: float in range [0.0, 1.0] or int, default=1
| When building the vocabulary ignore terms that have a document
| frequency strictly lower than the given threshold. This value is also
| called cut-off in the literature. If float, the parameter represents a
| proportion of documents, integer absolute counts. This parameter is
| ignored if vocabulary is not None.
| max_df: float in range [0.0, 1.0] or int, default=1.0
| When building the vocabulary ignore terms that have a document
| frequency strictly higher than the given threshold (corpus-specific
| stopwords). If float, the parameter represents a proportion of
| documents, integer absolute counts. This parameter is ignored if
| vocabulary is not None.
| vocab_sample: int, default=None
| Sample size from the corpus to train the vectorizer
| max_features:int, default=None
| If not None, build a vocabulary that only consider the top
| max_features ordered by term frequency across the corpus. Otherwise,
| all features are used. If vocabulary is not None, this parameter is
| ignored.
| strip_accents: {‘ascii’, ‘unicode’} or callable, default=unicode
| lowercase: bool, default=True
| Convert all characters to lowercase before tokenizing
| min_word_length: int, default 5
| Minimum word length
| normalize: bool, default=True
| Normalizes the returned matrix if set to True
| vocabulary: str or set, default=None
| Either a set of terms or a file name that contains the terms one in
| each line, or the URL for file that starts with http.
|
| get_corpus(self, df)
| Creates a text column merging the values of the columns putting
| " . " between column values specified for this column.
|
| Returns
| -------
| pandas.Series
| a Series that contains merged text from the columns in CorpusColumn
|
| load(self, dataset)
| Loads and processes the column specified by `column_name` in the
| dataset and generates the entity and scores matrices.
|
| Parameters
| ----------
| dataset: cartodata.pipeline.datasets.Dataset
| The dataset from which the column will be loaded
|
| Returns
| -------
| A sparse matrix and a pandas series
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| yaml_tag = '!CorpusColumn'
|
| ----------------------------------------------------------------------
| Readonly properties inherited from Column:
|
| params
| Returns the parameter values of this estimator as string
| concatenated with ``_`` character.
|
| Returns
| -------
| str
|
| ----------------------------------------------------------------------
| Readonly properties inherited from cartodata.pipeline.base.BaseEntity:
|
| params_values
| Returns all parameter name-parameter value pairs as a dictionary.
|
| Returns
| -------
| dict
|
| phase
|
| ----------------------------------------------------------------------
| Data descriptors inherited from cartodata.pipeline.base.BaseEntity:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
CorpusColumn uses a custom PunctuationCountVectorizer to build the n-grams. Let’s say that instead of PunctuationCountVectorizer we would like to use sklearn.feature_extraction.text.TfidfVectorizer.
We will define a custom column class as follows:
import pandas as pd # noqa
import numpy as np # noqa
from sklearn.feature_extraction.text import TfidfVectorizer # noqa
from cartodata.operations import normalize_tfidf # noqa
from cartodata.pipeline.columns import CorpusColumn # noqa
class TfidfCorpusColumn(CorpusColumn):

    def __init__(self, nature, column_names, stopwords, english=True,
                 nb_grams=4, min_df=10, max_df=0.05, vocab_sample=None,
                 max_features=None, strip_accents="unicode",
                 lowercase=True, min_word_length=5, normalize=True,
                 input='content', binary=True, encoding="utf-8"):
        super().__init__(nature, column_names, stopwords, english, nb_grams,
                         min_df, max_df, vocab_sample, max_features,
                         strip_accents, lowercase, min_word_length, normalize)
        self.binary = binary
        self.encoding = encoding
        self.input = input

    def load(self, dataset):
        df = dataset.df
        stopwords = self._load_stopwords(
            dataset.input_dir, self.stopwords, self.english, dataset.name
        )
        # merge the text of the configured columns into a single corpus,
        # separating column values with " . "
        corpus = df[self.column_names].apply(
            lambda row: ' . '.join(row.values.astype(str)), axis=1)
        tf_vectorizer = TfidfVectorizer(
            input=self.input, ngram_range=(1, self.nb_grams),
            min_df=self.min_df, max_df=self.max_df, binary=self.binary,
            strip_accents=self.strip_accents, encoding=self.encoding,
            stop_words=stopwords)
        matrix = tf_vectorizer.fit_transform(corpus)
        # score each n-gram by the number of documents it appears in
        scores = pd.Series(np.bincount(matrix.indices,
                                       minlength=matrix.shape[1]),
                           index=tf_vectorizer.get_feature_names_out())
        if self.normalize:
            matrix = normalize_tfidf(matrix)
        return matrix, scores
dataset_columns contains a shallow copy of the columns of the dataset. Now we will instantiate a new words_column to replace the existing one.
words_column = TfidfCorpusColumn(nature="words",
                                 column_names=["en_abstract_s", "en_title_s",
                                               "en_keyword_s",
                                               "en_domainAllCodeLabel_fs"],
                                 stopwords="stopwords.txt", nb_grams=4,
                                 min_df=10, max_df=0.05, min_word_length=5,
                                 normalize=True)
dataset_columns[4] = words_column
dataset_columns
""
pipeline.working_dir
PosixPath('/builds/2mk6rsew/0/hgozukan/cartolabe-data/dumps/lisn/2022.11.15.1')
We see that the type of the words column has changed. Now we have to set it on the dataset (as dataset_columns is a shallow copy) and verify the pipeline’s natures and the dataset’s columns.
pipeline.dataset.set_columns(dataset_columns)
print(pipeline.natures)
""
print(pipeline.dataset.columns)
['articles', 'authors', 'teams', 'labs', 'words']
[<cartodata.pipeline.columns.IdentityColumn object at 0x7efd55925430>, <cartodata.pipeline.columns.CSColumn object at 0x7efc85c55700>, <cartodata.pipeline.columns.CSColumn object at 0x7efc85c55730>, <cartodata.pipeline.columns.CSColumn object at 0x7efc85c559d0>, <__main__.TfidfCorpusColumn object at 0x7efc78a47e80>]
In the pipeline.working_dir we have the dumps from the previous run.
for file in pipeline.working_dir.iterdir():
    print(file)
We do not want to use the dumps from the previous run. We can either remove the files or run pipeline.generate_entity_matrices with force=True.
We will generate the entity matrices with the new words column and save:
matrices, scores = pipeline.generate_entity_matrices(force=True)
/usr/local/lib/python3.9/site-packages/sklearn/feature_extraction/text.py:406: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ii', 'iii'] not in stop_words.
warnings.warn(
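As a quick sanity check on the regenerated matrices, we can print the dimensions of each entity matrix (a sketch; it assumes matrices is a list aligned with pipeline.natures, an assumption the plotting code below also relies on):
for nature, matrix in zip(pipeline.natures, matrices):
    # each nature has an associated sparse entity matrix
    print(nature, matrix.shape)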
In the pipeline.yaml file, the pipeline is configured to use cartodata.pipeline.projectionnd.LSAProjection.
pipeline.projection_nd
<cartodata.pipeline.projectionnd.LSAProjection object at 0x7efc94e6ca00>
Instead of LSAProjection, this time we would like to use cartodata.pipeline.projectionnd.LDAProjection.
from cartodata.pipeline.projectionnd import LDAProjection # noqa
projection_nd = LDAProjection(num_dims=80)
pipeline.set_projection_nd(projection_nd)
matrices_nD = pipeline.do_projection_nD(force=True)
""
for file in pipeline.working_dir.glob("*.lda"):
print(file)
Then we will project to 2D. We should not forget to set force=True to make sure that the 2D projection is rerun and does not use the saved dumps.
matrices_2D = pipeline.do_projection_2D(force=True)
View the generated map:
from matplotlib.colors import TABLEAU_COLORS # noqa
from matplotlib.lines import Line2D # noqa
labels = tuple(pipeline.natures)
colors = list(TABLEAU_COLORS)[:len(matrices)]
pipeline.plot_map(matrices_2D, labels, colors)

(<Figure size 960x640 with 1 Axes>, <Axes: >)
Now we have all necessary matrices to create clusters and neighbors. First we will create clusters:
(clus_nD, clus_2D, clus_scores, cluster_labels,
 cluster_eval_pos, cluster_eval_neg) = pipeline.do_clustering(force=True)
Warning: Less than 2 words in cluster 35 with (1) words.
Warning: Less than 2 words in cluster 43 with (0) words.
Warning: Less than 2 words in cluster 43 with (0) words.
Warning: Less than 2 words in cluster 52 with (1) words.
Warning: Less than 2 words in cluster 52 with (1) words.
Warning: Less than 2 words in cluster 58 with (1) words.
Warning: Less than 2 words in cluster 58 with (1) words.
Warning: Less than 2 words in cluster 70 with (0) words.
Warning: Less than 2 words in cluster 74 with (0) words.
Warning: Less than 2 words in cluster 74 with (0) words.
Warning: Less than 2 words in cluster 78 with (1) words.
Warning: Less than 2 words in cluster 78 with (1) words.
Warning: Less than 2 words in cluster 86 with (0) words.
Warning: Less than 2 words in cluster 86 with (0) words.
Warning: Less than 2 words in cluster 108 with (1) words.
Warning: Less than 2 words in cluster 108 with (1) words.
Warning: Less than 2 words in cluster 112 with (0) words.
Warning: Less than 2 words in cluster 112 with (0) words.
Warning: Less than 2 words in cluster 154 with (1) words.
Warning: Less than 2 words in cluster 154 with (1) words.
Warning: Less than 2 words in cluster 193 with (1) words.
Warning: Less than 2 words in cluster 193 with (1) words.
Warning: Less than 2 words in cluster 213 with (0) words.
Warning: Less than 2 words in cluster 213 with (0) words.
We will view only the medium level clusters:
ml_index = 1
clus_scores_ml = clus_scores[ml_index]
clus_mat_ml = clus_2D[ml_index]
fig_title = (
    f"{pipeline.dataset.name} {pipeline.dataset.version} "
    f"{pipeline.clustering.natures[ml_index]} {pipeline.projection_nd.key}"
)
fig_ml, ax_ml = pipeline.plot_map(matrices_2D, labels, colors,
                                  title=fig_title,
                                  annotations=clus_scores_ml.index,
                                  annotation_mat=clus_mat_ml)

And save the plot:
pipeline.save_plots()
[(<Figure size 960x640 with 1 Axes>, 'lisn_2022.11.15.1_15d3d1d060e5f28d_lda_80_True_0_20_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_hl_clusters.png'), (<Figure size 960x640 with 1 Axes>, 'lisn_2022.11.15.1_15d3d1d060e5f28d_lda_80_True_0_20_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_ml_clusters.png'), (<Figure size 960x640 with 1 Axes>, 'lisn_2022.11.15.1_15d3d1d060e5f28d_lda_80_True_0_20_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_ll_clusters.png'), (<Figure size 960x640 with 1 Axes>, 'lisn_2022.11.15.1_15d3d1d060e5f28d_lda_80_True_0_20_umap_euclidean_20_0.1_random_1.0_None_None_kmeans_vll_clusters.png')]
Let’s display two cluster images generated by two runs of the pipeline side by side:
image_title_parts = pipeline.title_parts_clus("ml_clusters")
img1 = plt.imread(pipeline.working_dir / image_names[1])
img2 = plt.imread(pipeline.working_dir / ("_".join(image_title_parts) + ".png"))
f, ax = plt.subplots(1, 2, figsize=(20, 10))
ax[0].imshow(img1)
ax[1].imshow(img2)
ax[0].axis('off')
ax[1].axis('off')
plt.tight_layout()
plt.show()

On the left we see the plot of medium level clusters using LSA and UMAP, and on the right we see the results using LDA and UMAP.
Then we will find neighbors:
pipeline.find_neighbors()
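The neighbor computations are dumped as nearest_*_for_*.npy.gz files in pipeline.working_dir (see the listing above). As a sketch, and assuming these dumps are gzip-compressed .npy arrays as their extension suggests, one of them can be inspected with numpy directly:
import gzip  # noqa

with gzip.open(pipeline.working_dir / "nearest_authors_for_articles.npy.gz",
               "rb") as f:
    nearest = np.load(f)
# one row of neighbor indices per article (hypothetical orientation)
print(nearest.shape)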
Now we are ready to export data to export.feather file.
pipeline.export()
df = pd.read_feather(pipeline.working_dir / "export.feather")
df.head()
It is also possible to change the export configuration.
export_natures = pipeline.export_natures
The pipeline.yaml file defines export configuration only for articles and authors.
for nature in export_natures:
    print(nature.key)
articles
authors
Let’s say we would like to add year data to the labs nature.
The original dataset contains a column producedDateY_i which holds the year each article was published. We can add this data as metadata for each point, renaming the column to the clearer alternative year.
We can map the year data to labs using EntityMetadataMapColumn.
from cartodata.pipeline.exporting import EntityMetadataMapColumn # noqa
meta_year_lab = EntityMetadataMapColumn(entity="labs",
                                        column="producedDateY_i",
                                        as_column="year")
Then we will initialize an ExportNature for labs using meta_year_lab.
from cartodata.pipeline.exporting import ExportNature # noqa
ex_lab = ExportNature(key="labs", add_metadata=[meta_year_lab])
new_export_natures = export_natures + [ex_lab]
pipeline.export(new_export_natures)
df = pd.read_feather(pipeline.working_dir / "export.feather")
df[df["nature"] == "labs"].head(5)
Total running time of the script: (3 minutes 30.250 seconds)