.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/pipeline_inriaraweb_lsa_kmeans.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_pipeline_inriaraweb_lsa_kmeans.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_pipeline_inriaraweb_lsa_kmeans.py:


Inria Raweb dataset with Pipeline API
============================================

In this example we will process the Inria Raweb dataset using the `Pipeline` API.

The pipeline will comprise the following steps:

- extract entities
- use Latent Semantic Analysis (LSA) to generate an n-dimensional vector
  representation of the entities
- use Uniform Manifold Approximation and Projection (UMAP) to project those
  entities in 2 dimensions
- use KMeans clustering to cluster the entities
- find their nearest neighbors.

All files necessary to run the Inria Raweb pipeline can be downloaded from https://zenodo.org/record/7970984.

.. GENERATED FROM PYTHON SOURCE LINES 21-27

Create Inria Raweb Dataset
==============================

We will first create a `Dataset` for Inria Raweb.

The CSV file `raweb.csv` contains the dataset data.

.. GENERATED FROM PYTHON SOURCE LINES 27-42

.. code-block:: Python


    from cartodata.pipeline.datasets import CSVDataset  # noqa
    from pathlib import Path # noqa

    ROOT_DIR = Path.cwd().parent
    # The directory where files necessary to load dataset columns reside
    INPUT_DIR = ROOT_DIR / "datas"
    # The directory where the generated dump files will be saved
    TOP_DIR = ROOT_DIR / "dumps"

    dataset = CSVDataset("inriaraweb", input_dir=INPUT_DIR, version="1.0.0", filename="raweb.csv",
                         fileurl="https://zenodo.org/record/7970984/files/raweb.csv")

    dataset.df.head()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Downloading data from https://zenodo.org/records/7970984/files/raweb.csv (154.9 MB)


    file_sizes:   0%|                                    | 0.00/162M [00:00<?, ?B/s]
    file_sizes:   2%|▌                          | 3.11M/162M [00:00<00:05, 29.6MB/s]
    file_sizes:   4%|█                          | 6.26M/162M [00:00<00:06, 25.1MB/s]
    file_sizes:   7%|█▉                         | 12.0M/162M [00:00<00:04, 36.2MB/s]
    file_sizes:  10%|██▋                        | 16.2M/162M [00:00<00:03, 38.2MB/s]
    file_sizes:  13%|███▍                       | 20.4M/162M [00:00<00:04, 35.1MB/s]
    file_sizes:  15%|████                       | 24.6M/162M [00:00<00:03, 34.5MB/s]
    file_sizes:  19%|█████▏                     | 30.9M/162M [00:00<00:03, 40.5MB/s]
    file_sizes:  24%|██████▌                    | 39.3M/162M [00:00<00:02, 51.8MB/s]
    file_sizes:  29%|███████▉                   | 47.7M/162M [00:01<00:02, 54.1MB/s]
    file_sizes:  36%|█████████▋                 | 58.2M/162M [00:01<00:01, 66.5MB/s]
    file_sizes:  41%|███████████                | 66.6M/162M [00:01<00:01, 65.1MB/s]
    file_sizes:  47%|████████████▊              | 77.0M/162M [00:01<00:01, 73.9MB/s]
    file_sizes:  53%|██████████████▏            | 85.4M/162M [00:01<00:01, 74.8MB/s]
    file_sizes:  59%|███████████████▉           | 95.9M/162M [00:01<00:00, 74.4MB/s]
    file_sizes:  66%|██████████████████▎         | 106M/162M [00:01<00:00, 77.5MB/s]
    file_sizes:  71%|███████████████████▊        | 115M/162M [00:01<00:00, 77.2MB/s]
    file_sizes:  77%|█████████████████████▌      | 125M/162M [00:02<00:00, 84.1MB/s]
    file_sizes:  84%|███████████████████████▍    | 136M/162M [00:02<00:00, 80.4MB/s]
    file_sizes:  89%|████████████████████████▊   | 144M/162M [00:02<00:00, 74.9MB/s]
    file_sizes:  97%|███████████████████████████ | 157M/162M [00:02<00:00, 84.8MB/s]
    file_sizes: 100%|████████████████████████████| 162M/162M [00:02<00:00, 66.5MB/s]
    Successfully downloaded file to dumps/inriaraweb/1.0.0/raweb.csv
    /builds/2mk6rsew/0/hgozukan/cartolabe-data/cartodata/pipeline/datasets.py:537: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.
      return pd.read_csv(raw_file, **self.kwargs)


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>name</th>
          <th>team</th>
          <th>teamyear</th>
          <th>year</th>
          <th>center</th>
          <th>theme</th>
          <th>text</th>
          <th>keywords</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>abs2018_presentation_Overall Objectives (0)</td>
          <td>abs</td>
          <td>abs2018</td>
          <td>2018</td>
          <td>Sophia Antipolis - Méditerranée</td>
          <td>Computational Sciences for Biology, Medicine a...</td>
          <td>Genetic data feature sequences of nucleotides ...</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th>1</th>
          <td>abs2018_fondements_Introduction (1)</td>
          <td>abs</td>
          <td>abs2018</td>
          <td>2018</td>
          <td>Sophia Antipolis - Méditerranée</td>
          <td>Computational Sciences for Biology, Medicine a...</td>
          <td>The research conducted by  . – Modeling interf...</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th>2</th>
          <td>abs2018_fondements_Modeling interfaces and con...</td>
          <td>abs</td>
          <td>abs2018</td>
          <td>2018</td>
          <td>Sophia Antipolis - Méditerranée</td>
          <td>Computational Sciences for Biology, Medicine a...</td>
          <td>The Protein Data Bank,  . The description of i...</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th>3</th>
          <td>abs2018_fondements_Modeling macro-molecular as...</td>
          <td>abs</td>
          <td>abs2018</td>
          <td>2018</td>
          <td>Sophia Antipolis - Méditerranée</td>
          <td>Computational Sciences for Biology, Medicine a...</td>
          <td>Large protein assemblies such as the Nuclear P...</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th>4</th>
          <td>abs2018_fondements_Reconstruction by Data Inte...</td>
          <td>abs</td>
          <td>abs2018</td>
          <td>2018</td>
          <td>Sophia Antipolis - Méditerranée</td>
          <td>Computational Sciences for Biology, Medicine a...</td>
          <td>Large protein assemblies such as the Nuclear P...</td>
          <td>NaN</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 43-44

The dataframe that we just read consists of 118455 rows.

.. GENERATED FROM PYTHON SOURCE LINES 44-47

.. code-block:: Python


    dataset.df.shape[0]





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    118455



.. GENERATED FROM PYTHON SOURCE LINES 48-49

And has name, team, teamyear, year, center, theme, text and keywords as columns.

.. GENERATED FROM PYTHON SOURCE LINES 49-52

.. code-block:: Python


    print(*dataset.df.columns, sep="\n")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    name
    team
    teamyear
    year
    center
    theme
    text
    keywords




.. GENERATED FROM PYTHON SOURCE LINES 53-76

Now we should define our entities and set the column names corresponding to those entities in the data file. We have 7 entities:

=========  =======================
entity     column name in the file
=========  =======================
rawebpart  name
teams      team
cwords     text
teamyear   teamyear
center     center
theme      theme
words      text
=========  =======================


Cartolabe provides 4 types of columns:


- **IdentityColumn**: The entity of this column represents the main entity of the dataset. The column data corresponding to the entity in the file should contain a single value, and this value should be unique among column values. There can be only one `IdentityColumn` in the dataset.
- **CSColumn**: The entity of this column type is related to the main entity, and can contain a single value or comma-separated values.
- **CorpusColumn**: The entity of this column type is the corpus related to the main entity. This can be a combination of multiple columns in the file. It uses a modified version of `CountVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>`_.
- **TfidfCorpusColumn**: The entity of this column type is the corpus related to the main entity. This can be a combination of multiple columns in the file, or can contain a file path from which to read the text corpus. It uses `TfidfVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html>`_.

To define the columns we will need additional files, `stopwords_raweb.txt` and `inriavocab.csv`. We can download them from Zenodo and save them under the `../datas` directory.


.. GENERATED FROM PYTHON SOURCE LINES 76-88

.. code-block:: Python


    from download import download  # noqa

    stopwords_url = "https://zenodo.org/record/7970984/files/stopwords_raweb.txt"
    vocab_url = "https://zenodo.org/record/7970984/files/inriavocab.csv"

    download(stopwords_url, INPUT_DIR / "stopwords_raweb.txt", kind='file',
             progressbar=True, replace=False)

    download(vocab_url, INPUT_DIR / "inriavocab.csv", kind='file',
             progressbar=True, replace=False)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Downloading data from https://zenodo.org/records/7970984/files/stopwords_raweb.txt (5 kB)


    file_sizes:   0%|                                   | 0.00/4.85k [00:00<?, ?B/s]
    file_sizes: 100%|██████████████████████████| 4.85k/4.85k [00:00<00:00, 5.03MB/s]
    Successfully downloaded file to /builds/2mk6rsew/0/hgozukan/cartolabe-data/datas/stopwords_raweb.txt
    Downloading data from https://zenodo.org/records/7970984/files/inriavocab.csv (1.1 MB)


    file_sizes:   0%|                                   | 0.00/1.18M [00:00<?, ?B/s]
    file_sizes: 100%|██████████████████████████| 1.18M/1.18M [00:00<00:00, 19.7MB/s]
    Successfully downloaded file to /builds/2mk6rsew/0/hgozukan/cartolabe-data/datas/inriavocab.csv

    '/builds/2mk6rsew/0/hgozukan/cartolabe-data/datas/inriavocab.csv'



.. GENERATED FROM PYTHON SOURCE LINES 89-90

In this dataset, **rawebpart** is our main entity. We will define it as an `IdentityColumn`:

.. GENERATED FROM PYTHON SOURCE LINES 90-116

.. code-block:: Python


    from cartodata.pipeline.columns import IdentityColumn, CSColumn, CorpusColumn  # noqa


    rawebpart_column = IdentityColumn(nature="rawebpart", column_name="name")

    teams_column = CSColumn(nature="teams", column_name="team",
                            filter_min_score=4)

    cwords_column = CorpusColumn(nature="cwords", column_names=["text"],
                                 stopwords="stopwords_raweb.txt", nb_grams=4,
                                 min_df=25, max_df=0.05, normalize=True,
                                 vocabulary="inriavocab.csv")

    teamyear_column = CSColumn(nature="teamyear", column_name="teamyear",
                               filter_min_score=4)

    center_column = CSColumn(nature="center", column_name="center", separator=";")

    theme_column = CSColumn(nature="theme", column_name="theme", separator=";",
                            filter_nan=True)

    words_column = CorpusColumn(nature="words", column_names=["text"],
                                stopwords="stopwords_raweb.txt", nb_grams=4,
                                min_df=25, max_df=0.1, normalize=True)








.. GENERATED FROM PYTHON SOURCE LINES 117-118

Now we are going to set the columns of the dataset:

.. GENERATED FROM PYTHON SOURCE LINES 118-122

.. code-block:: Python


    dataset.set_columns([rawebpart_column, teams_column, cwords_column, teamyear_column,
                         center_column, theme_column, words_column])








.. GENERATED FROM PYTHON SOURCE LINES 123-131

We can set the columns in any order that we prefer. We will set the first entity as the identity entity and the last entity as the corpus. If we set the entities in a different order, the `Dataset` will put the main entity first.

The dataset for Inria Raweb data is ready. Now we will create and run our pipeline. For this pipeline, we will:

- run LSA projection -> N-dimensional
- run UMAP projection  -> 2D
- cluster entities
- find nearest neighbors

.. GENERATED FROM PYTHON SOURCE LINES 133-137

Create and run pipeline
=========================

We will first create a pipeline with the dataset.

.. GENERATED FROM PYTHON SOURCE LINES 137-142

.. code-block:: Python


    from cartodata.pipeline.common import Pipeline  # noqa

    pipeline = Pipeline(dataset=dataset, top_dir=TOP_DIR, input_dir=INPUT_DIR)








.. GENERATED FROM PYTHON SOURCE LINES 143-144

The workflow generates the `natures` from dataset columns.

.. GENERATED FROM PYTHON SOURCE LINES 144-147

.. code-block:: Python


    pipeline.natures





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    ['rawebpart', 'teams', 'cwords', 'teamyear', 'center', 'theme', 'words']



.. GENERATED FROM PYTHON SOURCE LINES 148-154

Creating correspondence matrices for each entity type
-------------------------------------------------------------------------------

Now we want to extract matrices that will map the correspondence between each name in the dataset and the entities we want to use.

`Pipeline` has a `generate_entity_matrices` function that generates the matrices and scores for each entity (nature) specified for the dataset.

.. GENERATED FROM PYTHON SOURCE LINES 154-157

.. code-block:: Python


    matrices, scores = pipeline.generate_entity_matrices(force=True)








.. GENERATED FROM PYTHON SOURCE LINES 158-163

**Rawebpart**

The first matrix in matrices and the first Series in scores correspond to **rawebpart**.

The type of the **rawebpart** column is `IdentityColumn`. It generates a matrix that simply maps each row entry to itself.

.. GENERATED FROM PYTHON SOURCE LINES 163-167

.. code-block:: Python


    rawebpart_mat = matrices[0]
    rawebpart_mat.shape





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    (118455, 118455)



.. GENERATED FROM PYTHON SOURCE LINES 168-169

Having type `IdentityColumn`, each item has a score of 1.

.. GENERATED FROM PYTHON SOURCE LINES 169-175

.. code-block:: Python


    rawebpart_scores = scores[0]
    rawebpart_scores.shape

    rawebpart_scores.head()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    abs2018_presentation_Overall Objectives (0)                   1.0
    abs2018_fondements_Introduction (1)                           1.0
    abs2018_fondements_Modeling interfaces and contacts (2)       1.0
    abs2018_fondements_Modeling macro-molecular assemblies (3)    1.0
    abs2018_fondements_Reconstruction by Data Integration (4)     1.0
    dtype: float64



.. GENERATED FROM PYTHON SOURCE LINES 176-181

**Teams**

The second matrix in matrices and the second Series in scores correspond to **teams**.

The type of the **teams** column is `CSColumn`. It generates a sparse matrix whose rows correspond to the entries in the dataset and whose columns correspond to the teams obtained by splitting the comma-separated values.
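
The mapping that a `CSColumn` performs can be sketched with plain pandas (this illustrates the idea only, not Cartolabe's actual implementation; the toy `teams` series is made up):

```python
# Minimal illustration of how a CSColumn-style mapping turns a
# comma-separated column into an entries x entities matrix.
import pandas as pd

teams = pd.Series(["abs", "abs,acumes", "agora"])

# one row per dataset entry, one column per distinct team;
# 1 marks that the entry belongs to the team
mat = teams.str.get_dummies(sep=",")
print(mat)

# column sums play the role of the entity scores
print(mat.sum())
```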

.. GENERATED FROM PYTHON SOURCE LINES 181-190

.. code-block:: Python


    teams_mat = matrices[1]
    teams_mat.shape

    teams_scores = scores[1]
    teams_scores.head()

    teams_scores.shape





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    (469,)



.. GENERATED FROM PYTHON SOURCE LINES 191-196

**Cwords**

The third matrix in matrices and the third Series in scores correspond to **cwords**.

The type of the **cwords** column is `CorpusColumn`. It uses the text column in the dataset and extracts n-grams from that corpus using the fixed vocabulary `../datas/inriavocab.csv`. Finally it generates a sparse matrix whose rows correspond to the entries in the dataset and whose columns correspond to the n-grams.

.. GENERATED FROM PYTHON SOURCE LINES 196-203

.. code-block:: Python


    cwords_mat = matrices[2]
    cwords_mat.shape

    cwords_scores = scores[2]
    cwords_scores.head()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    abandon        32
    abbreviated    32
    abdalla        49
    abdominal      50
    abelian        86
    dtype: int64



.. GENERATED FROM PYTHON SOURCE LINES 204-209

**Teamyear**

The fourth matrix in matrices and the fourth Series in scores correspond to **teamyear**.

The type of the **teamyear** column is `CSColumn`. It generates a sparse matrix whose rows correspond to the entries in the dataset and whose columns correspond to the team-year values obtained by splitting the comma-separated values.

.. GENERATED FROM PYTHON SOURCE LINES 209-216

.. code-block:: Python


    teamyear_mat = matrices[3]
    teamyear_mat.shape

    teamyear_scores = scores[3]
    teamyear_scores.head()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    abs2018       29
    acumes2018    46
    agora2018     43
    airsea2018    49
    alice2018     26
    dtype: int64



.. GENERATED FROM PYTHON SOURCE LINES 217-222

**Center**

The fifth matrix in matrices and the fifth Series in scores correspond to **center**.

The type of the **center** column is `CSColumn`. It generates a sparse matrix whose rows correspond to the entries in the dataset and whose columns correspond to the centers.

.. GENERATED FROM PYTHON SOURCE LINES 222-229

.. code-block:: Python


    center_mat = matrices[4]
    center_mat.shape

    center_scores = scores[4]
    center_scores.head()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Sophia Antipolis - Méditerranée    20884
    Grenoble - Rhône-Alpes             19230
    Nancy - Grand Est                  12171
    Paris                               4595
    Bordeaux - Sud-Ouest                8728
    dtype: int64



.. GENERATED FROM PYTHON SOURCE LINES 230-235

**Theme**

The sixth matrix in matrices and the sixth Series in scores correspond to **theme**.

The type of the **theme** column is `CSColumn`. It generates a sparse matrix whose rows correspond to the entries in the dataset and whose columns correspond to the themes obtained by splitting the values on the `;` separator.

.. GENERATED FROM PYTHON SOURCE LINES 235-242

.. code-block:: Python


    theme_mat = matrices[5]
    theme_mat.shape

    theme_scores = scores[5]
    theme_scores.head()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Computational Sciences for Biology, Medicine and the Environment    18003
    Applied Mathematics, Computation and Simulation                     16704
    Networks, Systems and Services, Distributed Computing               19866
    Perception, Cognition, Interaction                                  22013
    Algorithmics, Programming, Software and Architecture                19166
    dtype: int64



.. GENERATED FROM PYTHON SOURCE LINES 243-248

**Words**

The seventh matrix in matrices and the seventh Series in scores correspond to **words**.

The type of the **words** column is `CorpusColumn`. It creates a corpus from the text column in the dataset and extracts n-grams from that corpus. Finally it generates a sparse matrix whose rows correspond to the entries in the dataset and whose columns correspond to the n-grams.

.. GENERATED FROM PYTHON SOURCE LINES 248-252

.. code-block:: Python


    words_mat = matrices[6]
    words_mat.shape





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    (118455, 56532)



.. GENERATED FROM PYTHON SOURCE LINES 253-258

Here we see that there are 56532 distinct n-grams.

The series, which we named `words_scores`, contains the list of n-grams,
each with a score equal to the number of rows that the n-gram was mapped
to within the `words_mat` matrix.

.. GENERATED FROM PYTHON SOURCE LINES 258-262

.. code-block:: Python


    words_scores = scores[6]
    words_scores.head(10)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    015879     93
    10000     159
    100ms      27
    150000     27
    1960s      43
    1970s      74
    1980s      67
    1990s      98
    20000      39
    2000s      47
    dtype: int64



.. GENERATED FROM PYTHON SOURCE LINES 263-273

Dimension reduction
------------------------------

One way to see the matrices that we created is as coordinates in the space of
all articles. What we want to do is to reduce the dimension of this space to
make it easier to work with and see.

**LSA projection**

We'll start by using the LSA (Latent Semantic Analysis) technique to reduce the number of rows in our data.
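
The idea behind this step can be sketched with scikit-learn's `TruncatedSVD`, which performs the truncated SVD at the heart of LSA (a sketch under the assumption that `LSAProjection` relies on a similar decomposition; the random sparse matrix below stands in for a real entries × n-grams matrix):

```python
# Plain scikit-learn sketch of the LSA step: truncated SVD on a sparse
# term matrix, keeping only the first few latent dimensions.
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import random as sparse_random

# stand-in for an entries x n-grams count matrix
mat = sparse_random(200, 50, density=0.1, random_state=0)

svd = TruncatedSVD(n_components=10, random_state=0)
reduced = svd.fit_transform(mat)

# 200 entries, each now represented in 10 latent dimensions
print(reduced.shape)
```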

.. GENERATED FROM PYTHON SOURCE LINES 273-281

.. code-block:: Python


    from cartodata.pipeline.projectionnd import LSAProjection  # noqa

    num_dim = 100

    lsa_projection = LSAProjection(num_dim)
    pipeline.set_projection_nd(lsa_projection)








.. GENERATED FROM PYTHON SOURCE LINES 282-287

Now we can run LSA projection on the matrices.

Among our matrices, we have 2 columns generated from a given corpus: cwords and words. When we create the dataset and set its columns, the dataset stores the index of the corpus column as `corpus_index`. When there is more than one column of type `cartodata.pipeline.columns.CorpusColumn`, the index of the last one is set as `corpus_index`; in our case, 6.

We would like to use the cwords column as the corpus column for the LSA projection, so before running the projection we should set the `corpus_index`.

.. GENERATED FROM PYTHON SOURCE LINES 287-296

.. code-block:: Python


    pipeline.dataset.corpus_index = 2

    ""
    matrices_nD = pipeline.do_projection_nD(matrices, force=True)

    for nature, matrix in zip(pipeline.natures, matrices_nD):
        print(f"{nature}  -------------   {matrix.shape}")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    rawebpart  -------------   (100, 118455)
    teams  -------------   (100, 469)
    cwords  -------------   (100, 29348)
    teamyear  -------------   (100, 2715)
    center  -------------   (100, 10)
    theme  -------------   (100, 5)
    words  -------------   (100, 56532)




.. GENERATED FROM PYTHON SOURCE LINES 297-309

We have 100 rows for each entity.

This makes it easier to work with them for clustering or nearest neighbors
tasks, but we also want to project them on a 2D space to be able to map them.

**UMAP projection**

The `UMAP <https://github.com/lmcinnes/umap>`_ (Uniform Manifold Approximation
and Projection) is a dimension reduction technique that can be used for
visualisation similarly to t-SNE.

We use this algorithm to project our matrices in 2 dimensions.

.. GENERATED FROM PYTHON SOURCE LINES 309-317

.. code-block:: Python


    from cartodata.pipeline.projection2d import UMAPProjection  # noqa


    umap_projection = UMAPProjection(n_neighbors=10, min_dist=0.1)

    pipeline.set_projection_2d(umap_projection)








.. GENERATED FROM PYTHON SOURCE LINES 318-319

Now we can run UMAP projection on the LSA matrices.

.. GENERATED FROM PYTHON SOURCE LINES 319-322

.. code-block:: Python


    matrices_2D = pipeline.do_projection_2D(force=True)








.. GENERATED FROM PYTHON SOURCE LINES 323-325

Now that we have 2D coordinates for our points, we can try to plot them to
get a feel of the data's shape.

.. GENERATED FROM PYTHON SOURCE LINES 325-332

.. code-block:: Python


    labels = tuple(pipeline.natures)
    colors = ['darkgreen', 'red', 'cyan', 'navy',
              'peru', 'gold', 'pink', 'cornflowerblue']

    fig, ax = pipeline.plot_map(matrices_2D, labels, colors)




.. image-sg:: /auto_examples/images/sphx_glr_pipeline_inriaraweb_lsa_kmeans_001.png
   :alt: pipeline inriaraweb lsa kmeans
   :srcset: /auto_examples/images/sphx_glr_pipeline_inriaraweb_lsa_kmeans_001.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 333-342

As we don't have labels for the points, the plot above doesn't make much sense
as is. But we can see that the data forms some clusters which we could try to
identify.

Clustering
---------------

In order to identify clusters, we use the KMeans clustering technique on the
articles. We'll also try to label these clusters by selecting the most
frequent words that appear in each cluster's articles.
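
The basic procedure can be sketched with stock scikit-learn (the pipeline's `KMeansClustering` encapsulates a similar but more elaborate procedure; the toy points and words below are made up):

```python
# Sketch of the clustering idea: cluster 2D points with KMeans, then
# label each cluster with the most frequent word among its members.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two well-separated blobs of 20 points each
points = np.vstack([rng.normal(0, 0.5, (20, 2)),
                    rng.normal(5, 0.5, (20, 2))])
words = ["protein"] * 20 + ["network"] * 20

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

for cluster_id in range(2):
    members = [w for w, lab in zip(words, kmeans.labels_)
               if lab == cluster_id]
    # the most frequent word among the members becomes the cluster label
    label = max(set(members), key=members.count)
    print(cluster_id, label)
```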

.. GENERATED FROM PYTHON SOURCE LINES 342-345

.. code-block:: Python


    from cartodata.pipeline.clustering import KMeansClustering  # noqa








.. GENERATED FROM PYTHON SOURCE LINES 346-347

Levels of clusters, hl: high level, ml: medium level, ll: low level, vll: very low level.

.. GENERATED FROM PYTHON SOURCE LINES 347-354

.. code-block:: Python

    cluster_natures = ["hl_clusters", "ml_clusters", "ll_clusters", "vll_clusters"]

    kmeans_clustering = KMeansClustering(
        n=8, base_factor=3, natures=cluster_natures)

    pipeline.set_clustering(kmeans_clustering)








.. GENERATED FROM PYTHON SOURCE LINES 355-356

Now we can run clustering on the matrices.

.. GENERATED FROM PYTHON SOURCE LINES 356-360

.. code-block:: Python


    (clus_nD, clus_2D, clus_scores, cluster_labels,
    cluster_eval_pos, cluster_eval_neg) = pipeline.do_clustering()








.. GENERATED FROM PYTHON SOURCE LINES 361-362

As we have specified four levels of clustering, the returned lists will have four values.

.. GENERATED FROM PYTHON SOURCE LINES 362-365

.. code-block:: Python


    len(clus_2D)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    4



.. GENERATED FROM PYTHON SOURCE LINES 366-367

We will now display two levels of clusters in separate plots, starting with the high level clusters:

.. GENERATED FROM PYTHON SOURCE LINES 367-377

.. code-block:: Python


    clus_scores_hl = clus_scores[0]
    clus_mat_hl = clus_2D[0]


    fig_hl, ax_hl = pipeline.plot_map(matrices_2D, labels, colors,
                                      title="Inria Raweb Dataset High Level Clusters",
                                      annotations=clus_scores_hl.index,
                                      annotation_mat=clus_mat_hl)




.. image-sg:: /auto_examples/images/sphx_glr_pipeline_inriaraweb_lsa_kmeans_002.png
   :alt: Inria Raweb Dataset High Level Clusters
   :srcset: /auto_examples/images/sphx_glr_pipeline_inriaraweb_lsa_kmeans_002.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 378-382

The 8 high level clusters that we created give us a general idea of what the big
clusters of data contain.

With medium level clusters we have a finer level of detail:

.. GENERATED FROM PYTHON SOURCE LINES 382-398

.. code-block:: Python


    clus_scores_ml = clus_scores[1]
    clus_mat_ml = clus_2D[1]

    fig_ml, ax_ml = pipeline.plot_map(matrices_2D, labels, colors,
                                      title="Inria Raweb Dataset Medium Level Clusters",
                                      annotations=clus_scores_ml.index,
                                      annotation_mat=clus_mat_ml)
    ""
    pipeline.save_plot(fig_hl, "inriaraweb_hl_clusters.png")
    pipeline.save_plot(fig_ml, "inriaraweb_ml_clusters.png")


    for file in pipeline.top_dir.glob("*.png"):
        print(file)








.. GENERATED FROM PYTHON SOURCE LINES 399-410

Nearest neighbors
----------------------------

One more thing which could be useful to appreciate the quality of our data
would be to get each point's nearest neighbors.

Finding nearest neighbors is a common task with various algorithms aiming to
solve it. The `find_neighbors` method uses one of these algorithms to find the
nearest points of all entities. It takes an optional weight parameter to tweak
the distance calculation to select points that have a higher score but are
maybe a bit farther instead of just selecting the closest neighbors.
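
The core lookup can be sketched with scikit-learn's `NearestNeighbors` (a plain illustration of the idea; `AllNeighbors` adds the score weighting described above):

```python
# Plain nearest-neighbor lookup on 2D points, the basic operation
# behind find_neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
points = rng.random((100, 2))  # stand-in for the 2D projected entities

nn = NearestNeighbors(n_neighbors=3).fit(points)
distances, indices = nn.kneighbors(points[:1])

# the query point is in the fitted set, so its nearest neighbor is
# itself, at distance 0
print(indices[0])
print(distances[0][0])
```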

.. GENERATED FROM PYTHON SOURCE LINES 410-423

.. code-block:: Python


    from cartodata.pipeline.neighbors import AllNeighbors  # noqa

    n_neighbors = 10
    weights = [0, 0, 0, 0, 0, 0, 0.3]

    neighboring = AllNeighbors(n_neighbors=n_neighbors, power_scores=weights)

    pipeline.set_neighboring(neighboring)

    pipeline.find_neighbors()









.. GENERATED FROM PYTHON SOURCE LINES 424-451

Export file using exporter
==============================

We can now export the data. To export the data, we need to configure the exporter.

The exported data will be the points extracted from the dataset corresponding to the entities that we have defined.

In the export file, we will have the following columns for each point:


=============  ================================================================
column         value
=============  ================================================================
nature         one of rawebpart, teams, cwords, teamyear, center, theme, words
label          point's label
score          point's score
rank           point's rank
x              point's x location on the map
y              point's y location on the map
nn_rawebpart   neighboring rawebpart entries to this point
nn_teams       neighboring teams to this point
nn_cwords      neighboring cwords to this point
nn_teamyear    neighboring teamyear values to this point
nn_center      neighboring centers to this point
nn_theme       neighboring themes to this point
nn_words       neighboring words to this point
=============  ================================================================

We will call the `pipeline.export` function. It will create an `export.feather` file and save it under `pipeline.top_dir`.

.. GENERATED FROM PYTHON SOURCE LINES 451-454

.. code-block:: Python


    pipeline.export()








.. GENERATED FROM PYTHON SOURCE LINES 455-456

Let's display the contents of the file.

.. GENERATED FROM PYTHON SOURCE LINES 456-462

.. code-block:: Python


    import pandas as pd  # noqa

    df = pd.read_feather(pipeline.working_dir / "export.feather")
    df.head()






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>nature</th>
          <th>label</th>
          <th>score</th>
          <th>rank</th>
          <th>x</th>
          <th>y</th>
          <th>nn_rawebpart</th>
          <th>nn_teams</th>
          <th>nn_cwords</th>
          <th>nn_teamyear</th>
          <th>nn_center</th>
          <th>nn_theme</th>
          <th>nn_words</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>rawebpart</td>
          <td>abs2018_presentation_Overall Objectives (0)</td>
          <td>1.0</td>
          <td>0</td>
          <td>15.988950</td>
          <td>5.829741</td>
          <td>30400,78992,60333,10081,0,95842,40748,50613,10...</td>
          <td>118793,118455,118704,118760,118597,118560,1186...</td>
          <td>124140,136585,147930,140420,124139,146548,1290...</td>
          <td>150181,150563,150731,150744,150915,150902,1505...</td>
          <td>150992,150996,150990,150993,150995,150988,1509...</td>
          <td>150998,150997,151001,151000,150999</td>
          <td>161309,166421,183775,189040,206871,185628,1578...</td>
        </tr>
        <tr>
          <th>1</th>
          <td>rawebpart</td>
          <td>abs2018_fondements_Introduction (1)</td>
          <td>1.0</td>
          <td>1</td>
          <td>8.111205</td>
          <td>0.651574</td>
          <td>732,415,8512,8969,1290,258,5716,6417,5906,446</td>
          <td>118455,118597,118704,118620,118563,118758,1186...</td>
          <td>133632,135516,133631,132061,135519,132064,1237...</td>
          <td>150042,150638,149160,150448,150251,150181,1493...</td>
          <td>150992,150988,150991,150996,150987,150993,1509...</td>
          <td>150997,150998,150999,151001,151000</td>
          <td>183144,180152,176856,183145,176857,180153,1837...</td>
        </tr>
        <tr>
          <th>2</th>
          <td>rawebpart</td>
          <td>abs2018_fondements_Modeling interfaces and con...</td>
          <td>1.0</td>
          <td>2</td>
          <td>0.950007</td>
          <td>17.706457</td>
          <td>30402,40750,2,60336,78995,95845,103765,69757,8...</td>
          <td>118455,118597,118736,118774,118483,118463,1185...</td>
          <td>132060,141830,135098,126767,120051,121264,1336...</td>
          <td>150042,150448,150251,150638,149827,149386,1496...</td>
          <td>150992,150987,150990,150988,150996,150995,1509...</td>
          <td>151001,151000,150997,150998,150999</td>
          <td>176861,176845,192838,168888,199623,182955,1952...</td>
        </tr>
        <tr>
          <th>3</th>
          <td>rawebpart</td>
          <td>abs2018_fondements_Modeling macro-molecular as...</td>
          <td>1.0</td>
          <td>3</td>
          <td>7.937434</td>
          <td>0.504809</td>
          <td>30403,10084,3,40751,20160,50616,69758,30404,10...</td>
          <td>118455,118483,118597,118918,118704,118698,1188...</td>
          <td>140203,132812,146551,120678,140205,142768,1427...</td>
          <td>148964,150251,150448,149086,148856,148742,1496...</td>
          <td>150988,150996,150991,150995,150992,150990,1509...</td>
          <td>150997,150998,150999,151000,151001</td>
          <td>192446,192447,178626,178627,183762,165961,1940...</td>
        </tr>
        <tr>
          <th>4</th>
          <td>rawebpart</td>
          <td>abs2018_fondements_Reconstruction by Data Inte...</td>
          <td>1.0</td>
          <td>4</td>
          <td>7.934400</td>
          <td>0.505640</td>
          <td>30404,4,60338,10085,40752,20161,50617,69759,3,...</td>
          <td>118455,118597,118483,118918,118704,118698,1188...</td>
          <td>132812,140203,120678,146551,140205,140213,1265...</td>
          <td>148964,150251,150448,149086,148856,149609,1487...</td>
          <td>150996,150988,150991,150992,150995,150989,1509...</td>
          <td>150997,150999,150998,151001,151000</td>
          <td>183762,192446,178626,185977,192447,178627,1659...</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 463-464

This is a basic export file. For each point, we can add further columns, for example the indices of referenced entities of other natures and extra metadata such as the year.

.. GENERATED FROM PYTHON SOURCE LINES 464-499

.. code-block:: Python


    from cartodata.pipeline.exporting import (
        ExportNature, MetadataColumn
    )  # noqa
    import pandas as pd  # noqa


    meta_year_article = MetadataColumn(column="year", as_column="year",
                                       func="x.astype(str)")

    ex_rawebpart = ExportNature(key="rawebpart",
                                refs=["center", "teams",
                                      "cwords", "theme", "words"],
                                add_metadata=[meta_year_article])

    ex_teams = ExportNature(key="teams",
                            refs=["center", "cwords", "theme", "words"])

    ex_teamyear = ExportNature(key="teamyear",
                               refs=["center", "teams", "cwords", "theme", "words"])

    pipeline.export(export_natures=[ex_rawebpart, ex_teams, ex_teamyear])

    df = pd.read_feather(pipeline.working_dir / "export.feather")
    df.head(5)

    df[df.nature == "rawebpart"].head(1)

    df[df.nature == "teams"].head(1)

    df[df.nature == "teamyear"].head(5)
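
The ``func`` string passed to ``MetadataColumn`` appears to describe a pandas transformation applied to the column, here casting the year to string. Independently of cartodata, the transformation itself amounts to the following minimal sketch (assuming a plain integer ``year`` column):

.. code-block:: Python

    import pandas as pd

    # Equivalent of func="x.astype(str)" applied to a year column
    x = pd.Series([2018, 2019, 2020], name="year")
    as_str = x.astype(str)
    as_str.tolist()
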





.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>nature</th>
          <th>label</th>
          <th>score</th>
          <th>rank</th>
          <th>x</th>
          <th>y</th>
          <th>nn_rawebpart</th>
          <th>nn_teams</th>
          <th>nn_cwords</th>
          <th>nn_teamyear</th>
          <th>nn_center</th>
          <th>nn_theme</th>
          <th>nn_words</th>
          <th>center</th>
          <th>teams</th>
          <th>cwords</th>
          <th>theme</th>
          <th>words</th>
          <th>year</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>148272</th>
          <td>teamyear</td>
          <td>abs2018</td>
          <td>29.0</td>
          <td>148272</td>
          <td>7.980851</td>
          <td>0.599843</td>
          <td>732,415,8512,8969,1290,258,5716,6417,5906,446</td>
          <td>118455,118920,118918,118690,118704,118506,1185...</td>
          <td>120537,123288,138528,121264,127390,130600,1370...</td>
          <td>148272,148491,149160,148709,148932,149386,1502...</td>
          <td>150992,150993,150987,150996,150991,150995,1509...</td>
          <td>150998,150997,150999,151001,151000</td>
          <td>205458,178193,181003,159710,205490,154847,1659...</td>
          <td>150987</td>
          <td>118455</td>
          <td>119064,119438,119475,119525,119565,119581,1196...</td>
          <td>150997</td>
          <td>151318,151929,152064,152178,152256,152327,1523...</td>
          <td>None</td>
        </tr>
        <tr>
          <th>148273</th>
          <td>teamyear</td>
          <td>acumes2018</td>
          <td>46.0</td>
          <td>148273</td>
          <td>5.878605</td>
          <td>1.921777</td>
          <td>108509,108746,255,1290,4456,64695,23730,730,11...</td>
          <td>118456,118683,118598,118765,118719,118499,1188...</td>
          <td>142536,146631,133639,146638,124691,146637,1443...</td>
          <td>148273,148492,148562,148710,148933,149317,1495...</td>
          <td>150995,150991,150990,150988,150987,150992,1509...</td>
          <td>150998,150997,150999,151000,151001</td>
          <td>204382,180157,180164,204390,204394,204399,2044...</td>
          <td>150987</td>
          <td>118456</td>
          <td>119110,119115,119117,119132,119162,119175,1192...</td>
          <td>150998</td>
          <td>151102,151141,151144,151329,151356,151371,1513...</td>
          <td>None</td>
        </tr>
        <tr>
          <th>148274</th>
          <td>teamyear</td>
          <td>agora2018</td>
          <td>43.0</td>
          <td>148274</td>
          <td>4.853877</td>
          <td>2.024834</td>
          <td>732,415,8512,8969,1290,258,5716,6417,5906,446</td>
          <td>118457,118723,118841,118530,118524,118686,1185...</td>
          <td>122063,121487,138786,136289,133625,140743,1473...</td>
          <td>148274,148493,148922,149151,149377,149601,1498...</td>
          <td>150988,150995,150990,150993,150987,150994,1509...</td>
          <td>150999,151000,150998,150997,151001</td>
          <td>182595,205689,158094,197423,203670,157050,1769...</td>
          <td>150988</td>
          <td>118457</td>
          <td>119000,119010,119024,119034,119037,119038,1190...</td>
          <td>150999</td>
          <td>151044,151102,151141,151188,151199,151221,1512...</td>
          <td>None</td>
        </tr>
        <tr>
          <th>148275</th>
          <td>teamyear</td>
          <td>airsea2018</td>
          <td>49.0</td>
          <td>148275</td>
          <td>9.110352</td>
          <td>11.509332</td>
          <td>732,415,8512,8969,1290,258,5716,6417,5906,446</td>
          <td>118458,118764,118708,118640,118528,118462,1188...</td>
          <td>120816,147476,120823,137018,120819,120820,1370...</td>
          <td>148275,148711,148494,148934,149961,150176,1497...</td>
          <td>150988,150995,150990,150987,150992,150994,1509...</td>
          <td>150997,150998,150999,151001,151000</td>
          <td>154644,186394,154839,170724,206032,170732,1677...</td>
          <td>150988</td>
          <td>118458</td>
          <td>118934,118954,119000,119024,119091,119104,1191...</td>
          <td>150997</td>
          <td>151051,151088,151188,151191,151244,151258,1513...</td>
          <td>None</td>
        </tr>
        <tr>
          <th>148276</th>
          <td>teamyear</td>
          <td>alice2018</td>
          <td>26.0</td>
          <td>148276</td>
          <td>12.489779</td>
          <td>3.616100</td>
          <td>732,415,8512,8969,1290,258,5716,6417,5906,446</td>
          <td>118459,118752,118589,118467,118658,118736,1187...</td>
          <td>136414,129748,139525,129815,146154,134707,1297...</td>
          <td>148276,148495,148571,148712,149835,148376,1489...</td>
          <td>150990,150989,150988,150992,150993,150991,1509...</td>
          <td>150998,151000,151001,150997,150999</td>
          <td>161021,187933,185358,155801,156616,201920,2019...</td>
          <td>150989</td>
          <td>118459</td>
          <td>119029,119057,119212,119216,119315,119320,1193...</td>
          <td>151000</td>
          <td>151258,151260,151304,151559,151569,151832,1518...</td>
          <td>None</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />
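
The ``nn_*`` columns in the export store each point's nearest neighbors as comma-separated index strings. As a minimal sketch (using a hypothetical mini-frame with the same column layout, not the real export), they can be parsed back into integer lists for programmatic lookup:

.. code-block:: Python

    import pandas as pd

    # Hypothetical frame mimicking the export layout
    df = pd.DataFrame({
        "label": ["abs2018"],
        "nn_teams": ["118455,118920,118918"],
    })

    # Split the comma-separated string and cast each index to int
    neighbors = df["nn_teams"].str.split(",").apply(
        lambda idx: [int(i) for i in idx])
    neighbors.iloc[0]
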


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (20 minutes 46.555 seconds)


.. _sphx_glr_download_auto_examples_pipeline_inriaraweb_lsa_kmeans.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: pipeline_inriaraweb_lsa_kmeans.ipynb <pipeline_inriaraweb_lsa_kmeans.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: pipeline_inriaraweb_lsa_kmeans.py <pipeline_inriaraweb_lsa_kmeans.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: pipeline_inriaraweb_lsa_kmeans.zip <pipeline_inriaraweb_lsa_kmeans.zip>`


.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_