.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/workflow_lisn_doc2vec_kmeans.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_workflow_lisn_doc2vec_kmeans.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_workflow_lisn_doc2vec_kmeans.py:

Extracting and processing LISN data for Cartolabe (Doc2vec projection)
=========================================================================

In this example we will:

- extract entities (authors, articles, labs, words) from a collection of
  scientific articles
- project those entities in 2 dimensions
- cluster them
- find their nearest neighbors.

.. GENERATED FROM PYTHON SOURCE LINES 16-20

Download data
=============

We will first download the CSV file that contains all articles from HAL
(https://hal.archives-ouvertes.fr/) published by authors from LISN
(Laboratoire Interdisciplinaire des Sciences du Numérique) between 2000 and
2022.

.. GENERATED FROM PYTHON SOURCE LINES 20-28

.. code-block:: Python

    from download import download

    csv_url = "https://zenodo.org/record/7323538/files/lisn_2000_2022.csv"

    download(csv_url, "../datas/lisn_2000_2022.csv", kind='file',
             progressbar=True, replace=False)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Replace is False and data exists, so doing nothing. Use replace=True to re-download the data.

    '../datas/lisn_2000_2022.csv'

.. GENERATED FROM PYTHON SOURCE LINES 29-31

Load data into a dataframe
==========================

.. GENERATED FROM PYTHON SOURCE LINES 31-38

.. code-block:: Python

    import pandas as pd  # noqa

    df = pd.read_csv('../datas/lisn_2000_2022.csv', index_col=0)
    df.head()

..
raw:: html <div class="output_subarea output_html rendered_html output_result"> <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>structId_i</th> <th>authFullName_s</th> <th>en_abstract_s</th> <th>en_keyword_s</th> <th>en_title_s</th> <th>structAcronym_s</th> <th>producedDateY_i</th> <th>producedDateM_i</th> <th>halId_s</th> <th>docid</th> <th>en_domainAllCodeLabel_fs</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>[2544, 92966, 411575, 441569]</td> <td>Frédéric Blanqui</td> <td>In the last twenty years, several approaches t...</td> <td>Higher-order rewriting,Termination,Confluence</td> <td>Termination and Confluence of Higher-Order Rew...</td> <td>LRI,UP11,CNRS,LISN</td> <td>2000</td> <td>7.0</td> <td>inria-00105556</td> <td>105556</td> <td>Logic in Computer Science,Computer Science</td> </tr> <tr> <th>1</th> <td>[2544, 92966, 411575, 441569]</td> <td>Sébastien Tixeuil</td> <td>When a distributed system is subject to transi...</td> <td>Self-stabilization,Distributed Systems,Distrib...</td> <td>Efficient Self-stabilization</td> <td>LRI,UP11,CNRS,LISN</td> <td>2000</td> <td>1.0</td> <td>tel-00124843</td> <td>124843</td> <td>Networking and Internet Architecture,Computer ...</td> </tr> <tr> <th>2</th> <td>[1167, 300340, 301492, 564132, 441569, 2544, 9...</td> <td>Michèle Sebag,Céline Rouveirol</td> <td>One of the obstacles to widely using first-ord...</td> <td>Bounded reasoning,First order logic,Inductive ...</td> <td>Resource-bounded relational reasoning: inducti...</td> <td>LMS,X,PSL,CNRS,LRI,UP11,CNRS,LISN</td> <td>2000</td> <td>NaN</td> <td>hal-00111312</td> <td>2263842</td> <td>Mechanics,Engineering Sciences,physics</td> </tr> <tr> <th>3</th> <td>[994, 15786, 301340, 303171, 441569, 34499, 81...</td> <td>Philippe Balbiani,Jean-François Condotta,Gérar...</td> <td>This paper organizes the topologic forms of th...</td> <td>Temporal reasoning,Constraint handling,Computa...</td> <td>Reasoning about generalized intervals : Horn r...</td> <td>LIPN,UP13,USPC,CNRS,IRIT,UT1,UT2J,UT3,CNRS,Tou...</td> <td>2000</td> <td>NaN</td> <td>hal-03300321</td> <td>3300321</td> <td>Artificial Intelligence,Computer Science</td> </tr> <tr> <th>4</th> <td>[1315, 25027, 59704, 564132, 300009, 441569, 4...</td> <td>Roberto Di Cosmo,Delia Kesner,Emmanuel Polonovski</td> <td>We refine the simulation technique introduced ...</td> <td>Linear logic,Proof nets,Lambda-calculus,Explic...</td> <td>Proof Nets and Explicit Substitutions</td> <td>LIENS,DI-ENS,ENS-PSL,PSL,Inria,CNRS,CNRS,LRI,U...</td> <td>2000</td> <td>NaN</td> <td>hal-00384955</td> <td>384955</td> <td>Logic in Computer Science,Computer Science</td> </tr> </tbody> </table> </div> </div> <br /> <br /> .. GENERATED FROM PYTHON SOURCE LINES 39-40 The dataframe that we just read consists of 4262 articles as rows. .. GENERATED FROM PYTHON SOURCE LINES 40-43 .. code-block:: Python print(df.shape[0]) .. rst-class:: sphx-glr-script-out .. code-block:: none 4262 .. GENERATED FROM PYTHON SOURCE LINES 44-45 And their authors, abstract, keywords, title, research labs and domain as columns. .. GENERATED FROM PYTHON SOURCE LINES 45-48 .. code-block:: Python print(*df.columns, sep="\n") .. rst-class:: sphx-glr-script-out .. 
code-block:: none

    structId_i
    authFullName_s
    en_abstract_s
    en_keyword_s
    en_title_s
    structAcronym_s
    producedDateY_i
    producedDateM_i
    halId_s
    docid
    en_domainAllCodeLabel_fs

.. GENERATED FROM PYTHON SOURCE LINES 49-62

Creating correspondence matrices for each entity type
======================================================

From this table of articles, we want to extract matrices that map the
correspondence between the articles and the entities we want to use.

Authors
-------

Let's start with the authors, for example. We want to create a matrix whose
rows represent the articles and whose columns represent the authors. Cell
(n, m) contains a 1 if the *nth* article was written by the *mth* author.

.. GENERATED FROM PYTHON SOURCE LINES 62-67

.. code-block:: Python

    from cartodata.loading import load_comma_separated_column  # noqa

    authors_mat, authors_scores = load_comma_separated_column(df, 'authFullName_s')

.. GENERATED FROM PYTHON SOURCE LINES 68-77

The `load_comma_separated_column` function takes a dataframe and the name of a
column and returns two objects:

- a sparse matrix
- a pandas `Series`

Each column of the sparse matrix `authors_mat` corresponds to an author and
each row corresponds to an article. We see that there are 7348 distinct
authors for 4262 articles.

.. GENERATED FROM PYTHON SOURCE LINES 77-80

.. code-block:: Python

    authors_mat.shape

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    (4262, 7348)

.. GENERATED FROM PYTHON SOURCE LINES 81-85

The series, which we named `authors_scores`, contains the list of authors
extracted from the column `authFullName_s`, each with a score equal to the
number of rows (articles) mapped to that author in the `authors_mat` matrix.

.. GENERATED FROM PYTHON SOURCE LINES 85-88

.. code-block:: Python

    authors_scores.head()

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Frédéric Blanqui       4
    Sébastien Tixeuil     47
    Michèle Sebag        137
    Céline Rouveirol       2
    Philippe Balbiani      2
    dtype: int64

.. GENERATED FROM PYTHON SOURCE LINES 89-92

If we look at the *2nd* column of the matrix, which corresponds to the author
**Sébastien Tixeuil**, we can see that it has 47 non-zero rows, each row
indicating an article he authored.

.. GENERATED FROM PYTHON SOURCE LINES 92-95

.. code-block:: Python

    print(authors_mat[:, 1])

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

      (1, 0)    1
      (7, 0)    1
      (22, 0)   1
      (60, 0)   1
      (128, 0)  1
      (136, 0)  1
      (150, 0)  1
      (179, 0)  1
      (205, 0)  1
      (212, 0)  1
      (233, 0)  1
      (238, 0)  1
      (241, 0)  1
      (246, 0)  1
      (262, 0)  1
      (282, 0)  1
      (294, 0)  1
      (356, 0)  1
      (358, 0)  1
      (359, 0)  1
      (363, 0)  1
      (371, 0)  1
      (372, 0)  1
      (409, 0)  1
      (498, 0)  1
      (501, 0)  1
      (536, 0)  1
      (541, 0)  1
      (542, 0)  1
      (878, 0)  1
      (893, 0)  1
      (1600, 0) 1
      (1717, 0) 1
      (2037, 0) 1
      (2075, 0) 1
      (2116, 0) 1
      (2222, 0) 1
      (2373, 0) 1
      (2449, 0) 1
      (2450, 0) 1
      (2611, 0) 1
      (2732, 0) 1
      (2976, 0) 1
      (2986, 0) 1
      (3107, 0) 1
      (3221, 0) 1
      (3791, 0) 1

.. GENERATED FROM PYTHON SOURCE LINES 96-101

Labs
----

Similarly, we can create matrices for the labs by simply passing the
`structAcronym_s` column to the function.

.. GENERATED FROM PYTHON SOURCE LINES 101-108

.. code-block:: Python

    labs_mat, labs_scores = load_comma_separated_column(df,
                                                        'structAcronym_s',
                                                        filter_acronyms=True)
    labs_scores.head()

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    LRI      4789
    UP11     6271
    CNRS    10217
    LISN     5203
    LMS         1
    dtype: int64
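To make the correspondence concrete, here is a minimal standalone sketch of
the idea (not cartodata's implementation) using only pandas and scipy on a
made-up toy column:

.. code-block:: Python

    import pandas as pd
    from scipy import sparse

    toy = pd.Series(["LRI,CNRS", "CNRS", "LRI,UP11"])  # 3 toy "articles"

    # one 0/1 column per acronym, one row per article
    dummies = toy.str.get_dummies(sep=',')
    toy_mat = sparse.csr_matrix(dummies.values)

    # scores = number of articles each entity appears in
    toy_scores = dummies.sum()

    print(toy_mat.shape)  # (3, 3): 3 articles x 3 labs
    print(toy_scores)     # CNRS 2, LRI 2, UP11 1

The real loader works the same way, just at the scale of the full dataframe
and with extra options such as `filter_acronyms` above.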
.. GENERATED FROM PYTHON SOURCE LINES 109-111

Checking the number of columns of the sparse matrix `labs_mat`, we see that
there are 1818 distinct labs.

.. GENERATED FROM PYTHON SOURCE LINES 111-114

.. code-block:: Python

    labs_mat.shape[1]

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    1818

.. GENERATED FROM PYTHON SOURCE LINES 115-124

Filtering low score entities
----------------------------

A lot of the authors and labs that we just extracted from the dataframe have
a very low score, which means they're only linked to one or two articles. To
improve the quality of our data, we'll filter the authors and labs by
removing those that appear fewer than 4 times. To do this, we'll use the
`filter_min_score` function.

.. GENERATED FROM PYTHON SOURCE LINES 124-145

.. code-block:: Python

    from cartodata.operations import filter_min_score  # noqa

    authors_before = len(authors_scores)
    labs_before = len(labs_scores)

    authors_mat, authors_scores = filter_min_score(authors_mat,
                                                   authors_scores,
                                                   4)
    labs_mat, labs_scores = filter_min_score(labs_mat,
                                             labs_scores,
                                             4)

    print(f"Removed {authors_before - len(authors_scores)} authors with less "
          f"than 4 articles from a total of {authors_before} authors.")
    print(f"Working with {len(authors_scores)} authors.\n")

    print(f"Removed {labs_before - len(labs_scores)} labs with less than "
          f"4 articles from a total of {labs_before}.")
    print(f"Working with {len(labs_scores)} labs.")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Removed 6654 authors with less than 4 articles from a total of 7348 authors.
    Working with 694 authors.

    Removed 1255 labs with less than 4 articles from a total of 1818.
    Working with 563 labs.

.. GENERATED FROM PYTHON SOURCE LINES 146-154

Words
-----

For the words, it's a bit trickier because we want to extract n-grams (groups
of n terms) instead of just comma-separated values. We call the
`load_text_column` function, which uses scikit-learn's `CountVectorizer
<https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>`_
to create a vocabulary and map the tokens.

.. GENERATED FROM PYTHON SOURCE LINES 154-173

.. code-block:: Python

    from cartodata.loading import load_text_column  # noqa
    from sklearn.feature_extraction import text as sktxt  # noqa

    with open('../datas/stopwords.txt', 'r') as stop_file:
        stopwords = sktxt.ENGLISH_STOP_WORDS.union(
            set(stop_file.read().splitlines()))

    df['text'] = df['en_abstract_s'] + ' ' \
        + df['en_keyword_s'].astype(str) + ' ' \
        + df['en_title_s'].astype(str) + ' ' \
        + df['en_domainAllCodeLabel_fs'].astype(str)

    words_mat, words_scores = load_text_column(df['text'],
                                               4,
                                               10,
                                               0.05,
                                               stopwords=stopwords)

.. GENERATED FROM PYTHON SOURCE LINES 174-176

Here `words_scores` contains the list of all the n-grams extracted from the
documents with their score,

.. GENERATED FROM PYTHON SOURCE LINES 176-179

.. code-block:: Python

    words_scores.head()

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    abilities     21
    ability      164
    absence       53
    absolute      19
    abstract     174
    dtype: int64

.. GENERATED FROM PYTHON SOURCE LINES 180-182

and the `words_mat` matrix counts the occurrences of each of the 4682 n-grams
in each of the articles.

.. GENERATED FROM PYTHON SOURCE LINES 182-185

.. code-block:: Python

    words_mat.shape

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    (4262, 4682)
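As a rough picture of the `CountVectorizer` step that `load_text_column`
builds on, here is a standalone sketch; the parameters below are illustrative
and are not a claim about how `load_text_column` maps its own arguments:

.. code-block:: Python

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["self stabilization in distributed systems",
            "distributed systems and machine learning",
            "machine learning for proof assistants"]

    # extract 1- to 4-grams; min_df/max_df drop n-grams that are too rare
    # or that appear in too large a fraction of the documents
    vectorizer = CountVectorizer(ngram_range=(1, 4), min_df=2, max_df=0.95)
    counts = vectorizer.fit_transform(docs)

    print(counts.shape)                            # (articles, n-grams kept)
    print(vectorizer.get_feature_names_out()[:5])  # first few vocabulary entries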
.. GENERATED FROM PYTHON SOURCE LINES 186-193

To get a better representation of the importance of each term, we'll also
apply a TF-IDF (term-frequency times inverse document-frequency)
normalization to the matrix. The `normalize_tfidf` function simply calls
scikit-learn's `TfidfTransformer
<https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer>`_
class.

.. GENERATED FROM PYTHON SOURCE LINES 193-198

.. code-block:: Python

    from cartodata.operations import normalize_tfidf  # noqa

    words_mat = normalize_tfidf(words_mat)

.. GENERATED FROM PYTHON SOURCE LINES 199-203

Articles
--------

Finally, we need to create a matrix that simply maps each article to itself.

.. GENERATED FROM PYTHON SOURCE LINES 203-209

.. code-block:: Python

    from cartodata.loading import load_identity_column  # noqa

    articles_mat, articles_scores = load_identity_column(df, 'en_title_s')
    articles_scores.head()

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Termination and Confluence of Higher-Order Rewrite Systems                                    1.0
    Efficient Self-stabilization                                                                  1.0
    Resource-bounded relational reasoning: induction and deduction through stochastic matching    1.0
    Reasoning about generalized intervals : Horn representability and tractability                1.0
    Proof Nets and Explicit Substitutions                                                         1.0
    dtype: float64

.. GENERATED FROM PYTHON SOURCE LINES 210-232

Dimension reduction
===================

One way to see the matrices that we created is as coordinates in the space of
all articles. What we want to do is reduce the dimension of this space to
make it easier to work with and to visualize.

Doc2vec projection
------------------

This example uses the Doc2vec technique to learn a latent representation of
the documents and thus reduce the number of rows in our matrices. The
`doc2vec_projection` method takes at least four arguments:

- the number of dimensions you want to keep
- the position of the documents/words frequency matrix in the list of
  matrices (here ``2``, the index of `words_mat`)
- the list of matrices to project
- the list of texts to index

(Additional doc2vec parameters can be passed through, with the same syntax as
the gensim Doc2Vec function.) It returns a list of matrices projected in the
latent space. We also apply an l2 normalization to each feature of the
projected matrices.

.. GENERATED FROM PYTHON SOURCE LINES 232-243

.. code-block:: Python

    from cartodata.projection import doc2vec_projection  # noqa
    from cartodata.operations import normalize_l2  # noqa

    doc2vec_matrices = doc2vec_projection(50,
                                          2,
                                          [articles_mat, authors_mat,
                                           words_mat, labs_mat],
                                          df['text'])

    doc2vec_matrices = list(map(normalize_l2, doc2vec_matrices))

.. GENERATED FROM PYTHON SOURCE LINES 244-246

We've reduced the number of rows in each of `articles_mat`, `authors_mat`,
`words_mat` and `labs_mat` to just 50.

.. GENERATED FROM PYTHON SOURCE LINES 246-252

.. code-block:: Python

    print(f"articles_mat: {doc2vec_matrices[0].shape}")
    print(f"authors_mat: {doc2vec_matrices[1].shape}")
    print(f"words_mat: {doc2vec_matrices[2].shape}")
    print(f"labs_mat: {doc2vec_matrices[3].shape}")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    articles_mat: (50, 4262)
    authors_mat: (50, 694)
    words_mat: (50, 4682)
    labs_mat: (50, 563)
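Under the hood, the projection relies on gensim's Doc2Vec. A minimal
standalone sketch of what training such an embedding looks like (illustrative
hyperparameters; cartodata also takes care of projecting the author, word and
lab matrices into the same latent space):

.. code-block:: Python

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    texts = ["proof nets and explicit substitutions",
             "efficient self stabilization in distributed systems"]
    corpus = [TaggedDocument(words=t.split(), tags=[i])
              for i, t in enumerate(texts)]

    # vector_size plays the role of the 50 dimensions requested above
    model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=20)

    print(model.dv[0].shape)  # (50,) -- one latent vector per document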
.. GENERATED FROM PYTHON SOURCE LINES 253-264

This makes it easier to work with them for clustering or nearest neighbors
tasks, but we also want to project them on a 2D space to be able to map them.

UMAP projection
---------------

`UMAP <https://github.com/lmcinnes/umap>`_ (Uniform Manifold Approximation
and Projection) is a dimension reduction technique that can be used for
visualisation similarly to t-SNE. We use this algorithm to project our
matrices in 2 dimensions.

.. GENERATED FROM PYTHON SOURCE LINES 264-269

.. code-block:: Python

    from cartodata.projection import umap_projection  # noqa

    umap_matrices = umap_projection(doc2vec_matrices)

.. GENERATED FROM PYTHON SOURCE LINES 270-272

Now that we have 2D coordinates for our points, we can plot them to get a
feel of the data's shape.

.. GENERATED FROM PYTHON SOURCE LINES 272-304

.. code-block:: Python

    import matplotlib.pyplot as plt  # noqa
    import numpy as np  # noqa
    import seaborn as sns  # noqa
    # %matplotlib inline

    sns.set(style='white', rc={'figure.figsize': (12, 8)})

    labels = ('article', 'auth', 'words', 'labs')
    colors = ['g', 'r', 'b', 'y']
    markers = ['o', 's', '+', 'x']


    def plot(matrices):
        plt.close('all')
        fig, ax = plt.subplots()
        axes = []

        # one scatter plot per entity type, with its own color and marker
        for i, m in enumerate(matrices):
            axes.append(ax.scatter(m[0, :], m[1, :],
                                   color=colors[i],
                                   marker=markers[i],
                                   label=labels[i]))

        ax.legend((axes[0], axes[1], axes[2], axes[3]),
                  labels,
                  fancybox=True,
                  shadow=True)

        return fig, ax


    fig, ax = plot(umap_matrices)

.. image-sg:: /auto_examples/images/sphx_glr_workflow_lisn_doc2vec_kmeans_001.png
   :alt: workflow lisn doc2vec kmeans
   :srcset: /auto_examples/images/sphx_glr_workflow_lisn_doc2vec_kmeans_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 305-316

On the plot above, articles are shown in green, authors in red, words in blue
and labs in yellow. Because we don't have labels for the points, it doesn't
tell us much as is. But we can see that the data forms some clusters, which
we can now try to identify.

Clustering
==========

In order to identify clusters, we apply the KMeans clustering technique to
the articles. We'll also try to label these clusters by selecting the most
frequent words that appear in each cluster's articles.

.. GENERATED FROM PYTHON SOURCE LINES 316-343

.. code-block:: Python

    from cartodata.clustering import create_kmeans_clusters  # noqa

    cluster_labels = []
    c_doc2vec, c_umap, c_scores, c_knn, _, _, _ = create_kmeans_clusters(
        8,                    # number of clusters to create
        umap_matrices[0],     # the 2D matrix of articles
        umap_matrices[2],     # the 2D matrix of words
        words_mat,            # the articles to words matrix
        words_scores,         # word scores
        cluster_labels,       # a list of initial cluster labels
        doc2vec_matrices[2])  # doc2vec space matrix of words

    fig, ax = plot(umap_matrices)
    for i in range(8):
        ax.annotate(c_scores.index[i],
                    (c_umap[0, i], c_umap[1, i]),
                    color='red')

.. image-sg:: /auto_examples/images/sphx_glr_workflow_lisn_doc2vec_kmeans_002.png
   :alt: workflow lisn doc2vec kmeans
   :srcset: /auto_examples/images/sphx_glr_workflow_lisn_doc2vec_kmeans_002.png
   :class: sphx-glr-single-img
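The labelling idea can be sketched independently of cartodata: cluster the 2D
article coordinates with scikit-learn's KMeans, then label each cluster with
the most frequent words of its articles. The data below is a random stand-in,
and this is only an approximation of what `create_kmeans_clusters` does (it
also positions the labels and computes neighbors):

.. code-block:: Python

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    coords = rng.normal(size=(200, 2))         # stand-in 2D article coordinates
    counts = rng.poisson(1.0, size=(200, 30))  # stand-in articles x words counts
    vocab = np.array([f"word{i}" for i in range(30)])

    km = KMeans(n_clusters=8, n_init=10).fit(coords)

    # label each cluster with its two most frequent words
    for c in range(8):
        totals = counts[km.labels_ == c].sum(axis=0)
        print(c, ", ".join(vocab[totals.argsort()[-2:][::-1]]))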
.. GENERATED FROM PYTHON SOURCE LINES 344-349

The 8 clusters that we created give us a general idea of what the big groups
of data contain. But we'll probably want a finer level of detail when we zoom
in and focus on smaller areas. So we'll also create a second, larger group of
clusters, simply by increasing the number of clusters we ask for.

.. GENERATED FROM PYTHON SOURCE LINES 349-359

.. code-block:: Python

    mc_doc2vec, mc_umap, mc_scores, mc_knn, _, _, _ = create_kmeans_clusters(
        32,
        umap_matrices[0],
        umap_matrices[2],
        words_mat,
        words_scores,
        cluster_labels,
        doc2vec_matrices[2])
    mc_scores

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    molecular, metabolism                                  94
    systems biology, regulatory networks                   88
    human interaction, interfaces                         198
    discrete, combinatorics                               176
    specification, programming languages                  161
    scientific workflows, information theoretic            99
    inference, evolutionary robotics                      169
    linear systems, discrete event systems                113
    stochastic, integer linear programming                131
    ontologies, information retrieval                     201
    modeling simulation, services                         111
    monte carlo, parameter                                175
    social sciences, community                            175
    neural networks, adaptation strategies                100
    networking internet, consumption                      170
    architectures, storage                                193
    abstract model, simulators                             63
    stabilizing, protocols                                105
    display, touch                                        172
    graph searching, maximum degree                        25
    queries, resource description framework               144
    large hadron collider, machine learning challenge      20
    participants, argue                                   135
    challenge, machine translation                        100
    secondary structure, structural                       102
    french, multilingual                                  119
    numerical simulations, fluid                          111
    vertices, minimum                                     129
    verification, proof assistant                         193
    convolutional neural networks, trained                 91
    evolution strategies, testbed                         214
    representations, exploring                            185
    dtype: int64

.. GENERATED FROM PYTHON SOURCE LINES 360-377

Nearest neighbors
-----------------

Another useful way to appreciate the quality of our data is to look at each
point's nearest neighbors. If our data processing is done correctly, we
expect related articles, labs, words and authors to be located close to each
other.

Finding nearest neighbors is a common task with various algorithms aiming to
solve it. The `get_neighbors` method uses one of these algorithms to find the
nearest points of each type. It takes an optional weight parameter that
tweaks the distance calculation, so that points with a higher score can be
selected even if they are a bit farther away, instead of just the closest
neighbors.

Because we want to find the neighbors of each type (articles, authors, words,
labs) for all of the entities, we call the `get_neighbors` method in a loop
and store its results in a list.

.. GENERATED FROM PYTHON SOURCE LINES 377-390

.. code-block:: Python

    from cartodata.neighbors import get_neighbors  # noqa

    scores = [articles_scores, authors_scores, words_scores, labs_scores]
    weights = [0, 0.5, 0.5, 0]

    all_neighbors = []

    for idx in range(len(doc2vec_matrices)):
        all_neighbors.append(get_neighbors(doc2vec_matrices[idx],
                                           scores[idx],
                                           doc2vec_matrices,
                                           weights[idx]))

.. GENERATED FROM PYTHON SOURCE LINES 391-395

Exporting
---------

We now have sufficient data to create a meaningful visualization.

.. GENERATED FROM PYTHON SOURCE LINES 395-421

.. code-block:: Python

    from cartodata.operations import export_to_json  # noqa

    natures = ['articles', 'authors', 'words', 'labs',
               'hl_clusters', 'ml_clusters']

    export_file = '../datas/lisn_workflow_doc2vec.json'

    # add the clusters to the list of 2D matrices and scores
    matrices = list(umap_matrices)
    matrices.extend([c_umap, mc_umap])
    scores.extend([c_scores, mc_scores])

    # create a JSON export file with all the info
    export_to_json(natures, matrices, scores,
                   export_file,
                   neighbors_natures=natures[:4],
                   neighbors=all_neighbors)

.. GENERATED FROM PYTHON SOURCE LINES 422-425

This creates the `lisn_workflow_doc2vec.json` file, which contains a list of
points ready to be imported into Cartolabe. Have a look at it to check that
it contains everything.

.. GENERATED FROM PYTHON SOURCE LINES 425-432

.. code-block:: Python

    import json  # noqa

    with open(export_file, 'r') as f:
        data = json.load(f)

    data[1]['position']

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    [4.724753379821777, 0.9927496314048767]
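As a last sanity check, we can compare the number of exported points with the
number of entities we processed. This sketch assumes the export writes one
record per entity of every nature, which matches what we passed in but is not
guaranteed by the API:

.. code-block:: Python

    import json  # noqa

    with open(export_file, 'r') as f:
        points = json.load(f)

    # articles + authors + words + labs + the two cluster levels
    expected = sum(len(s) for s in scores)
    print(len(points), expected)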
.. rst-class:: sphx-glr-timing

**Total running time of the script:** (1 minutes 12.205 seconds)

.. _sphx_glr_download_auto_examples_workflow_lisn_doc2vec_kmeans.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: workflow_lisn_doc2vec_kmeans.ipynb <workflow_lisn_doc2vec_kmeans.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: workflow_lisn_doc2vec_kmeans.py <workflow_lisn_doc2vec_kmeans.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: workflow_lisn_doc2vec_kmeans.zip <workflow_lisn_doc2vec_kmeans.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_