.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/workflow_aligned_lisn_lsa_kmeans.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_workflow_aligned_lisn_lsa_kmeans.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_workflow_aligned_lisn_lsa_kmeans.py:

Extracting and processing LISN data for Cartolabe with Aligned UMAP (LSA projection)
=====================================================================================

In this example, we will:

- extract entities (authors, articles, labs, words) from a collection of
  scientific articles
- project those entities into 2 dimensions using aligned UMAP
- cluster them
- find their nearest neighbors.

This example uses 6 datasets containing articles from HAL
(https://hal.archives-ouvertes.fr/) published by authors from LISN
(Laboratoire Interdisciplinaire des Sciences du Numérique). Each dataset
contains cumulative data, from 2010-2012 up to 2010-2022 (inclusive). Using
aligned UMAP will help us keep the UMAP embeddings of consecutive datasets
aligned with each other.

.. GENERATED FROM PYTHON SOURCE LINES 20-23

.. code-block:: Python


    # %matplotlib widget

.. GENERATED FROM PYTHON SOURCE LINES 24-28

Download data
================

We will start by downloading 6 datasets from HAL.

.. GENERATED FROM PYTHON SOURCE LINES 28-60

.. code-block:: Python


    from cartodata.scraping import scrape_hal, process_domain_column  # noqa
    import os  # noqa


    def fetch_data(from_year, to_year, struct_ids, struct='hal'):
        """
        Fetch scientific publications of struct from HAL.
        """
        filename = f"../datas/{struct.lower()}_{from_year}_{to_year - 1}.csv"

        if os.path.exists(filename):
            return

        filters = {}
        if struct:
            filters['structId_i'] = struct_ids
        years = range(from_year, to_year)

        df = scrape_hal(struct, filters, years, cool_down=2)
        process_domain_column(df)

        # Save the dataframe into a csv file
        df.to_csv(filename)

        return df


    # Fetch LISN data from year 2010 to 2022 inclusive
    for i in range(2, 14, 2):
        fetch_data(2010, 2011+i, "(1050003 2544)", 'lisn')

.. GENERATED FROM PYTHON SOURCE LINES 61-65

Load data
================

Now we will load each downloaded dataset into a dataframe.

.. GENERATED FROM PYTHON SOURCE LINES 65-85

.. code-block:: Python


    import pandas as pd  # noqa


    def load_data(from_year, total, struct):
        dataframes = []

        for i in range(total):
            file_name = f"../datas/{struct.lower()}_{from_year}_{from_year + (2 + 2 * i)}.csv"
            print(file_name)
            df = pd.read_csv(file_name, index_col=0)
            dataframes.append(df)

        return dataframes


    dataframes = load_data(2010, 6, "lisn")

    dataframes[0].head()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    ../datas/lisn_2010_2012.csv
    ../datas/lisn_2010_2014.csv
    ../datas/lisn_2010_2016.csv
    ../datas/lisn_2010_2018.csv
    ../datas/lisn_2010_2020.csv
    ../datas/lisn_2010_2022.csv


.. 
raw:: html <div class="output_subarea output_html rendered_html output_result"> <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>structId_i</th> <th>authFullName_s</th> <th>en_abstract_s</th> <th>en_keyword_s</th> <th>en_title_s</th> <th>structAcronym_s</th> <th>producedDateY_i</th> <th>producedDateM_i</th> <th>halId_s</th> <th>docid</th> <th>en_domainAllCodeLabel_fs</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>[2544, 92966, 411575, 441569, 56057, 2544, 929...</td> <td>Olivier Teytaud</td> <td>This document is devoted to artificial intelli...</td> <td>Evolutionary optimization,Parallelism</td> <td>Artificial Intelligence and Optimization with ...</td> <td>LRI,UP11,CNRS,TAO,LRI,UP11,CNRS,Inria,LISN</td> <td>2010</td> <td>11.0</td> <td>tel-01078099</td> <td>1078099</td> <td>Optimization and Control,Mathematics</td> </tr> <tr> <th>1</th> <td>[2544, 92966, 411575, 441569, 56057, 2544, 929...</td> <td>J.-B Hoock,O Teytaud</td> <td>When looking for relevant mutations of a learn...</td> <td>Mots-clés Reinforcement learning,Monte-Carlo ...</td> <td>Bandit-Based Genetic Programming with Applicat...</td> <td>LRI,UP11,CNRS,TAO,LRI,UP11,CNRS,Inria,LISN</td> <td>2010</td> <td>5.0</td> <td>hal-01098456</td> <td>1098456</td> <td>Computer Science and Game Theory,Computer Science</td> </tr> <tr> <th>2</th> <td>[16574, 179741, 300351, 56050, 2544, 92966, 41...</td> <td>Pierre Cubaud,Alexandre Topol,Emmanuel Pietriga</td> <td>This document provides feedback about the curr...</td> <td>NaN</td> <td>ALMA Graphical User Interfaces</td> <td>CEDRIC,ENSIIE,CNAM,IN-SITU,LRI,UP11,CNRS,Inria...</td> <td>2010</td> <td>3.0</td> <td>hal-01126094</td> <td>1126094</td> <td>Computer Interaction,Computer Science</td> </tr> <tr> <th>3</th> <td>[2544, 92966, 411575, 441569, 3210, 301243, 30...</td> <td>Dominique Gouyou-Beauchamps,Cyril Nicaud</td> <td>Generalizing an idea used by Alonso to generat...</td> <td>Random generation,Binomial distribution</td> <td>Random Generation Using Binomial Approximations</td> <td>LRI,UP11,CNRS,LIGM,UPEM,ENPC,BEZOUT,CNRS,CNRS,...</td> <td>2010</td> <td>NaN</td> <td>hal-01185570</td> <td>1185570</td> <td>Computational Geometry,Computer Science</td> </tr> <tr> <th>4</th> <td>[391379, 233, 93591, 441569, 103784, 2071, 300...</td> <td>Olivier Bodini,Yann Ponty</td> <td>We address the uniform random generation of wo...</td> <td>Boltzmann sampling,Context-free languages,Rand...</td> <td>Multi-dimensional Boltzmann Sampling of Languages</td> <td>APR,LIP6,UPMC,CNRS,AMIB,LIX,X,IP Paris,CNRS,LR...</td> <td>2010</td> <td>6.0</td> <td>hal-00450763</td> <td>1185592</td> <td>Computational Geometry,Computer Science</td> </tr> </tbody> </table> </div> </div> <br /> <br /> .. GENERATED FROM PYTHON SOURCE LINES 86-87 We can list the total number of articles starting from 2010 to a certain year. .. GENERATED FROM PYTHON SOURCE LINES 87-91 .. code-block:: Python for i in range(6): print(f"{2010 + (i*2) + 2} => {dataframes[i].shape[0]}") .. rst-class:: sphx-glr-script-out .. code-block:: none 2012 => 765 2014 => 1361 2016 => 1983 2018 => 2514 2020 => 2945 2022 => 2987 .. 
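
Because each dataset is cumulative, the difference between two consecutive
totals gives the number of articles added in each two-year window. As a quick
sketch (it only relies on the `dataframes` list loaded above), we can print
these increments:

.. code-block:: Python


    # Articles added in each two-year window (the datasets are cumulative).
    previous = 0
    for i, df in enumerate(dataframes):
        year = 2012 + 2 * i
        print(f"2010-{year}: {df.shape[0]} articles "
              f"(+{df.shape[0] - previous} new)")
        previous = df.shape[0]
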
.. GENERATED FROM PYTHON SOURCE LINES 92-97

Creating correspondence matrices for each entity type
================================================================

From this table of articles, we want to extract matrices that will map the
correspondence between these articles and the entities we want to use.

.. GENERATED FROM PYTHON SOURCE LINES 97-114

.. code-block:: Python


    from cartodata.loading import load_comma_separated_column  # noqa
    from cartodata.loading import load_identity_column  # noqa


    def create_correspondance_matrices(func, dataframes, column_name):
        matrices = []
        scores = []

        for i in range(len(dataframes)):
            mat, score = func(dataframes[i], column_name)

            matrices.append(mat)
            scores.append(score)

        return matrices, scores

.. GENERATED FROM PYTHON SOURCE LINES 115-123

The `load_comma_separated_column` function takes in a `dataframe` and the
`name of a column` and returns two objects:

- a sparse matrix
- a pandas Series

Each column of the sparse matrix corresponds to the entity specified by
`column_name` and each row corresponds to an article.

.. GENERATED FROM PYTHON SOURCE LINES 125-134

Authors
--------------

Let's start with the authors, for example. We want to create a matrix where
the rows represent the articles and the columns represent the authors. Each
cell (n, m) will have a 1 in it if the *nth* article was written by the *mth*
author.

As we have multiple dataframes, the results will be lists with one entry per
dataframe.

.. GENERATED FROM PYTHON SOURCE LINES 134-139

.. code-block:: Python


    matrices_auth, scores_auth = create_correspondance_matrices(
        load_comma_separated_column, dataframes, 'authFullName_s'
    )

.. GENERATED FROM PYTHON SOURCE LINES 140-141

We can see the number of (articles, authors) for each year:

.. GENERATED FROM PYTHON SOURCE LINES 141-145

.. code-block:: Python


    for i in range(len(matrices_auth)):
        print(f"{2010 + (i*2) + 2} => {matrices_auth[i].shape}")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    2012 => (765, 1666)
    2014 => (1361, 2338)
    2016 => (1983, 3135)
    2018 => (2514, 3882)
    2020 => (2945, 4554)
    2022 => (2987, 4652)

.. GENERATED FROM PYTHON SOURCE LINES 146-151

Each `score` in `scores_auth` is a series that contains the authors extracted
from the column `authFullName_s`, with a score equal to the number of rows
(articles) to which each author is mapped in the corresponding matrix of
`matrices_auth`.

Let's have a look at the number of articles per author for each year:

.. GENERATED FROM PYTHON SOURCE LINES 151-161

.. code-block:: Python


    df_scores_auth = pd.concat(scores_auth, axis=1).reset_index().rename(
        columns={i: f"{2010 + (i*2) + 2}" for i in range(6)}
    )
    df_scores_auth.head(20)
    ""
    df_scores_auth.describe()


.. 
raw:: html <div class="output_subarea output_html rendered_html output_result"> <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>2012</th> <th>2014</th> <th>2016</th> <th>2018</th> <th>2020</th> <th>2022</th> </tr> </thead> <tbody> <tr> <th>count</th> <td>1666.000000</td> <td>2338.000000</td> <td>3135.000000</td> <td>3882.000000</td> <td>4554.000000</td> <td>4652.00000</td> </tr> <tr> <th>mean</th> <td>2.206483</td> <td>2.465783</td> <td>2.590112</td> <td>2.685987</td> <td>2.705314</td> <td>2.69239</td> </tr> <tr> <th>std</th> <td>3.080791</td> <td>4.044985</td> <td>4.862247</td> <td>5.312939</td> <td>5.627630</td> <td>5.61770</td> </tr> <tr> <th>min</th> <td>1.000000</td> <td>1.000000</td> <td>1.000000</td> <td>1.000000</td> <td>1.000000</td> <td>1.00000</td> </tr> <tr> <th>25%</th> <td>1.000000</td> <td>1.000000</td> <td>1.000000</td> <td>1.000000</td> <td>1.000000</td> <td>1.00000</td> </tr> <tr> <th>50%</th> <td>1.000000</td> <td>1.000000</td> <td>1.000000</td> <td>1.000000</td> <td>1.000000</td> <td>1.00000</td> </tr> <tr> <th>75%</th> <td>2.000000</td> <td>3.000000</td> <td>2.000000</td> <td>2.000000</td> <td>2.000000</td> <td>2.00000</td> </tr> <tr> <th>max</th> <td>41.000000</td> <td>60.000000</td> <td>77.000000</td> <td>94.000000</td> <td>113.000000</td> <td>113.00000</td> </tr> </tbody> </table> </div> </div> <br /> <br /> .. GENERATED FROM PYTHON SOURCE LINES 162-167 Labs ---------- Similarly, we can create matrices for the labs by simply passing the `structAcronym_s` column to the function. .. GENERATED FROM PYTHON SOURCE LINES 167-171 .. code-block:: Python matrices_labs, scores_labs = create_correspondance_matrices( load_comma_separated_column, dataframes, 'structAcronym_s') .. GENERATED FROM PYTHON SOURCE LINES 172-173 We can see the number of (articles, labs) for each year: .. GENERATED FROM PYTHON SOURCE LINES 173-177 .. code-block:: Python for i in range(len(matrices_labs)): print(f"{2010 + (i*2) + 2} => {matrices_labs[i].shape}") .. rst-class:: sphx-glr-script-out .. code-block:: none 2012 => (765, 602) 2014 => (1361, 838) 2016 => (1983, 1041) 2018 => (2514, 1254) 2020 => (2945, 1426) 2022 => (2987, 1447) .. GENERATED FROM PYTHON SOURCE LINES 178-179 Let's have a look at the number of articles per author for each year: .. GENERATED FROM PYTHON SOURCE LINES 179-189 .. code-block:: Python df_scores_labs = pd.concat(scores_labs, axis=1).reset_index().rename( columns={i: f"{2010 + (i*2) + 2}" for i in range(10)} ) df_scores_labs.head(20) "" df_scores_labs.describe() .. 
raw:: html <div class="output_subarea output_html rendered_html output_result"> <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>2012</th> <th>2014</th> <th>2016</th> <th>2018</th> <th>2020</th> <th>2022</th> </tr> </thead> <tbody> <tr> <th>count</th> <td>602.000000</td> <td>838.00000</td> <td>1041.000000</td> <td>1254.000000</td> <td>1426.000000</td> <td>1447.000000</td> </tr> <tr> <th>mean</th> <td>15.131229</td> <td>19.21599</td> <td>22.337176</td> <td>23.566986</td> <td>24.131837</td> <td>24.230822</td> </tr> <tr> <th>std</th> <td>106.106441</td> <td>159.00098</td> <td>201.821286</td> <td>230.275677</td> <td>248.591470</td> <td>250.370755</td> </tr> <tr> <th>min</th> <td>1.000000</td> <td>1.00000</td> <td>1.000000</td> <td>1.000000</td> <td>1.000000</td> <td>1.000000</td> </tr> <tr> <th>25%</th> <td>1.000000</td> <td>1.00000</td> <td>1.000000</td> <td>1.000000</td> <td>1.000000</td> <td>1.000000</td> </tr> <tr> <th>50%</th> <td>2.000000</td> <td>2.00000</td> <td>2.000000</td> <td>2.000000</td> <td>2.000000</td> <td>2.000000</td> </tr> <tr> <th>75%</th> <td>4.000000</td> <td>5.00000</td> <td>6.000000</td> <td>6.000000</td> <td>6.000000</td> <td>7.000000</td> </tr> <tr> <th>max</th> <td>1712.000000</td> <td>3001.00000</td> <td>4218.000000</td> <td>5300.000000</td> <td>6183.000000</td> <td>6297.000000</td> </tr> </tbody> </table> </div> </div> <br /> <br /> .. GENERATED FROM PYTHON SOURCE LINES 190-197 Filtering low score entities =============================== A lot of the authors and labs that we just extracted from the dataframe have a very low score, which means they're only linked to one or two articles. To improve the quality of our data, we'll filter the entities by removing those that appear less than certain number of times. .. GENERATED FROM PYTHON SOURCE LINES 197-221 .. code-block:: Python from cartodata.operations import filter_min_score # noqa def filter_low_score(matrices, scores, count, entity): filtered_matrices = [] filtered_scores = [] for i, (matrix, score) in enumerate(zip(matrices, scores)): before = len(score) filtered_mat, filtered_score = filter_min_score(matrix, score, count) filtered_matrices.append(filtered_mat) filtered_scores.append(filtered_score) print(f'Removed {before - len(filtered_score)} {entity}s with less than ' + f'{count} articles from a total of {before} {entity}s.') print(f'Working with {len(filtered_score)} {entity} for year {2010 + (i*2) + 2}.\n') return filtered_matrices, filtered_scores .. GENERATED FROM PYTHON SOURCE LINES 222-226 Authors ------------- We will remove the authors with less than 5 publications. .. GENERATED FROM PYTHON SOURCE LINES 226-232 .. code-block:: Python auth_pub_count = 5 filtered_auth_matrices, filtered_auth_scores = filter_low_score( matrices_auth, scores_auth, auth_pub_count, "author") .. rst-class:: sphx-glr-script-out .. code-block:: none Removed 1578 authors with less than 5 articles from a total of 1666 authors. Working with 88 author for year 2012. Removed 2179 authors with less than 5 articles from a total of 2338 authors. Working with 159 author for year 2014. Removed 2887 authors with less than 5 articles from a total of 3135 authors. Working with 248 author for year 2016. Removed 3566 authors with less than 5 articles from a total of 3882 authors. 
Working with 316 author for year 2018. Removed 4168 authors with less than 5 articles from a total of 4554 authors. Working with 386 author for year 2020. Removed 4258 authors with less than 5 articles from a total of 4652 authors. Working with 394 author for year 2022. .. GENERATED FROM PYTHON SOURCE LINES 233-237 Labs --------- We will remove the labs with less than 20 publications. .. GENERATED FROM PYTHON SOURCE LINES 237-243 .. code-block:: Python lab_pub_count = 20 filtered_lab_matrices, filtered_lab_scores = filter_low_score( matrices_labs, scores_labs, lab_pub_count, "labs") .. rst-class:: sphx-glr-script-out .. code-block:: none Removed 566 labss with less than 20 articles from a total of 602 labss. Working with 36 labs for year 2012. Removed 762 labss with less than 20 articles from a total of 838 labss. Working with 76 labs for year 2014. Removed 938 labss with less than 20 articles from a total of 1041 labss. Working with 103 labs for year 2016. Removed 1131 labss with less than 20 articles from a total of 1254 labss. Working with 123 labs for year 2018. Removed 1286 labss with less than 20 articles from a total of 1426 labss. Working with 140 labs for year 2020. Removed 1305 labss with less than 20 articles from a total of 1447 labss. Working with 142 labs for year 2022. .. GENERATED FROM PYTHON SOURCE LINES 244-252 Words -------------- For the words, it's a bit trickier because we want to extract n-grams (groups of n terms) instead of just comma separated values. We'll call the load_text_column which uses scikit-learn's CountVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>_ to create a vocabulary and map the tokens. .. GENERATED FROM PYTHON SOURCE LINES 252-289 .. code-block:: Python from cartodata.loading import load_text_column # noqa from sklearn.feature_extraction import text as sktxt # noqa from cartodata.operations import normalize_tfidf # noqa def process_words(dataframes, stopwords): words_matrices = [] words_scores = [] for df in dataframes: df['text'] = df['en_abstract_s'] + ' ' \ + df['en_keyword_s'].astype(str) + ' ' \ + df['en_title_s'].astype(str) + ' ' \ + df['en_domainAllCodeLabel_fs'].astype(str) words_mat, words_score = load_text_column(df['text'], 4, 10, 0.05, stopwords=stopwords) # apply term-frequency times inverse document-frequency normalization words_mat = normalize_tfidf(words_mat) words_matrices.append(words_mat) words_scores.append(words_score) return words_matrices, words_scores with open('../datas/stopwords.txt', 'r') as stop_file: stopwords = sktxt.ENGLISH_STOP_WORDS.union( set(stop_file.read().splitlines())) words_matrices, words_scores = process_words(dataframes, stopwords) .. GENERATED FROM PYTHON SOURCE LINES 290-291 We can list number of terms per article for each year: .. GENERATED FROM PYTHON SOURCE LINES 291-295 .. code-block:: Python for i in range(len(words_matrices)): print(f"{2010 + (i*2) + 2} => {words_matrices[i].shape}") .. rst-class:: sphx-glr-script-out .. code-block:: none 2012 => (765, 1015) 2014 => (1361, 1801) 2016 => (1983, 2483) 2018 => (2514, 3045) 2020 => (2945, 3512) 2022 => (2987, 3560) .. GENERATED FROM PYTHON SOURCE LINES 296-300 Articles ------------ Finally, we need to create a matrix that simply maps each article to itself. .. GENERATED FROM PYTHON SOURCE LINES 300-308 .. 
code-block:: Python


    matrices_article, scores_article = create_correspondance_matrices(
        load_identity_column, dataframes, 'en_title_s')
    ""
    for i in range(len(matrices_article)):
        print(f"{2010 + (i*2) + 2} => {matrices_article[i].shape}")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    2012 => (765, 765)
    2014 => (1361, 1361)
    2016 => (1983, 1983)
    2018 => (2514, 2514)
    2020 => (2945, 2945)
    2022 => (2987, 2987)

.. GENERATED FROM PYTHON SOURCE LINES 309-331

Dimension reduction
====================

One way to see the matrices that we created is as coordinates in the space of
all articles. What we want to do is to reduce the dimension of this space to
make it easier to work with and to visualize.

LSA projection
--------------------------

We'll start by using the LSA (Latent Semantic Analysis) technique to identify
keywords in our data and thus reduce the number of rows in our matrices.

The `lsa_projection` function takes three arguments:

- the number of dimensions you want to keep
- the documents/words frequency matrix
- a list of matrices to project

It returns a list of the same length containing the matrices projected in the
latent space.

We also apply an l2 normalization to each feature of the projected matrices.

.. GENERATED FROM PYTHON SOURCE LINES 331-360

.. code-block:: Python


    from cartodata.projection import lsa_projection  # noqa
    from cartodata.operations import normalize_l2  # noqa

    dimensions = [12, 12, 14, 16, 20, 20]


    def run_lsa_projection(articles_matrices, auth_matrices, words_matrices,
                           lab_matrices, dimensions):
        lsa_matrices = []

        for articles_mat, auth_mat, words_mat, labs_mat, dim in zip(
                articles_matrices, auth_matrices, words_matrices,
                lab_matrices, dimensions
        ):
            lsa_mat = lsa_projection(
                dim, words_mat, [articles_mat, auth_mat, words_mat, labs_mat]
            )
            lsa_mat = list(map(normalize_l2, lsa_mat))
            lsa_matrices.append(lsa_mat)

        return lsa_matrices


    lsa_matrices = run_lsa_projection(matrices_article, filtered_auth_matrices,
                                      words_matrices, filtered_lab_matrices,
                                      dimensions)

.. GENERATED FROM PYTHON SOURCE LINES 361-362

List (dimensions, number_of_articles) per year:

.. GENERATED FROM PYTHON SOURCE LINES 362-366

.. code-block:: Python


    for i in range(len(lsa_matrices)):
        print(f"{2010 + (i*2) + 2} => {lsa_matrices[i][0].shape}")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    2012 => (12, 765)
    2014 => (12, 1361)
    2016 => (14, 1983)
    2018 => (16, 2514)
    2020 => (20, 2945)
    2022 => (20, 2987)

.. GENERATED FROM PYTHON SOURCE LINES 367-368

List (dimensions, number_of_authors) per year:

.. GENERATED FROM PYTHON SOURCE LINES 368-372

.. code-block:: Python


    for i in range(len(lsa_matrices)):
        print(f"{2010 + (i*2) + 2} => {lsa_matrices[i][1].shape}")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    2012 => (12, 88)
    2014 => (12, 159)
    2016 => (14, 248)
    2018 => (16, 316)
    2020 => (20, 386)
    2022 => (20, 394)

.. GENERATED FROM PYTHON SOURCE LINES 373-374

List (dimensions, number_of_words) per year:

.. GENERATED FROM PYTHON SOURCE LINES 374-378

.. code-block:: Python


    for i in range(len(lsa_matrices)):
        print(f"{2010 + (i*2) + 2} => {lsa_matrices[i][2].shape}")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    2012 => (12, 1015)
    2014 => (12, 1801)
    2016 => (14, 2483)
    2018 => (16, 3045)
    2020 => (20, 3512)
    2022 => (20, 3560)

.. GENERATED FROM PYTHON SOURCE LINES 379-380

List (dimensions, number_of_labs) per year:

.. GENERATED FROM PYTHON SOURCE LINES 380-384

.. code-block:: Python


    for i in range(len(lsa_matrices)):
        print(f"{2010 + (i*2) + 2} => {lsa_matrices[i][3].shape}")

.. 
rst-class:: sphx-glr-script-out .. code-block:: none 2012 => (12, 36) 2014 => (12, 76) 2016 => (14, 103) 2018 => (16, 123) 2020 => (20, 140) 2022 => (20, 142) .. GENERATED FROM PYTHON SOURCE LINES 385-391 Aligned UMAP projection =========================== To use aligned UMAP, we need to create relations between each consecutive dataset that maps each entity index (article, author, word, lab) in one dataset to corresponding entity index in the following dataset. We will also create color mappings to view each entity in consecutive maps with the same color. .. GENERATED FROM PYTHON SOURCE LINES 391-463 .. code-block:: Python import numpy as np # noqa def make_relation(from_df, to_df): # create a new dataframe with index from from_df, and values as integers # starting from 0 to the length of from_df left = pd.DataFrame(data=np.arange(len(from_df)), index=from_df.index) # create a new dataframe with index from to_df, and values as integers # starting from 0 to the length of to_df right = pd.DataFrame(data=np.arange(len(to_df)), index=to_df.index) # merge left and right dataframes on the intersection of keys of both # dataframes preserving the order of left keys merge = pd.merge(left, right, how="inner", left_index=True, right_index=True) return dict(merge.values) def generate_relations(filtered_scores): relations = [] for i in range(len(filtered_scores) - 1): relation = make_relation(filtered_scores[i], filtered_scores[i+1]) relations.append(relation) return relations def generate_colors(filtered_scores): colors = [] for i in range(len(filtered_scores)): color = make_relation(filtered_scores[i], filtered_scores[-1]) colors.append(color) return colors def concat_scores(scores_article, scores_auth, scores_word, scores_lab): """Concatenates article, auth, words and labs score for each year and returns the concatenated frames in a list. """ concatenated_scores = [] for score_article, score_auth, score_word, score_lab in zip( scores_article, scores_auth, scores_word, scores_lab ): concatenated_scores.append( pd.concat([score_article, score_auth, score_word, score_lab])) return concatenated_scores filtered_scores = concat_scores( scores_article, filtered_auth_scores, words_scores, filtered_lab_scores) relations = generate_relations(filtered_scores) colors_auth = generate_colors(filtered_auth_scores) colors_words = generate_colors(words_scores) colors_lab = generate_colors(filtered_lab_scores) colors_articles = generate_colors(scores_article) .. GENERATED FROM PYTHON SOURCE LINES 464-465 We should get transpose of each lsa matrix to be able to process by UMAP. .. GENERATED FROM PYTHON SOURCE LINES 465-477 .. code-block:: Python def get_transpose(lsa_matrices): transposed_mats = [] for lsa in lsa_matrices: transposed_mats.append(np.hstack(lsa).T) return transposed_mats transposed_mats = get_transpose(lsa_matrices) .. GENERATED FROM PYTHON SOURCE LINES 478-479 We create an AlignedUMAP instance to generate 6 maps for the 6 datasets. .. GENERATED FROM PYTHON SOURCE LINES 479-495 .. code-block:: Python from umap.aligned_umap import AlignedUMAP # noqa n_neighbors = [10, 30, 30, 50, 70, 100] min_dists = [0.1, 0.3, 0.3, 0.5, 0.7, 0.7] spreads = [1, 1, 1, 1, 1, 1] n = 6 reducer = AlignedUMAP( n_neighbors=n_neighbors[:n], min_dist=min_dists[:n], # n_neighbors=10, # min_dist=0.05, init='random', n_epochs=200) .. GENERATED FROM PYTHON SOURCE LINES 496-497 Aligned UMAP requires at least two datasets and the mapping between the entities of the datasets. 
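
To make the role of these relations concrete, the short sketch below (it only
uses the `relations` and `filtered_scores` objects built above) prints a few
entries of the first mapping: each key is a row index in the 2010-2012
dataset and each value is the row index of the same entity in the 2010-2014
dataset.

.. code-block:: Python


    # Peek at the mapping between the first two datasets: positions on the
    # left (2010-2012) should point to the same entity on the right (2010-2014).
    first_relation = relations[0]
    for key in list(first_relation)[:5]:
        value = first_relation[key]
        print(key, "->", value, ":",
              filtered_scores[0].index[key], "==", filtered_scores[1].index[value])
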
We can change the number of initial datasets changing the value of the variable `n`. .. GENERATED FROM PYTHON SOURCE LINES 497-501 .. code-block:: Python reducer.fit_transform(transposed_mats[:n], relations=relations[:n-1]) .. rst-class:: sphx-glr-script-out .. code-block:: none ListType[array(float32, 2d, C)]([[[-4.6312943 2.1897452 ] [-4.7019587 2.4623086 ] [ 3.172899 -1.254874 ] ... [-0.44537532 0.0364257 ] [-0.3615355 0.10424785] [ 0.9505725 -4.1422253 ]], [[-4.4378047 1.8940356] [-4.6877837 2.1463704] [ 1.8862357 -5.3941364] ... [ 3.8317637 -3.2522857] [ 1.7630607 2.5508578] [ 1.7112528 2.5630774]], [[-4.4434514 1.7681804] [-4.7892733 2.054415 ] [ 1.5745841 -5.694587 ] ... [-2.4695678 0.777273 ] [ 1.6377203 2.8798878] [ 1.9659495 2.442149 ]], [[-4.6376925 1.53457 ] [-4.694312 1.5876851 ] [ 2.19716 -5.3580184 ] ... [-3.5690544 -1.2518331 ] [ 2.4115925 -5.407163 ] [ 0.04822341 -5.1132517 ]], [[-4.5427794 1.2628535 ] [-4.617718 1.3386154 ] [ 2.430242 -5.260712 ] ... [ 0.05606826 -5.724693 ] [-1.1132172 -3.6751866 ] [-1.1428653 0.9819731 ]], [[-4.5396147 1.2146386 ] [-4.6235642 1.2929305 ] [ 2.5026178 -5.452615 ] ... [-0.9563716 -3.7903018 ] [ 1.0241965 0.85914576] [ 0.7459364 -3.1982822 ]]]) .. GENERATED FROM PYTHON SOURCE LINES 502-503 Then we will generate maps for the remaining dataset one by one by feeding the reducer with the corresponding matrix and relation dictionary. .. GENERATED FROM PYTHON SOURCE LINES 503-522 .. code-block:: Python def update_reducer(reducer, matrices, relations, n_neighbors, min_dists, spreads): for mat, rel, n_neighbor, min_dist, spread in zip( matrices, relations, n_neighbors, min_dists, spreads ): # reducer.update(mat, relations=rel) reducer.update(mat, relations=rel, n_neighbors=n_neighbor, min_dist=min_dist, spread=spread, verbose=True) update_reducer(reducer, transposed_mats[n:], relations[n-1:], n_neighbors[n:], min_dists[n:], spreads[n:]) "" for embedding in reducer.embeddings_: print(embedding.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none (1904, 2) (3397, 2) (4817, 2) (5998, 2) (6983, 2) (7083, 2) .. GENERATED FROM PYTHON SOURCE LINES 523-527 Pickle reducer =============== It is possible to serialize and save the reducer object for later use. .. GENERATED FROM PYTHON SOURCE LINES 527-550 .. code-block:: Python # From https://github.com/lmcinnes/umap/issues/672 from numba.typed import List # noqa import pickle # noqa params = reducer.get_params() attributes_names = [attr for attr in reducer.__dir__( ) if attr not in params and attr[0] != '_'] attributes = {key: reducer.__getattribute__(key) for key in attributes_names} attributes['embeddings_'] = list(reducer.embeddings_) for x in ['fit', 'fit_transform', 'update', 'get_params', 'set_params', "get_metadata_routing"]: del attributes[x] all_params = { 'umap_params': params, 'umap_attributes': {key: value for key, value in attributes.items()} } pickle.dump(all_params, open('pickled_reducer.pkl', 'wb')) .. GENERATED FROM PYTHON SOURCE LINES 551-555 Reload from pickle ==================== We can reload the reducer: .. GENERATED FROM PYTHON SOURCE LINES 555-566 .. code-block:: Python params_new = pickle.load(open('pickled_reducer.pkl', 'rb')) reducer = AlignedUMAP() reducer.set_params(**params_new.get('umap_params')) for attr, value in params_new.get('umap_attributes').items(): reducer.__setattr__(attr, value) reducer.__setattr__('embeddings_', List( params_new.get('umap_attributes').get('embeddings_'))) .. 
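
As an optional sanity check (using only the objects defined above), we can
verify that the embeddings carried by the reloaded reducer are identical to
the ones stored in the pickle:

.. code-block:: Python


    import numpy as np  # noqa

    # Each reloaded embedding should match the pickled one exactly.
    for reloaded, pickled in zip(reducer.embeddings_,
                                 params_new['umap_attributes']['embeddings_']):
        assert reloaded.shape == pickled.shape
        assert np.allclose(reloaded, pickled)

    print("Reloaded embeddings match the pickled ones.")

.. 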
GENERATED FROM PYTHON SOURCE LINES 567-571 Plot maps ============= To plot the maps, we will reorganize the data. .. GENERATED FROM PYTHON SOURCE LINES 571-608 .. code-block:: Python def update_embeddings_with_entities(embeddings, filtered_scores): r_embeddings = [] for embedding, score in zip(embeddings, filtered_scores): score_indices = np.matrix(score.index).T embd = np.hstack((embedding, score_indices)).T r_embeddings.append(np.asarray(embd)) return r_embeddings def decompose_embeddings(embeddings): article_embeddings = [] auth_embeddings = [] word_embeddings = [] lab_embeddings = [] for i, embedding in enumerate(embeddings): len_article = len(scores_article[i]) len_auth = len(filtered_auth_scores[i]) len_word = len(words_scores[i]) article_embeddings.append(embeddings[i][:len_article]) auth_embeddings.append( embeddings[i][len_article: len_article + len_auth]) word_embeddings.append( embeddings[i][len_article + len_auth:len_article + len_auth + len_word]) lab_embeddings.append( embeddings[i][len_article + len_auth + len_word:]) return article_embeddings, auth_embeddings, word_embeddings, lab_embeddings article_embeddings, auth_embeddings, word_embeddings, lab_embeddings = decompose_embeddings( reducer.embeddings_) .. GENERATED FROM PYTHON SOURCE LINES 609-610 We will add values to embeddings, so that we can visualize article title, author name, word or lab name when the entity is hovered on the plot. .. GENERATED FROM PYTHON SOURCE LINES 610-627 .. code-block:: Python article_embeddings_names = update_embeddings_with_entities( article_embeddings, scores_article ) auth_embeddings_names = update_embeddings_with_entities( auth_embeddings, filtered_auth_scores ) word_embeddings_names = update_embeddings_with_entities( word_embeddings, words_scores ) lab_embeddings_names = update_embeddings_with_entities( lab_embeddings, filtered_lab_scores ) .. GENERATED FROM PYTHON SOURCE LINES 628-629 We will create a colormap that will map each entity (article, author, word, lab) to same color in every plot. .. GENERATED FROM PYTHON SOURCE LINES 629-643 .. code-block:: Python from matplotlib import cm # noqa def create_color_map(max_list): cmap = cm.get_cmap('gist_ncar', len(max_list)) return cmap cmap_article = create_color_map(scores_article[-1]) cmap_auth = create_color_map(filtered_auth_scores[-1]) cmap_word = create_color_map(words_scores[-1]) cmap_lab = create_color_map(filtered_lab_scores[-1]) .. rst-class:: sphx-glr-script-out .. code-block:: none /builds/2mk6rsew/0/hgozukan/cartolabe-data/examples/workflow_aligned_lisn_lsa_kmeans.py:634: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead. cmap = cm.get_cmap('gist_ncar', len(max_list)) .. GENERATED FROM PYTHON SOURCE LINES 644-645 Now we can plot all the maps. .. GENERATED FROM PYTHON SOURCE LINES 645-735 .. 
code-block:: Python import matplotlib.pyplot as plt # noqa import numpy as np # noqa import mplcursors # noqa plt.close('all') def axis_bounds(embedding): left, right = embedding.T[0].min(), embedding.T[0].max() bottom, top = embedding.T[1].min(), embedding.T[1].max() adj_h, adj_v = (right - left) * 0.1, (top - bottom) * 0.1 return [left - adj_h, right + adj_h, bottom - adj_v, top + adj_v] ax_bound = axis_bounds(np.vstack(reducer.embeddings_)) ax_bound def plot(title, embedding_article, embedding_auth, embedding_word, embedding_lab, colors_article, colors_auth, colors_word, colors_lab, c_scores=None, c_umap=None, n_clusters=None): colors_article_list = list(colors_article.values()) colors_auth_list = list(colors_auth.values()) colors_word_list = list(colors_word.values()) colors_lab_list = list(colors_lab.values()) fig, ax = plt.subplots(1, 1, figsize=(10, 9)) ax.set_title(title) ax.axis(ax_bound) article = ax.scatter(embedding_article[0], embedding_article[1], c=colors_article_list, cmap=cmap_article, vmin=0, vmax=len(scores_article[-1]), marker='x', label="article") auth = ax.scatter(embedding_auth[0], embedding_auth[1], c=colors_auth_list, cmap=cmap_auth, vmin=0, vmax=len(filtered_auth_scores[-1]), label="auth") word = ax.scatter(embedding_word[0], embedding_word[1], vmin=0, vmax=len(words_scores[-1]), marker='+', label="word") lab = ax.scatter(embedding_lab[0], embedding_lab[1], c=colors_lab_list, cmap=cmap_lab, vmin=0, vmax=len(filtered_lab_scores[-1]), marker='s', label="lab") # from https://matplotlib.org/stable/gallery/event_handling/legend_picking.html leg = ax.legend((article, auth, word, lab), ("article", "auth", "words", "labs"), fancybox=True, shadow=True) if n_clusters is not None: for i in range(n_clusters): ax.annotate(c_scores.index[i], (c_umap[0, i], c_umap[1, i])) # crs_article = mplcursors.cursor(article, hover=True) # @crs_article.connect("add") # def on_add(sel): # sel.annotation.set(text=embedding_article[2][sel.target.index]) crs_auth = mplcursors.cursor(auth, hover=True) @crs_auth.connect("add") def on_add(sel): sel.annotation.set(text=embedding_auth[2][sel.target.index]) # crs_word = mplcursors.cursor(word, hover=True) # @crs_word.connect("add") # def on_add(sel): # sel.annotation.set(text=embedding_word[2][sel.target.index]) crs_lab = mplcursors.cursor(lab, hover=True) @crs_lab.connect("add") def on_add(sel): sel.annotation.set(text=embedding_lab[2][sel.target.index]) plt.show() for i, (embedding_article, embedding_auth, embedding_word, embedding_lab, color_article, color_auth, color_word, color_lab) in enumerate(zip( article_embeddings_names, auth_embeddings_names, word_embeddings_names, lab_embeddings_names, colors_articles, colors_auth, colors_words, colors_lab)): plot(f"Year {2010 + (i * 2) + 2}", embedding_article, embedding_auth, embedding_word, embedding_lab, color_article, color_auth, color_word, color_lab) .. rst-class:: sphx-glr-horizontal * .. image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_001.png :alt: Year 2012 :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_001.png :class: sphx-glr-multi-img * .. image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_002.png :alt: Year 2014 :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_002.png :class: sphx-glr-multi-img * .. 
image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_003.png :alt: Year 2016 :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_003.png :class: sphx-glr-multi-img * .. image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_004.png :alt: Year 2018 :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_004.png :class: sphx-glr-multi-img * .. image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_005.png :alt: Year 2020 :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_005.png :class: sphx-glr-multi-img * .. image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_006.png :alt: Year 2022 :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_006.png :class: sphx-glr-multi-img .. rst-class:: sphx-glr-script-out .. code-block:: none /usr/local/lib/python3.9/site-packages/mplcursors/_pick_info.py:55: UserWarning: No data for colormapping provided via 'c'. Parameters 'vmin', 'vmax' will be ignored paths = scatter.__wrapped__(*args, **kwargs) .. GENERATED FROM PYTHON SOURCE LINES 736-742 Clustering ========== In order to identify clusters, we use the KMeans clustering technique on the articles. We'll also try to label these clusters by selecting the most frequent words that appear in each cluster's articles. .. GENERATED FROM PYTHON SOURCE LINES 742-777 .. code-block:: Python from cartodata.clustering import create_kmeans_clusters # noqa n_clusters = 8 def find_clusters(n_clusters, article_embeddings, word_embeddings, words_matrices, words_scores, matrices): c_scores_all = [] c_umap_all = [] for i in range(6): cluster_labels = [] c_lda, c_umap, c_scores, c_knn, _, _, _ = create_kmeans_clusters(n_clusters, # number of clusters to create # 2D matrix of articles article_embeddings[i].T, # the 2D matrix of words word_embeddings[i].T, # the articles to words matrix words_matrices[i], # word scores words_scores[i], # a list of initial cluster labels cluster_labels, # space matrix of words matrices[i][2]) c_scores_all.append(c_scores) c_umap_all.append(c_umap) return c_scores_all, c_umap_all c_scores_all, c_umap_all = find_clusters(n_clusters, article_embeddings, word_embeddings, words_matrices, words_scores, lsa_matrices) .. GENERATED FROM PYTHON SOURCE LINES 778-779 Let's plot with the clusters. .. GENERATED FROM PYTHON SOURCE LINES 779-788 .. code-block:: Python for i, (embedding_article, embedding_auth, embedding_word, embedding_lab, color_article, color_auth, color_word, color_lab, c_scores, c_umap) in enumerate(zip( article_embeddings_names, auth_embeddings_names, word_embeddings_names, lab_embeddings_names, colors_articles, colors_auth, colors_words, colors_lab, c_scores_all, c_umap_all)): plot(f"Year {2010+ (i*2) + 2}", embedding_article, embedding_auth, embedding_word, embedding_lab, color_article, color_auth, color_word, color_lab, c_scores, c_umap, n_clusters) .. rst-class:: sphx-glr-horizontal * .. image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_007.png :alt: Year 2012 :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_007.png :class: sphx-glr-multi-img * .. image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_008.png :alt: Year 2014 :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_008.png :class: sphx-glr-multi-img * .. 
image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_009.png
         :alt: Year 2016
         :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_009.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_010.png
         :alt: Year 2018
         :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_010.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_011.png
         :alt: Year 2020
         :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_011.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_012.png
         :alt: Year 2022
         :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_012.png
         :class: sphx-glr-multi-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    /usr/local/lib/python3.9/site-packages/mplcursors/_pick_info.py:55: UserWarning: No data for colormapping provided via 'c'. Parameters 'vmin', 'vmax' will be ignored
      paths = scatter.__wrapped__(*args, **kwargs)

.. GENERATED FROM PYTHON SOURCE LINES 789-794

The 8 clusters that we created give us a general idea of what the main groups
of data contain. But we'll probably want a finer level of detail if we start
to zoom in and focus on smaller areas. So we'll also create a second, larger
set of clusters. To do this, we simply increase the number of clusters we
want.

.. GENERATED FROM PYTHON SOURCE LINES 794-801

.. code-block:: Python


    n_clusters = 32

    mc_scores_all, mc_umap_all = find_clusters(
        n_clusters, article_embeddings, word_embeddings, words_matrices,
        words_scores, lsa_matrices
    )

.. GENERATED FROM PYTHON SOURCE LINES 802-819

Nearest neighbors
------------------------

Another useful way to assess the quality of our data is to look at each
point's nearest neighbors. If our data processing is done correctly, we expect
the related articles, labs, words and authors to be located close to each
other.

Finding nearest neighbors is a common task for which various algorithms exist.
The `get_neighbors` function uses one of these algorithms to find the nearest
points of each type. It takes an optional weight parameter that tweaks the
distance calculation so that points with a higher score are favored, even if
they are a bit farther away, instead of simply selecting the closest
neighbors.

Because we want to find the neighbors of each type (articles, authors, words,
labs) for all of the entities, we call the `get_neighbors` function in a loop
and store its results in a list.

.. GENERATED FROM PYTHON SOURCE LINES 819-853

.. code-block:: Python


    from cartodata.neighbors import get_neighbors  # noqa


    def find_neighbors(articles_scores, authors_scores, words_scores,
                       labs_scores, matrices):
        weights = [0, 0.5, 0.5, 0]

        all_neighbors = []
        all_scores = []

        for i in range(6):
            scores = [articles_scores[i], authors_scores[i], words_scores[i],
                      labs_scores[i]]
            neighbors = []
            matrix = matrices[i]

            for idx in range(len(matrix)):
                neighbors.append(get_neighbors(matrix[idx], scores[idx],
                                               matrix, weights[idx]))

            all_neighbors.append(neighbors)
            all_scores.append(scores)

        return all_neighbors, all_scores


    all_neighbors, all_scores = find_neighbors(scores_article,
                                               filtered_auth_scores,
                                               words_scores,
                                               filtered_lab_scores,
                                               lsa_matrices)

.. GENERATED FROM PYTHON SOURCE LINES 854-858

Exporting
-----------------

We now have sufficient data to create a meaningful visualization.

.. GENERATED FROM PYTHON SOURCE LINES 858-899

.. 
code-block:: Python


    from cartodata.operations import export_to_json  # noqa

    natures = ['articles', 'authors', 'words', 'labs', 'hl_clusters',
               'ml_clusters']


    def export(from_year, struct, article_embeddings, authors_embeddings,
               word_embeddings, lab_embeddings, c_umap, mc_umap, c_scores,
               mc_scores, all_neighbors, all_scores):

        for i in range(6):
            export_file = f"../datas/{struct}_workflow_lsa_aligned_{from_year}_{from_year + (2 + 2 * i)}.json"

            # add the clusters to list of 2d matrices and scores
            umap_matrices = [article_embeddings[i].T, authors_embeddings[i].T,
                             word_embeddings[i].T, lab_embeddings[i].T,
                             c_umap[i], mc_umap[i]]
            all_scores[i].extend([c_scores[i], mc_scores[i]])

            # create a json export file with all the infos
            export_to_json(natures, umap_matrices, all_scores[i], export_file,
                           neighbors_natures=natures[:4],
                           neighbors=all_neighbors[i])


    export(2010, "lisn", article_embeddings, auth_embeddings, word_embeddings,
           lab_embeddings, c_umap_all, mc_umap_all, c_scores_all, mc_scores_all,
           all_neighbors, all_scores)

.. GENERATED FROM PYTHON SOURCE LINES 900-913

This creates the files:

- `lisn_workflow_lsa_aligned_2010_2012.json`
- `lisn_workflow_lsa_aligned_2010_2014.json`
- `lisn_workflow_lsa_aligned_2010_2016.json`
- `lisn_workflow_lsa_aligned_2010_2018.json`
- `lisn_workflow_lsa_aligned_2010_2020.json`
- `lisn_workflow_lsa_aligned_2010_2022.json`

each of which contains a list of points ready to be imported into Cartolabe.
Have a look at one of them to check that it contains everything.

.. GENERATED FROM PYTHON SOURCE LINES 913-922

.. code-block:: Python


    import json  # noqa

    export_file = "../datas/lisn_workflow_lsa_aligned_2010_2022.json"

    with open(export_file, 'r') as f:
        data = json.load(f)

    data[1]

.. rst-class:: sphx-glr-script-out

.. code-block:: none


    {'position': [-4.62356424331665, 1.2929304838180542], 'score': 1.0, 'rank': 1, 'nature': 'articles', 'label': 'Bandit-Based Genetic Programming with Application to Reinforcement Learning', 'neighbors': {'articles': [1, 216, 48, 443, 952, 1596, 2623, 78, 323, 1552], 'authors': [3044, 3034, 3026, 3046, 3152, 2987, 3086, 3088, 3043, 3028], 'words': [3767, 5394, 5395, 5396, 3768, 6285, 6340, 6339, 6303, 6230], 'labs': [7032, 6944, 7011, 6989, 7018, 7017, 6985, 7016, 7030, 7076]}}

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (15 minutes 59.159 seconds)


.. _sphx_glr_download_auto_examples_workflow_aligned_lisn_lsa_kmeans.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: workflow_aligned_lisn_lsa_kmeans.ipynb <workflow_aligned_lisn_lsa_kmeans.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: workflow_aligned_lisn_lsa_kmeans.py <workflow_aligned_lisn_lsa_kmeans.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: workflow_aligned_lisn_lsa_kmeans.zip <workflow_aligned_lisn_lsa_kmeans.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_