Cartolabe-data

Cartolabe-data is the data processing part of the Cartolabe project. It contains utility functions to:

  • retrieve data from the HAL open archive API

  • extract entities (authors, teams, labs, words) from a set of documents

  • reduce dimensions and project on a 2D space

  • create named clusters

  • identify nearest neighbors for each entity

Installation

Note: We recommend using a Python environment manager such as conda or virtualenv.

First clone the source code:

git clone https://gitlab.inria.fr/cartolabe/cartolabe-data.git
cd cartolabe-data

Then install cartolabe-data in a Conda environment or a Python virtual environment, using one of the options below.

To create a Conda environment from the `environment.yml` file:

conda env create -f environment.yml
conda activate cartolabe-data   # activate environment

This creates a conda environment named cartolabe-data and installs the cartolabe-data package.

Alternatively, to create a Conda environment manually:

conda create -n cartodata-310 python==3.10.9
conda activate cartodata-310
pip install -e .

If nmslib is not installed, install it with:

conda install -c conda-forge nmslib

If hdbscan is not installed, install it with:

conda install -c conda-forge hdbscan
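
To find out whether nmslib and hdbscan are already available in the active environment, a small check along these lines may help. This is only a sketch: the helper name `missing_packages` is hypothetical and not part of cartolabe-data.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    # Hypothetical helper for illustration; not part of cartolabe-data.
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(["nmslib", "hdbscan"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("nmslib and hdbscan are both importable.")
```

Run it with the interpreter of the environment you just created, then install whatever it reports as missing with the conda commands above.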

To create a Python virtual environment:

python3.10 -m venv cartolabe_data    # venv takes no version argument; call the Python 3.10 interpreter directly
. cartolabe_data/bin/activate    # activate environment

After creating the Python virtual environment, install the cartolabe-data package by running the following command from the project root directory:

pip install -e .

Run the tests

pip install -e ".[test]"    # quotes keep shells such as zsh from expanding the brackets
pytest

Notebooks

The best way to get started with cartolabe-data is to run through the set of example notebooks in the examples directory.

To run the examples:

pip install -e ".[examples]"    # quotes keep shells such as zsh from expanding the brackets
cd examples
jupyter notebook

Docker

It is also possible to run cartolabe-data from the Docker image without cloning or installing it. Docker must be installed on your host.

To run an interactive container from the image:

docker run -it --network=host registry.gitlab.inria.fr/cartolabe/cartolabe-data:latest

From the command line provided by the container, you can execute the CLI commands or start the Jupyter notebooks with:

jupyter notebook

Then open the HTTP link printed by Jupyter in your browser.

The notebooks are in the examples directory.

CLI commands

Once installed, the cartolabe-data package provides command-line scripts which can be executed in a terminal.

fetch-data

The fetch-data command extracts data from the HAL open archive. It takes three optional parameters:

  • -s <str> the research organization whose publications should be fetched

  • -f <int> the minimum publication year

  • -t <int> the maximum publication year

To fetch articles published by the CNRS between 2010 and 2016, activate the environment where you installed the package and run:

cartodata fetch-data -s CNRS -f 2010 -t 2016
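
To make the three filters more concrete, here is a rough sketch of how an equivalent request to the public HAL search API could be built. It is an illustration only and does not reproduce what fetch-data actually sends: the function name `build_hal_query` is hypothetical, and the choice of the HAL fields `structAcronym_s` and `producedDateY_i` for the organization and year filters is an assumption.

```python
from urllib.parse import urlencode

HAL_SEARCH_URL = "https://api.archives-ouvertes.fr/search/"


def build_hal_query(struct=None, year_from=None, year_to=None, rows=10):
    """Build a HAL search URL; hypothetical helper, not the cartodata implementation."""
    params = [("q", "*:*"), ("wt", "json"), ("rows", str(rows))]
    if struct is not None:
        # Assumption: filter on the structure acronym field.
        params.append(("fq", f"structAcronym_s:{struct}"))
    if year_from is not None or year_to is not None:
        # Assumption: filter on the production year; "*" leaves a bound open.
        low = "*" if year_from is None else str(year_from)
        high = "*" if year_to is None else str(year_to)
        params.append(("fq", f"producedDateY_i:[{low} TO {high}]"))
    return HAL_SEARCH_URL + "?" + urlencode(params)


# Same filters as the command above: CNRS, 2010 to 2016.
print(build_hal_query("CNRS", 2010, 2016))
```

The sketch only builds the URL and does not contact HAL, so it can be run offline.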

Output data will be saved to the datas directory in CSV format.
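
For a quick look at what was saved, the sketch below lists each CSV file in the output directory with its column names and row count. The file names and columns are not documented here, so nothing is assumed about them; the helper `summarize_csv_dir` is hypothetical, and UTF-8 encoding with comma separators is an assumption.

```python
import csv
from pathlib import Path


def summarize_csv_dir(directory):
    """Return {file name: (column names, number of data rows)} for each CSV file."""
    root = Path(directory)
    if not root.is_dir():
        return {}
    summary = {}
    for path in sorted(root.glob("*.csv")):
        # Assumption: files are UTF-8 encoded and comma-separated.
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])
            n_rows = sum(1 for _ in reader)
        summary[path.name] = (header, n_rows)
    return summary


if __name__ == "__main__":
    for name, (columns, n_rows) in summarize_csv_dir("datas").items():
        print(f"{name}: {n_rows} rows, columns={columns}")
```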

workflow

The workflow command runs one of the predefined workflows to produce data usable by Cartolabe.

It takes one required argument (the name of one of the predefined workflows) and one optional argument (the output directory).

To run the LRI workflow, activate the environment where you installed the package and run:

mkdir dumps
cartodata workflow -o dumps/lri lri

This runs the set of instructions in the cartodata/workflows/hal module and writes the results to the dumps/lri directory.

About

Cartolabe is a project developed by Inria and CNRS.
