Cartolabe-data
Cartolabe-data is the data processing part of the Cartolabe project. It contains utility functions to:
retrieve data from the HAL open archive API
extract entities (authors, teams, labs, words) from a set of documents
reduce dimensions and project on a 2D space
create named clusters
identify nearest neighbors for each entity
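As a rough illustration of two of the steps listed above (extracting word entities and identifying nearest neighbors), here is a minimal standard-library sketch. It is not the cartolabe-data implementation: the function names `word_counts`, `cosine` and `nearest_neighbors` are hypothetical, the sample documents are made up, and brute-force cosine similarity is used as a stand-in for whatever indexing the package relies on.

```python
import math
from collections import Counter


def word_counts(documents):
    """Represent each document as a Counter of its lowercase words."""
    return [Counter(doc.lower().split()) for doc in documents]


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_neighbors(vectors, k=1):
    """For each vector, return the indices of its k most similar other vectors."""
    result = []
    for i, vi in enumerate(vectors):
        scored = [(cosine(vi, vj), j) for j, vj in enumerate(vectors) if j != i]
        # Highest similarity first; ties are broken by the lower index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result.append([j for _, j in scored[:k]])
    return result


# Made-up documents, for illustration only.
docs = [
    "graph visualization of research topics",
    "visualization of large graph data",
    "protein folding simulation",
]
neighbors = nearest_neighbors(word_counts(docs), k=1)
print(neighbors)
```

The first two documents share three words, so each is the other's nearest neighbor. A real pipeline would also weight the counts and reduce dimensions before the neighbor search, which this sketch omits.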
Installation
Note: We recommend the use of a Python virtual env manager like conda or virtualenv.
First clone the source code:
git clone https://gitlab.inria.fr/cartolabe/cartolabe-data.git
cd cartolabe-data
It is preferable to install cartolabe-data in a Conda environment or Python virtual environment.
To create a Conda environment from the `environment.yml` file:
conda env create -f environment.yml
conda activate cartolabe-data # activate environment
This creates a Conda environment named cartolabe-data and installs the cartolabe-data package into it.
Alternatively, to create a Conda environment manually and install the package into it:
conda create -n cartodata-310 python==3.10.9
conda activate cartodata-310
pip install -e .
If nmslib is not installed, install it with:
conda install -c conda-forge nmslib
If hdbscan is not installed, install it with:
conda install -c conda-forge hdbscan
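To find out whether nmslib or hdbscan still needs to be installed, one option is to ask Python whether the modules can be located in the active environment. The helper below is a small sketch, not part of cartolabe-data; the name `missing_modules` is hypothetical.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(["nmslib", "hdbscan"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("nmslib and hdbscan are both available")
```

Run it with the environment activated; any name it reports can then be installed with the matching conda command above.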
To create a Python virtual environment:
python3.10 -m venv cartolabe_data # venv has no version option; it uses the interpreter that runs it
. cartolabe_data/bin/activate # activate environment
After creating the Python virtual environment, install the cartolabe-data package by running the following command from the project root directory:
pip install -e .
Run the tests
pip install -e .[test]
pytest
Notebooks
The best way to get started with cartolabe-data is to run through the set of example notebooks in the examples directory.
To run the examples:
pip install -e .[examples]
cd examples
jupyter notebook
Docker
It is also possible to run cartolabe-data from the Docker image without cloning or installing it. However, Docker must be installed on your host.
To run an interactive container from the image:
docker run -it --network=host registry.gitlab.inria.fr/cartolabe/cartolabe-data:latest
From the command line provided by the container, you can execute the CLI commands or start the Jupyter notebooks with the command:
jupyter notebook
Then open the HTTP link printed by Jupyter in your browser.
The notebooks are in the examples directory.
CLI commands
Once installed, the cartolabe-data package provides command-line scripts which can be executed in a terminal.
fetch-data
The fetch-data command will extract data from the HAL Open Archive. It takes three optional parameters:
-s <str> a research organization to filter publications for
-f <int> the min publication year
-t <int> the max publication year
To fetch articles published by the CNRS between 2010 and 2016, activate the environment where you installed the package and run:
cartodata fetch-data -s CNRS -f 2010 -t 2016
Output data will be saved to the datas directory in CSV format.
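To take a quick look at what fetch-data produced, the sketch below lists every CSV file in a directory with its column names and row count. The README does not give the output file names or layout, so the code assumes comma-separated, UTF-8 encoded files with a header row; `summarize_csv_files` is a hypothetical helper, not part of the package.

```python
import csv
from pathlib import Path


def summarize_csv_files(directory):
    """Return {file name: (column names, number of data rows)} for each CSV file."""
    directory = Path(directory)
    if not directory.is_dir():
        return {}
    summary = {}
    for path in sorted(directory.glob("*.csv")):
        # Assumption: UTF-8 encoding and a header on the first line.
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])
            row_count = sum(1 for _ in reader)
        summary[path.name] = (header, row_count)
    return summary


# "datas" is the output directory named above.
for name, (header, rows) in summarize_csv_files("datas").items():
    print(f"{name}: {rows} rows, columns: {', '.join(header)}")
```

If the directory does not exist yet, the function returns an empty dictionary and nothing is printed.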
workflow
The workflow command runs one of the predefined workflows to produce data usable by Cartolabe.
It takes one required argument (the name of one of the predefined workflows) and one optional argument (the output directory).
To run the LRI workflow, activate the environment where you installed the package and run:
mkdir dumps
cartodata workflow -o dumps/lri lri
This will run the set of instructions in the cartodata/workflows/hal module and output the results in the dumps/lri directory.
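When the workflow needs to be launched from a Python script rather than typed in a terminal, the two shell steps above can be mirrored as follows. This is a sketch that only wraps the command line shown in this section; `workflow_command` is a hypothetical helper, and it assumes the `cartodata` executable is on the PATH of the active environment.

```python
import subprocess
from pathlib import Path


def workflow_command(name, output_dir):
    """Build the argument list for `cartodata workflow -o <output_dir> <name>`.

    The parent of output_dir is created first, like `mkdir dumps` above.
    """
    output_dir = Path(output_dir)
    output_dir.parent.mkdir(parents=True, exist_ok=True)
    return ["cartodata", "workflow", "-o", str(output_dir), name]


def run_workflow(name, output_dir):
    """Run the workflow and raise CalledProcessError if the command fails."""
    subprocess.run(workflow_command(name, output_dir), check=True)
```

Calling `run_workflow("lri", "dumps/lri")` is then equivalent to the two terminal commands shown earlier.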
About
Cartolabe is a project developed by Inria & CNRS.