Merge pull request #5 from soda-inria/package2
Build New Version 0.0.14 of carte-ai Package
gaetanbrison authored Oct 28, 2024
2 parents 5adff03 + e4d7836 commit 6b0ad6a
Showing 50 changed files with 3,384 additions and 515 deletions.
131 changes: 131 additions & 0 deletions INSTALL.md
@@ -0,0 +1,131 @@
# CARTE: <br />Pretraining and Transfer for Tabular Learning

![CARTE_outline](carte_ai/data/etc/outline_carte.jpg)

This repository contains the implementation of the paper CARTE: Pretraining and Transfer for Tabular Learning.

CARTE is a pretrained model for tabular data that treats each table row as a star graph and trains a graph transformer on top of this representation.
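
To make the representation concrete, here is a minimal, library-independent sketch of the star-graph idea (illustrative only; the actual graph construction, node features, and edge attributes are handled by the library's Table2GraphTransformer):

```python
# Sketch (assumption: a simplified view, not the library's implementation).
# The row's entity is the center node; each (column, value) pair becomes a
# leaf node connected to the center by an edge labeled with the column name.
row = {"name": "Chateau Margaux", "region": "Bordeaux", "vintage": 2015}

center = "row_0"
edges = [(center, str(value), column) for column, value in row.items()]
for source, leaf, relation in edges:
    print(f"{source} --[{relation}]--> {leaf}")
```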

## Installation

**Required dependencies**

CARTE works with PyTorch and Python >= 3.10. Create a new environment with Python 3.10, and install the appropriate PyTorch version for your machine. Then, install the dependencies from the requirements.txt file in your environment:

```
pip install -r requirements.txt
```
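
For example, a typical setup with conda might look like this (the environment name is arbitrary, and the PyTorch install line should match your platform, per https://pytorch.org):

```
conda create -n carte python=3.10
conda activate carte
pip install torch  # choose the build matching your platform/CUDA from pytorch.org
pip install -r requirements.txt
```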

In requirements.txt, the correct version of the package `torch_scatter` depends on your specific PyTorch version. It is recommended to install the appropriate version by changing the first line ('--find-links') to the matching wheel index listed at https://data.pyg.org/whl/.
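
For instance, for a CPU-only PyTorch 2.1.0 installation, the first lines of requirements.txt might read (the version tag here is an assumption; match it to your installed torch):

```
--find-links https://data.pyg.org/whl/torch-2.1.0+cpu.html
torch_scatter
```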

To reproduce the results presented in our paper, also install the additional requirements from the requirements-optional.txt file in your environment:

```
pip install -r requirements-optional.txt
```

**Downloading data**

The required data (FastText embeddings, datasets, etc.) can be downloaded by running

```
python scripts/download_data.py -op <option for datasets> -ir <include raw data> -ik <include KEN data>
```

or by changing the options in the bash file and running it with

```
bash scripts/download_data.sh
```

Note that the code will download the FastText embeddings if they are not present under the `data/etc` folder. If the embeddings are stored in a different directory, change `config_directory["fasttext"]` in `configs/directory.py`.
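
For example, to point CARTE at an existing copy of the embeddings (the path is illustrative):

```python
# configs/directory.py — override the default FastText location (example path)
config_directory["fasttext"] = "/path/to/embeddings/cc.en.300.bin"
```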

The variables are:

- options (-op): Options to download preprocessed datasets used in our paper.<br/>
Stored under `data/data_singletable`.

- "carte" : No downloadings of datasets.
- "basic_examples" : Download 4 preprocessed datasets for running examples.
- "full_examples" : Download all 51 preprocessed datasets without the LLM features.
- "full_benchmark" : Download all 51 preprocessed datasets including the LLM features.

- include_raw (-ir) : Benchmark raw datasets <br/>
The original datasets without any preprocessing. "True" to download all 51 datasets or "False" otherwise. Stored under `data/data_raw`. See `scripts/preprocess_raw.py` for specific details on preprocessing.

- include_ken (-ik) : KEN (YAGO knowledge graph) embeddings <br/>
The KEN embeddings, which are knowledge graph embeddings of YAGO entities. "True" to download the embeddings or "False" otherwise. Stored under `data/etc`.

Example (in the prepared environment) that downloads the FastText embeddings and the 4 example datasets for running CARTE:

```
python scripts/download_data.py -op "basic_examples" -ir "False" -ik "False"
```

The datasets can also be found at https://huggingface.co/datasets/inria-soda/carte-benchmark.

## Getting started

The best way to get familiar with CARTE is through the examples. After setting up the datasets, work through the following examples as needed.

**Running CARTE for singletables:** <br/>follow through `examples/1. carte_single_tables.ipynb`

**Running CARTE for multitables:** <br/>follow through `examples/2. carte_joint_learning.ipynb`

<em>Note: To run through the examples, it is recommended to have at least 64GB of RAM for single tables and 128GB for multitables. We are currently working to reduce the memory consumption.</em>

## Reproducing results of CARTE paper

Currently, we provide code for generating results on single tables. Code for reproducing the multi-table results will be added in a future update.

To generate results for singletables, run:

```
python scripts/evaluate_singletable.py -dn <data name> -nt <train size> -m <method to evaluate> -rs <random state values> -b <include bagging> -dv <device to run>
```

The variables are:

- data_name (-dn): Name of the dataset.<br/>
specific name under the `data/data_singletable` folder or "all" to run all the datasets.

- num_train (-nt) : Train size to evaluate. <br/>
"all" to run train sizes of {32, 64, 128, 256, 512, 1024, 2048}.

- method (-m) : Method to evaluate (see `carte_singletable_baselines` in `configs/carte_configs`)<br/>
- "full" : the full list of baselines (see `carte_singletable_baselines['full']`).
- "reduced" : the reduced list of baselines used in the CARTE paper (see `carte_singletable_baselines['reduced']`).
- "f-r" : the baselines in the full list but not in the reduced list.
- "any other method" : any single method from `carte_singletable_baselines['full']`.

- random_state (-rs) : Random state value. <br/>
"all" to run train sizes of {1, 2, 3, ..., 10}

- bagging (-b) : Indicate to include the bagging strategy or not. <br/>
"True" to include the bagging strategy in analysis. Note that for neural-networks based models, it runs the bagging strategy even when it is set to "False".

- device (-dv) : <br/>
"cpu" to run on cpus or "cuda" to run on gpus. Requires some specifications if ran on gpus.

Example running the 'wina_pl' (Wina Poland) dataset with a train size of 128 and random state 1:
```
python scripts/evaluate_singletable.py -dn "wina_pl" -nt "128" -m "reduced" -rs "1" -b "False" -dv "cpu"
```
Running this will create a folder `results/singletable/wina_pl`, in which the results of each baseline will be stored as a csv file.

After obtaining the results under the `results/singletable` folder, run `scripts/compile_results_singletable.py` to compile the results into a single dataframe, which will be saved as a csv file named 'results_carte_baseline_singletable.csv' in the `results/compiled_results` folder.
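
Once compiled, the results can be loaded for inspection, for example:

```python
import pandas as pd

# Compiled results produced by scripts/compile_results_singletable.py
results = pd.read_csv("results/compiled_results/results_carte_baseline_singletable.csv")
print(results.head())
```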

Then, follow through `examples/3. carte_singletable_visualization` for visualization of the results.

<em>The script does not run the random search (as done in the CARTE paper). To ease computation and visualization, we provide the parameters for each baseline found by the random search. However, running the full comparison may take a long time, and it is recommended to run it on parallel computing machines (e.g., clusters). The evaluation script only provides guidelines for reproducing the results; modifications for parallelization suitable for each use case should be made. For visualization purposes, we also provide the compiled results.</em>

## Our paper

```
@article{kim2024carte,
  title={CARTE: pretraining and transfer for tabular learning},
  author={Kim, Myung Jun and Grinsztajn, L{\'e}o and Varoquaux, Ga{\"e}l},
  journal={arXiv preprint arXiv:2402.16785},
  year={2024}
}
```
7 changes: 7 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,7 @@
# Include all parquet files in specific directories
#recursive-include carte_ai/data/ *.parquet
#recursive-include carte/data/ *.json


# Include specific model and binary files
include carte_ai/data/etc/kg_pretrained.pt
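
Note that MANIFEST.in governs source distributions; for `kg_pretrained.pt` to also ship inside wheels, the build configuration typically needs `include_package_data=True` (a packaging assumption, not part of this diff):

```python
# setup.py sketch (assumed configuration; not shown in this commit)
from setuptools import setup, find_packages

setup(
    name="carte-ai",
    packages=find_packages(),
    include_package_data=True,  # honor MANIFEST.in entries for files inside packages
)
```
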
159 changes: 76 additions & 83 deletions README.md
@@ -1,125 +1,118 @@
[![Downloads](https://img.shields.io/pypi/dm/carte-ai)](https://pypi.org/project/carte-ai/)
[![PyPI Version](https://img.shields.io/pypi/v/carte-ai)](https://pypi.org/project/carte-ai/)
[![Python Version](https://img.shields.io/pypi/pyversions/carte-ai)](https://pypi.org/project/carte-ai/)
[![Code Style: Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)


# CARTE: <br />Pretraining and Transfer for Tabular Learning

![CARTE_outline](carte_ai/data/etc/outline_carte.jpg)

This repository contains the implementation of the paper CARTE: Pretraining and Transfer for Tabular Learning.

CARTE is a pretrained model for tabular data that treats each table row as a star graph and trains a graph transformer on top of this representation.

## Colab Examples (Give it a test):
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1PeltEmNLehQ26VQtFJhl7OxnzCS8rPMT?usp=sharing)
* CARTERegressor on Wine Poland dataset
* CARTEClassifier on Spotify dataset


### 01 Install 🚀

The library has been tested on Linux, macOS, and Windows.

CARTE-AI can be installed from [PyPI](https://pypi.org/project/carte-ai):

```
pip install carte-ai
```

#### Post installation check
After a correct installation, you should be able to import the module without errors:

```python
import carte_ai
```

### 02 CARTE-AI example on sampled data step by step ➡️

#### 1️⃣ Load the Data 💽
```python
import pandas as pd
from carte_ai.data.load_data import *

num_train = 128  # number of training samples to draw
random_state = 1  # random seed for reproducibility
X_train, X_test, y_train, y_test = wina_pl(num_train, random_state)
print("Wina Poland dataset:", X_train.shape, X_test.shape)
```
![sample](images/data_wina.png)

#### 2️⃣ Convert Table 2 Graph 🪵

The basic preparations are:
- preprocess raw data
- load the prepared data and configs; set the train/test split
- generate a graph for each table entry (row) using the Table2GraphTransformer
- create an estimator and run inference

```python
import fasttext
from huggingface_hub import hf_hub_download
from carte_ai import Table2GraphTransformer

# Download the FastText embeddings used to encode string entries
model_path = hf_hub_download(repo_id="hi-paris/fastText", filename="cc.en.300.bin")

preprocessor = Table2GraphTransformer(fasttext_model_path=model_path)

# Fit and transform the training data
X_train = preprocessor.fit_transform(X_train, y=y_train)

# Transform the test data
X_test = preprocessor.transform(X_test)
```
![sample](images/t2g.png)

#### 3️⃣ Make Predictions 🔮
For learning, CARTE currently follows the scikit-learn interface (fit/predict), and the process is:
- define the parameters
- set the estimator
- run `fit` to train the model and `predict` to make predictions

```python
from sklearn.metrics import r2_score
from carte_ai import CARTERegressor, CARTEClassifier
from carte_ai.configs.directory import config_directory

# Define some parameters
fixed_params = dict()
fixed_params["num_model"] = 10  # 10 models for the bagging strategy
fixed_params["disable_pbar"] = False  # set True to silence the progress bar
fixed_params["random_state"] = 0
fixed_params["device"] = "cpu"
fixed_params["n_jobs"] = 10
fixed_params["pretrained_model_path"] = config_directory["pretrained_model"]

# Define the estimator and run fit/predict
estimator = CARTERegressor(**fixed_params)  # CARTERegressor for regression
estimator.fit(X=X_train, y=y_train)
y_pred = estimator.predict(X_test)

# Obtain the R2 score on the predictions
score = r2_score(y_test, y_pred)
print(f"The R2 score for CARTE: {score:.4f}")
```

![sample](images/performance.png)
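
The classifier follows the same pattern. Here is a minimal sketch for a classification task, reusing the `preprocessor` and `fixed_params` from above; the `spotify` loader name and the metric choice are our assumptions (see the Colab notebook for the worked Spotify example):

```python
from sklearn.metrics import accuracy_score
from carte_ai import CARTEClassifier
from carte_ai.data.load_data import spotify  # assumed loader, analogous to wina_pl

X_train, X_test, y_train, y_test = spotify(num_train=128, random_state=1)
X_train = preprocessor.fit_transform(X_train, y=y_train)
X_test = preprocessor.transform(X_test)

# CARTEClassifier is assumed to mirror the regressor's fit/predict interface
estimator = CARTEClassifier(**fixed_params)
estimator.fit(X=X_train, y=y_train)
y_pred = estimator.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```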

### 03 Reproducing paper results ⚙️

➡️ [installation instructions setup paper](INSTALL.md)

### 04 CARTE-AI references 📚

```
@article{kim2024carte,
  title={CARTE: pretraining and transfer for tabular learning},
  author={Kim, Myung Jun and Grinsztajn, L{\'e}o and Varoquaux, Ga{\"e}l},
  journal={arXiv preprint arXiv:2402.16785},
  year={2024}
}
```
5 changes: 5 additions & 0 deletions carte_ai/__init__.py
@@ -0,0 +1,5 @@
from carte_ai.src import *
from carte_ai.configs import *
from carte_ai.data import *
from carte_ai.scripts import *
from .src import CARTERegressor, CARTEClassifier, Table2GraphTransformer
4 changes: 4 additions & 0 deletions carte_ai/configs/__init__.py
@@ -0,0 +1,4 @@
from carte_ai.configs.carte_configs import *
from carte_ai.configs.directory import *
from carte_ai.configs.model_parameters import *
from carte_ai.configs.visuailization import *
File renamed without changes.
10 changes: 4 additions & 6 deletions configs/directory.py → carte_ai/configs/directory.py
@@ -1,15 +1,13 @@
"""
Configurations for directory
"""

from pathlib import Path

# Get the base path relative to this file's location (the carte_ai package root)
base_path = Path(__file__).resolve().parent.parent

config_directory = dict()
config_directory["base_path"] = base_path

config_directory["data"] = str(base_path / "data/")
config_directory["pretrained_model"] = str(base_path / "data/etc/kg_pretrained.pt")
config_directory["pretrained_model"] = str(base_path / "data/etc/kg_pretrained.pt") # Correct path
config_directory["data_raw"] = str(base_path / "data/data_raw/")
config_directory["data_singletable"] = str(base_path / "data/data_singletable/")
config_directory["data_yago"] = str(base_path / "data/data_yago/")
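
With this change, paths resolve relative to the installed package rather than the caller's working directory, e.g.:

```python
from carte_ai.configs.directory import config_directory

# Resolves inside the installed carte_ai package, regardless of the current working directory
print(config_directory["pretrained_model"])
```
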
File renamed without changes.
File renamed without changes.
1 change: 1 addition & 0 deletions carte_ai/data/__init__.py
@@ -0,0 +1 @@
from carte_ai.data.load_data import *
1 change: 1 addition & 0 deletions carte_ai/data/config_spotify.json
@@ -0,0 +1 @@
{"entity_name": "track", "target_name": "popularity", "task": "classification", "repeated": false}
1 change: 1 addition & 0 deletions carte_ai/data/config_wine_dot_com_prices.json
@@ -0,0 +1 @@
{"entity_name": "Names", "target_name": "Prices", "task": "regression", "repeated": false}
1 change: 1 addition & 0 deletions carte_ai/data/config_wine_pl.json
@@ -0,0 +1 @@
{"entity_name": "name", "target_name": "price", "task": "regression", "repeated": false}
1 change: 1 addition & 0 deletions carte_ai/data/config_wine_vivino_price.json
@@ -0,0 +1 @@
{"entity_name": "Name", "target_name": "Price", "task": "regression", "repeated": false}
File renamed without changes.
File renamed without changes.
