Adding the libre textbooks #149
Hopefully this PR is OK for our first version of this dataset. In the next version, I'd like to remove exercises along with their solutions from the dataset and encode chemicals in a consistent format. Ps.
pre-commit.ci autofix
for more information, see https://pre-commit.ci
Hey @hssn-20, thank you very much for the PR! 🙏
import yaml


LINES_TO_REMOVE = "/workspaces/chemnlp/data/libre_textbooks/lines_to_remove.jsonl"
This is not used below. Are those lines already removed on the HF dataset upload?
data/libre_textbooks/transform.py
Outdated
"identifiers": [
    {
        "id": "url ",  # column name
        "type": "OTHER",  # can be "SMILES", "SELFIES", "IUPAC", "OTHER"
Did the commit hooks run through with "OTHER" (capital letters)?
"id": "html",  # name of the column in a tabular dataset
"description": "A scraped page from libre textbooks",
"units": None,  # units of the values in this column (leave empty if unitless)
"type": "string",  # can be "categorical", "ordinal", "continuous", "string"
Suggested change:
- "type": "string",  # can be "categorical", "ordinal", "continuous", "string"
+ "type": "text",  # can be "categorical", "ordinal", "continuous", "text"
- id: text_length
  type: int
  description: text character count
- id: text_length
  type: int
  description: text character count
This script imports an uploaded libre chemistry textbooks dataset from Hugging Face, cleans the data by removing hyperlinks, licenses, and chapter headers, and then removes specific lines that were selected manually. The cleaned data is then saved, and a metadata YAML file is generated from a template. Here's a Colab notebook that implements the process.
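The cleaning steps described above might look roughly like the sketch below. This is not the code in transform.py: the regular expressions, the `text` field assumed in the lines_to_remove JSONL records, and the function names `load_lines_to_remove` and `clean_page` are all hypothetical and only illustrate the idea.

```python
import json
import re

# Hypothetical patterns; the real transform.py may use different rules.
URL_RE = re.compile(r"https?://\S+")
LICENSE_RE = re.compile(r"\blicen[sc]e\b|\bCC BY\b", re.IGNORECASE)
CHAPTER_RE = re.compile(r"^(chapter\s+\d+|\d+(\.\d+)*:)", re.IGNORECASE)


def load_lines_to_remove(jsonl_text):
    """Parse JSONL text; each record is assumed to carry a 'text' field."""
    return {
        json.loads(row)["text"]
        for row in jsonl_text.splitlines()
        if row.strip()
    }


def clean_page(text, lines_to_remove=frozenset()):
    """Strip hyperlinks, license lines, chapter headers, and listed lines."""
    kept = []
    for line in text.splitlines():
        # Drop URLs, then collapse the whitespace they leave behind.
        line = " ".join(URL_RE.sub("", line).split())
        if not line:
            continue
        if LICENSE_RE.search(line) or CHAPTER_RE.match(line):
            continue
        if line in lines_to_remove:
            continue
        kept.append(line)
    return "\n".join(kept)


# Made-up example page and removal list, for illustration only.
page = (
    "1.2: Atomic Structure\n"
    "Atoms contain protons. See https://example.org/x for details.\n"
    "This page is shared under a CC BY license.\n"
    "Click here to expand\n"
    "Water is H2O."
)
removal = load_lines_to_remove('{"text": "Click here to expand"}\n')
cleaned = clean_page(page, removal)
print(cleaned)
```

Keeping the manually selected lines in a set makes the per-line lookup cheap, and passing the set in explicitly avoids a module-level constant that is defined but never used.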