
Merging recomm_sys #72

Open · wants to merge 23 commits into master
Conversation

akkadhim commented Dec 25, 2024

This PR adds the recommendation system experiments. Please ignore any changes outside the examples/recomm_system directory.

BooBSD commented Dec 26, 2024

@akkadhim Could you please export your noisy datasets to a CSV file for testing in other languages?

akkadhim (Author) replied:

> @akkadhim Could you please export your noisy datasets to a CSV file for testing in other languages?

Sure, below are the datasets for the different noise ratios:

noisy_dataset_0.05.csv
noisy_dataset_0.005.csv
noisy_dataset_0.02.csv
noisy_dataset_0.2.csv
noisy_dataset_0.01.csv
noisy_dataset_0.1.csv
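
For context, a minimal sketch of how files like these could be produced (illustrative only, not the PR's code; the `rating` target column and the flip-based noise model are assumptions):

```python
# Illustrative sketch, not the code from examples/recomm_system.
# Flips a `ratio` fraction of target values to simulate label noise,
# then writes one CSV per ratio, mirroring the file names above.
import numpy as np
import pandas as pd

def export_noisy_dataset(df: pd.DataFrame, ratio: float, seed: int = 0) -> None:
    rng = np.random.default_rng(seed)
    noisy = df.copy()
    rows = rng.choice(len(noisy), size=int(ratio * len(noisy)), replace=False)
    # "rating" is a hypothetical target column; overwrite with random valid values.
    noisy.iloc[rows, noisy.columns.get_loc("rating")] = rng.choice(
        df["rating"].unique(), size=len(rows)
    )
    noisy.to_csv(f"noisy_dataset_{ratio}.csv", index=False)

for ratio in (0.005, 0.01, 0.02, 0.05, 0.1, 0.2):
    export_noisy_dataset(base_df, ratio)  # base_df: the expanded dataset
```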

BooBSD commented Dec 27, 2024

@akkadhim Thank you!

BooBSD commented Dec 27, 2024

@akkadhim Is it correct that, after one-hot booleanization, your input data consists of 10709 bits? This includes 1350 unique product_ids + 317 categories + 9042 user_ids.

akkadhim (Author) replied:

> @akkadhim Is it correct that, after one-hot booleanization, your input data consists of 10709 bits? This includes 1350 unique product_ids + 317 categories + 9042 user_ids.

After expanding the original dataset and adding the noise, the unique feature counts are:

Users: 1193
Items: 1350
Categories: 211

I used one_hot_encoding for the TM classifier, and at that step the dataset is split into train and test portions.
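
A minimal sketch of that booleanization step (illustrative; the column and target names are assumptions, and the actual one_hot_encoding helper lives in the PR's code):

```python
# Illustrative one-hot booleanization: every unique user, item, and category
# value becomes one input bit (1193 + 1350 + 211 = 2754 bits for the counts above).
# Note that joined multi-value strings are treated as single categories here.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("noisy_dataset_0.005.csv")
X = pd.get_dummies(df[["user_id", "product_id", "category"]], dtype="uint8")
y = df["rating"]  # hypothetical target column

X_train, X_test, y_train, y_test = train_test_split(
    X.to_numpy(), y.to_numpy(), test_size=0.2, shuffle=False
)
```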

BooBSD commented Dec 27, 2024

@akkadhim
Got it. However, the columns category and user_id contain lists of categories and users joined by the "|" and "," characters (for example: "Computers&Accessories|Accessories&Peripherals|Cables&Accessories|Cables|USBCables" or "AH4BURHCF5UQFZR4VJQXBEQCTYVQ,AGSJLPK6HU2FB4HII64NQ3OYFFFA,AGG75KFRXNLCYVRAPA6D4ZBNTNSA"). Why weren't they split into individual unique categories and user IDs? Could you confirm that your booleanization method is correct?
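
For comparison, splitting those joined fields into individual binary features (multi-hot) is straightforward with pandas (illustrative sketch, not the code used in this PR):

```python
# Multi-hot booleanization: split the joined strings so each individual
# category and user ID gets its own input bit.
import pandas as pd

df = pd.read_csv("noisy_dataset_0.005.csv")
cat_bits = df["category"].str.get_dummies(sep="|")   # one bit per category
user_bits = df["user_id"].str.get_dummies(sep=",")   # one bit per user ID
item_bits = pd.get_dummies(df["product_id"])         # one bit per product
X = pd.concat([item_bits, cat_bits, user_bits], axis=1).to_numpy(dtype="uint8")
```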

BooBSD commented Dec 27, 2024

@akkadhim
I tested both booleanization methods (yours and mine) and obtained approximately the same validation accuracy. I split your dataset so that the first 80% is used for training and the last 20% for validation (see the sketch after the log below).

My best validation accuracy:

  • noisy_dataset_0.005.csv: 99.73%
  • noisy_dataset_0.2.csv: 84.87%

Here is the proof:

#1  Accuracy: 83.81%  Best: 83.81%  Training: 1.946s  Testing: 0.107s
#2  Accuracy: 96.69%  Best: 96.69%  Training: 0.609s  Testing: 0.009s
#3  Accuracy: 99.69%  Best: 99.69%  Training: 0.442s  Testing: 0.008s
#4  Accuracy: 99.69%  Best: 99.69%  Training: 0.350s  Testing: 0.007s
#5  Accuracy: 99.69%  Best: 99.69%  Training: 0.279s  Testing: 0.007s
#6  Accuracy: 99.69%  Best: 99.69%  Training: 0.238s  Testing: 0.006s
#7  Accuracy: 99.69%  Best: 99.69%  Training: 0.192s  Testing: 0.006s
#8  Accuracy: 99.69%  Best: 99.69%  Training: 0.178s  Testing: 0.006s
#9  Accuracy: 99.69%  Best: 99.69%  Training: 0.173s  Testing: 0.006s
#10  Accuracy: 99.69%  Best: 99.69%  Training: 0.147s  Testing: 0.005s
....
#300  Accuracy: 99.73%  Best: 99.73%  Training: 0.085s  Testing: 0.003s
#301  Accuracy: 99.69%  Best: 99.73%  Training: 0.090s  Testing: 0.003s
#302  Accuracy: 99.73%  Best: 99.73%  Training: 0.086s  Testing: 0.003s
#303  Accuracy: 99.73%  Best: 99.73%  Training: 0.084s  Testing: 0.003s
#304  Accuracy: 99.73%  Best: 99.73%  Training: 0.081s  Testing: 0.003s
#305  Accuracy: 99.73%  Best: 99.73%  Training: 0.089s  Testing: 0.003s
#306  Accuracy: 99.73%  Best: 99.73%  Training: 0.080s  Testing: 0.003s
#307  Accuracy: 99.73%  Best: 99.73%  Training: 0.081s  Testing: 0.003s
#308  Accuracy: 99.73%  Best: 99.73%  Training: 0.089s  Testing: 0.003s
#309  Accuracy: 99.73%  Best: 99.73%  Training: 0.088s  Testing: 0.003s
#310  Accuracy: 99.69%  Best: 99.73%  Training: 0.083s  Testing: 0.003s
#311  Accuracy: 99.69%  Best: 99.73%  Training: 0.081s  Testing: 0.003s
#312  Accuracy: 99.73%  Best: 99.73%  Training: 0.082s  Testing: 0.003s
#313  Accuracy: 99.73%  Best: 99.73%  Training: 0.079s  Testing: 0.003s
#314  Accuracy: 99.69%  Best: 99.73%  Training: 0.081s  Testing: 0.003s
#315  Accuracy: 99.73%  Best: 99.73%  Training: 0.083s  Testing: 0.003s
#316  Accuracy: 99.73%  Best: 99.73%  Training: 0.088s  Testing: 0.003s
#317  Accuracy: 99.73%  Best: 99.73%  Training: 0.085s  Testing: 0.003s
#318  Accuracy: 99.73%  Best: 99.73%  Training: 0.086s  Testing: 0.003s
#319  Accuracy: 99.73%  Best: 99.73%  Training: 0.088s  Testing: 0.003s
#320  Accuracy: 99.73%  Best: 99.73%  Training: 0.091s  Testing: 0.003s

These results were obtained on a CPU, and it works quite fast.
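
For reference, a minimal sketch of that chronological 80/20 split (illustrative Python, not the code that produced the log above):

```python
# Chronological split: first 80% of rows train, last 20% validate (no shuffling).
import pandas as pd

df = pd.read_csv("noisy_dataset_0.2.csv")
cut = int(0.8 * len(df))
train_df, valid_df = df.iloc[:cut], df.iloc[cut:]
```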

akkadhim (Author) replied:

> @akkadhim Got it. However, the columns category and user_id contain lists of categories and users, joined by the "|" and "," characters [...]. Why weren't they split into individual unique categories and user IDs? Could you confirm that your booleanization method is correct?

For user_id, the CSV formatting rules already handle such cases by enclosing the value in double quotes, while the category column keeps the original structure of the dataset. Splitting these fields would alter the representation of hierarchical categories and their associated user IDs.
Yes, the booleanization is correct.
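
To illustrate the quoting point, this is standard csv-module behavior (minimal example; the user IDs are from the thread, the product ID is a placeholder):

```python
# A comma-joined user_id list round-trips as a single quoted CSV field.
import csv, io

buf = io.StringIO()
csv.writer(buf).writerow(
    ["product_x", "AH4BURHCF5UQFZR4VJQXBEQCTYVQ,AGSJLPK6HU2FB4HII64NQ3OYFFFA"]
)
print(buf.getvalue())
# product_x,"AH4BURHCF5UQFZR4VJQXBEQCTYVQ,AGSJLPK6HU2FB4HII64NQ3OYFFFA"

row = next(csv.reader(io.StringIO(buf.getvalue())))
assert row[1] == "AH4BURHCF5UQFZR4VJQXBEQCTYVQ,AGSJLPK6HU2FB4HII64NQ3OYFFFA"
```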

akkadhim (Author) replied:

> @akkadhim I tested both booleanization methods (yours and mine) and obtained approximately the same validation accuracy. [...] These results were obtained on a CPU, and it works quite fast.

Very impressive! Nice work, @BooBSD!
