Dynamically add table aliases without an LLM + de-duplicate columns in pandas #155

Merged
merged 12 commits into from
Jun 5, 2024

Member

@rishsriv rishsriv commented Jun 4, 2024

This automatically generates relevant table aliases and appends them to the prompt, so the LLM no longer has to come up with the aliases itself. We may have to retrain our LLM on this kind of prompt so that it learns to expect a more varied set of inputs.

Here's an example of how to run this.

python main.py \
-db postgres \
-q "data/questions_gen_postgres.csv" "data/instruct_basic_postgres.csv" "data/instruct_advanced_postgres.csv" "data/idk.csv" \
-o results/classic_new_reprompt.csv results/basic_new_reprompt.csv results/advanced_new_reprompt.csv results/idk_new_reprompt.csv \
-g api \
-b 1 \
-f prompts/prompt_cot.md \
--api_url "YOUR_API_ENDPOINT" \
--api_type "vllm" \
-p 10 \
-c 0 --logprobs --cot_table_alias
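
The thread does not show how the aliases are produced. As a rough illustration of generating them without an LLM, here is a minimal sketch that derives each alias from the initials of the table name. The function names, the collision rule, and the comment-style output format are all assumptions made for this example, not the PR's implementation.

```python
import re


def extract_table_names(ddl):
    """Pull table names out of CREATE TABLE statements (hypothetical helper)."""
    pattern = r"CREATE TABLE\s+(?:IF NOT EXISTS\s+)?([\w.]+)"
    return re.findall(pattern, ddl, flags=re.IGNORECASE)


def generate_aliases(table_names):
    """Map each table name to a short, unique alias built from its initials."""
    aliases = {}
    used = set()
    for name in table_names:
        # Ignore any schema prefix, e.g. "public.order_items" -> "order_items".
        base = name.split(".")[-1]
        # First letter of each underscore-separated word: "order_items" -> "oi".
        words = [w for w in re.split(r"_+", base) if w]
        alias = "".join(w[0] for w in words).lower() or "t"
        # Assumed collision rule: append 2, 3, ... until the alias is unused.
        candidate, n = alias, 2
        while candidate in used:
            candidate = f"{alias}{n}"
            n += 1
        used.add(candidate)
        aliases[name] = candidate
    return aliases


def format_alias_block(aliases):
    """Render the aliases as text that could be appended to a prompt."""
    return "\n".join(f"-- {table} AS {alias}" for table, alias in aliases.items())


if __name__ == "__main__":
    ddl = "CREATE TABLE users (id int);\nCREATE TABLE order_items (id int);"
    print(format_alias_block(generate_aliases(extract_table_names(ddl))))
```

Because the aliases are computed deterministically from the schema, the same DDL always yields the same alias block, which is the property that lets the prompt carry the aliases instead of the model.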

Collaborator

@wongjingping wongjingping left a comment

Thanks for the deduplicating fix and the updates! 2 small comments

)

if "cot_instructions":
Collaborator

nit: shall we add a separate argument in generate_prompt, say cot_pregen, to tack on the table aliases? I was thinking cot_instructions would hold the "instructions" for what to do, while cot_pregen would hold the actual pre-generated output (e.g. the table aliases).

Member Author

Aye sounds good!
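
To make the suggested split concrete, here is a minimal sketch of a prompt builder that takes cot_instructions and cot_pregen as separate arguments. The real generate_prompt signature is not shown in this thread, so the function name, the simplified template, and the position of the {cot_pregen} placeholder are assumptions for illustration only.

```python
# Simplified stand-in for the prompt file; the {cot_pregen} placement is assumed.
TEMPLATE = (
    "DDL statements:\n"
    "{table_metadata_string}\n\n"
    "{cot_pregen}"
    "{cot_instructions}"
    "Generate a valid SQL query that answers the question `{user_question}`."
)


def generate_prompt_sketch(
    user_question,
    table_metadata_string,
    cot_instructions="",
    cot_pregen="",
    template=TEMPLATE,
):
    """Fill the template, keeping instructions and pre-generated text separate."""
    return template.format(
        user_question=user_question,
        table_metadata_string=table_metadata_string,
        cot_instructions=cot_instructions,  # tells the model what to do
        cot_pregen=cot_pregen,  # text computed ahead of time, e.g. table aliases
    )


if __name__ == "__main__":
    print(
        generate_prompt_sketch(
            "How many users are there?",
            "CREATE TABLE users (id int);",
            cot_pregen="-- users AS u\n\n",
        )
    )
```

Both arguments default to an empty string, so a prompt built without either one contains no extra text and keeps the plain direct-to-SQL form.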

DDL statements:
{table_metadata_string}

- {cot_instructions}Generate a valid SQL query that answers the question `{user_question}`, and only references the tables and columns in the DDL statements.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
+ Generate a valid SQL query that answers the question `{user_question}`, and only references the tables and columns in the DDL statements.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Collaborator

Shall we keep the cot_instructions field here? It lets the model know that it should generate the aliases first, followed by the SQL.
Currently, in the data, we use this instruction to tell the model when it should generate the table aliases and when it should directly generate the SQL (this makes it easier to mix both types of data).

Member Author

Sounds good – making a fix now!

Member Author

rishsriv commented Jun 5, 2024

Fixed! We can now use --cot_table_alias instruct to get the model to use {cot_instructions}, or --cot_table_alias pregen to pre-generate the table aliases.
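
One way a two-mode flag like this could be wired up is sketched below with argparse. Only the flag name and its two values come from the comment above. Restricting the flag with choices, the default of None, the instruction wording, and the resolve_cot_fields helper are assumptions for this example and may differ from the merged code.

```python
import argparse

# Hypothetical instruction text; the real wording is not shown in this thread.
ALIAS_INSTRUCTION = "List the table aliases you will use, then write the SQL. "


def build_parser():
    """Parser with only the flag discussed here (other flags omitted)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--cot_table_alias",
        choices=["instruct", "pregen"],
        default=None,  # assumed: omitting the flag disables both behaviours
    )
    return parser


def resolve_cot_fields(mode, alias_block=""):
    """Translate the flag value into the two prompt fields."""
    if mode == "instruct":
        # Ask the model to produce the aliases itself before the SQL.
        return {"cot_instructions": ALIAS_INSTRUCTION, "cot_pregen": ""}
    if mode == "pregen":
        # Supply aliases that were computed ahead of time.
        return {"cot_instructions": "", "cot_pregen": alias_block}
    return {"cot_instructions": "", "cot_pregen": ""}


if __name__ == "__main__":
    args = build_parser().parse_args(["--cot_table_alias", "pregen"])
    print(resolve_cot_fields(args.cot_table_alias, "-- users AS u\n"))
```

Keeping the two modes mutually exclusive in a single flag means a prompt never contains both an instruction to generate aliases and a pre-generated alias list.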

@rishsriv rishsriv merged commit fabdc91 into main Jun 5, 2024
2 checks passed
@rishsriv rishsriv deleted the rishabh/dedup-columns branch June 5, 2024 03:37