Make Wikipedia corpus metadata accessible #3007

Open
wants to merge 9 commits into
base: develop

Conversation

@kumarneelabh13 kumarneelabh13 commented Nov 26, 2020

Background: my project needs each article's text together with its title. The current implementation gives users no way to retrieve the title (i.e. the metadata).

Allow users to access metadata by letting a parameter set self.metadata in WikiCorpus. However, Dictionary() raises "TypeError: decoding to str: need a bytes-like object, list found" if get_texts() returns metadata. I therefore introduced a dictionary_mode parameter in get_texts() so that the metadata bypasses the dictionary and goes directly to the user (if the user sets metadata=True).
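
To make the intended behaviour concrete, here is a minimal, self-contained sketch of the idea. It does not use gensim: `ToyWikiCorpus` and `build_vocab` are hypothetical stand-ins for WikiCorpus and Dictionary, and the `(tokens, (pageid, title))` tuple shape is an assumption made for illustration. Only the `metadata` and `dictionary_mode` names come from the description above.

```python
from collections import Counter


class ToyWikiCorpus:
    """Hypothetical stand-in for WikiCorpus; this is not gensim code."""

    def __init__(self, articles, metadata=False):
        # articles: list of (pageid, title, text) tuples (assumed shape)
        self.articles = articles
        self.metadata = metadata

    def get_texts(self, dictionary_mode=False):
        for pageid, title, text in self.articles:
            tokens = text.lower().split()
            if self.metadata and not dictionary_mode:
                # User-facing iteration: tokens plus metadata.
                yield tokens, (pageid, title)
            else:
                # Vocabulary building: plain token lists only.
                yield tokens


def build_vocab(documents):
    """Simplified vocabulary builder that expects each document to be a list of str."""
    counts = Counter()
    for doc in documents:
        for token in doc:
            if not isinstance(token, str):
                raise TypeError("expected str token, got %s" % type(token).__name__)
            counts[token] += 1
    return counts


articles = [
    (1, "Anarchism", "Anarchism is a political philosophy"),
    (2, "Autism", "Autism is a disorder"),
]
corpus = ToyWikiCorpus(articles, metadata=True)

# The vocabulary is built from plain token lists, so metadata never reaches it.
vocab = build_vocab(corpus.get_texts(dictionary_mode=True))

# The user still receives the titles when iterating normally.
titles = [meta[1] for _tokens, meta in corpus.get_texts()]
```

Passing `corpus.get_texts()` straight to `build_vocab` fails in this sketch, because each item is then a `(tokens, metadata)` tuple rather than a list of strings. That is the same kind of mismatch as the TypeError quoted above.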

Let users get metadata (e.g. the title) if they need it. Added an argument to WikiCorpus __init__() that specifies whether metadata is needed. Previously, it was always set to False and could not be toggled.
Make Wikipedia corpus metadata accessible.
@@ -612,6 +613,8 @@ def __init__(self, fname, processes=None, lemmatize=utils.has_pattern(), diction
If set, each XML article element will be passed to this callable before being processed. Only articles
where the callable returns an XML element are processed, returning None allows filtering out
some articles based on customised rules.
metadata: bool, optional
if True - write article titles to corpus
Owner

What "corpus"? Please make the docstring more explicit, less cryptic (and properly capitalized and punctuated, like the others).

Owner

You can edit PRs in place – no need to open a new PR for each change.

Author

@kumarneelabh13 kumarneelabh13 Nov 26, 2020

Thanks for being patient with my first PR (ever!).
Updated documentation. Updated PR description.

> What "corpus"? Please make the docstring more explicit, less cryptic (and properly capitalized and punctuated, like the others).

  • Updated documentation (copied an existing comment describing the 'metadata' parameter).

Collaborator

mpenkov commented Jun 29, 2021

@kumar-neelabh Can you please add some tests for your new functionality?

@mpenkov mpenkov added the stale Waiting for author to complete contribution, no recent effort label Jun 29, 2021
Labels
stale Waiting for author to complete contribution, no recent effort
3 participants