
0.2.0

@GuillaumeDD GuillaumeDD released this 14 Jun 12:53
9b22990

Release: 0.2.0

This release adds new keyword extraction algorithms and example notebooks, and fixes some bugs.

Graph-of-words representation

Addition of an edge weighting option to the graph-of-words representation. The weight of an edge is the number of times its two tokens co-occur.
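To make the weighting concrete, here is a minimal pure-Python sketch of co-occurrence counting over a sliding window. It is not gowpy's implementation: the function name `cooccurrence_weights`, the exact window definition (each token paired with the next `window_size - 1` tokens) and the choice to ignore self-loops are assumptions made for illustration.

```python
from collections import Counter


def cooccurrence_weights(tokens, window_size=4):
    """Count how often each unordered token pair co-occurs in a sliding window.

    Hypothetical helper for illustration only, not part of gowpy.
    """
    weights = Counter()
    for i, source in enumerate(tokens):
        # Pair the current token with the next (window_size - 1) tokens.
        for target in tokens[i + 1 : i + window_size]:
            if source != target:  # assumption: self-loops are ignored
                weights[frozenset((source, target))] += 1
    return weights


tokens = "graph of words graph of words".split()
weights = cooccurrence_weights(tokens, window_size=2)

for pair, weight in sorted(weights.items(), key=lambda item: sorted(item[0])):
    print(sorted(pair), weight)
```

With a window of 2, only adjacent tokens are linked, so the pair ("graph", "of") that appears twice in the text gets a weight of 2.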

Keyword Extraction

Implementation of the density and inflexion k-core selection methods

k-core approaches select cohesive keywords: the selected keywords correspond to a cohesive subgraph. In other words, selection operates at the level of cohesive subgraphs, so nodes are selected in whole batches rather than one at a time. A key property is that the number of selected keywords is automatically adaptive.

Three selection methods are now available, all based on the k-core decomposition of the graph-of-words. The 'maximum' method simply selects the main core (the k-core with maximum k). This is the default method, but it can be too restrictive. Two other selection methods alleviate this limitation. On the one hand, the 'density' method goes down the hierarchy of k-cores and selects the one that best retains cohesiveness, as measured by the density of the k-core. The most appropriate k-core is selected via the elbow method. On the other hand, the 'inflexion' method exploits the k-shell (the part of the k-core that does not survive in the (k+1)-core). It goes down the hierarchy of k-cores as long as the shells increase in size, and stops otherwise.

Example on 'density' method:

from gowpy.summarization.unsupervised import KcoreKeywordExtractor

extractor_kw = KcoreKeywordExtractor(directed=False, weighted=True, window_size=4,
                                     # Parameter to set the selection method
                                     selection_method='density')
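The k-core vocabulary used above (core number, main core, k-shell) can be illustrated with a small pure-Python sketch. It is a sketch under stated assumptions rather than gowpy's code: all function names are hypothetical, the graph is treated as undirected and unweighted, and `inflexion_core` is only a literal reading of "go down the hierarchy as long as the shells increase in size". The elbow-based 'density' selection is not covered here.

```python
def core_numbers(adjacency):
    """Core number of each node, by repeatedly peeling a minimum-degree node."""
    degrees = {node: len(neighbours) for node, neighbours in adjacency.items()}
    remaining = set(adjacency)
    core, k = {}, 0
    while remaining:
        node = min(remaining, key=degrees.__getitem__)
        k = max(k, degrees[node])  # the peeling level never decreases
        core[node] = k
        remaining.remove(node)
        for neighbour in adjacency[node]:
            if neighbour in remaining:
                degrees[neighbour] -= 1
    return core


def k_core(core, k):
    """Nodes that belong to the k-core."""
    return {node for node, c in core.items() if c >= k}


def k_shell(core, k):
    """Nodes of the k-core that do not survive in the (k+1)-core."""
    return {node for node, c in core.items() if c == k}


def main_core(core):
    """'maximum' selection: the k-core with maximum k."""
    return k_core(core, max(core.values()))


def inflexion_core(core):
    """Assumed reading of 'inflexion': descend while the next shell is larger."""
    k = max(core.values())
    while k > 0 and len(k_shell(core, k - 1)) > len(k_shell(core, k)):
        k -= 1
    return k_core(core, k)


# Toy graph: a 4-clique (a, b, c, d), a path a-e-f-g-h-i-b, and a leaf j on e.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d"),
         ("a", "e"), ("e", "f"), ("f", "g"), ("g", "h"), ("h", "i"), ("i", "b"),
         ("e", "j")]
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

core = core_numbers(adjacency)
print(sorted(main_core(core)))       # the 4-clique
print(sorted(inflexion_core(core)))  # the clique plus the 2-shell
```

On this toy graph the main core keeps only the four clique nodes, whereas the inflexion reading also keeps the five nodes of the 2-shell, because that shell is larger than the 3-shell; it stops there because the 1-shell (the single leaf) is smaller.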

Implementation of the CoreRank method

The CoreRank method extracts keywords from a text document at the node level of a graph-of-words representation. Each node (token) in the graph-of-words is assigned a score, namely the sum of the core numbers of its neighbors. Nodes are then ranked in decreasing order of score.

This extractor supports both the selection of an automatically adaptive number of keywords and the selection of a given number or proportion of keywords.

Example usage:

from gowpy.summarization.unsupervised import CoreRankKeywordExtractor

extractor_kw_cr = CoreRankKeywordExtractor(directed=False, weighted=True, window_size=4)

preprocessed_text = "..."  # preprocessed text in which to extract keywords

extractor_kw_cr.extract(preprocessed_text, n=5)
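The CoreRank score described above (the sum of the core numbers of a node's neighbors) can be sketched in a few lines of pure Python. This is an illustration, not gowpy's implementation: the helper names are hypothetical, the graph is assumed undirected and unweighted, and ties are broken alphabetically only to keep the output deterministic.

```python
def core_numbers(adjacency):
    """Core number of each node, by repeatedly peeling a minimum-degree node."""
    degrees = {node: len(neighbours) for node, neighbours in adjacency.items()}
    remaining = set(adjacency)
    core, k = {}, 0
    while remaining:
        node = min(remaining, key=degrees.__getitem__)
        k = max(k, degrees[node])
        core[node] = k
        remaining.remove(node)
        for neighbour in adjacency[node]:
            if neighbour in remaining:
                degrees[neighbour] -= 1
    return core


def corerank_scores(adjacency):
    """Score each node by the sum of the core numbers of its neighbours."""
    core = core_numbers(adjacency)
    return {node: sum(core[n] for n in adjacency[node]) for node in adjacency}


# Toy graph: a triangle (a, b, c) with a leaf d attached to a.
adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

scores = corerank_scores(adjacency)
# Rank by decreasing score; alphabetical order breaks ties (an assumption).
ranked = sorted(scores, key=lambda node: (-scores[node], node))
top_2 = ranked[:2]
print(scores)
print(top_2)
```

Node `a` ranks first because it touches every other node, including the two other members of the 2-core. Keeping the first `n` entries of the ranking corresponds to selecting a given number of keywords.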

Graph Algorithm

Frequent Subgraphs

  • The GoWMiner can now be used to incrementally load results of more than one subgraph mining process.
  • Fix of a bug in the computation of the sparse matrix in the GoWVectorizer vectorizer.

Misc

  • Addition of example notebooks
  • Update of the documentation