Provide documentation for recently added Count/CM Sketches
import: syntax fix
gakhov committed Aug 27, 2019
1 parent c41d40c commit 5fdcf55
Showing 5 changed files with 283 additions and 0 deletions.
5 changes: 5 additions & 0 deletions README.rst
@@ -76,6 +76,11 @@ The latest documentation can be found at `<http://pdsa.readthedocs.io/en/latest/
- `Probabilistic counter (Flajolet–Martin algorithm) <http://pdsa.readthedocs.io/en/latest/cardinality/probabilistic_counter.html>`_
- `HyperLogLog <http://pdsa.readthedocs.io/en/latest/cardinality/hyperloglog.html>`_

**Frequency problem**

- `Count Sketch <http://pdsa.readthedocs.io/en/latest/frequency/count_sketch.html>`_
- `Count-Min Sketch <http://pdsa.readthedocs.io/en/latest/frequency/count_min_sketch.html>`_

**Rank problem**

- `q-digest <http://pdsa.readthedocs.io/en/latest/rank/qdigest.html>`_
134 changes: 134 additions & 0 deletions docs/frequency/count_min_sketch.rst
@@ -0,0 +1,134 @@
Count-Min Sketch
================

Count–Min Sketch is a simple space-efficient probabilistic data structure
that is used to estimate frequencies of elements in data streams and can
address the Heavy hitters problem. It was presented in 2003 [1] by
Graham Cormode and Shan Muthukrishnan and published in 2005 [2].

References
----------
[1] Cormode, G., Muthukrishnan, S.
    What's hot and what's not: Tracking most frequent items dynamically
    Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles
    of Database Systems, San Diego, California - June 09-11, 2003,
    pp. 296–306, ACM New York, NY.
    http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CormodeM-hot.pdf

[2] Cormode, G., Muthukrishnan, S.
    An Improved Data Stream Summary: The Count–Min Sketch and its Applications
    Journal of Algorithms, Vol. 55 (1), pp. 58–75.
    http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf


This implementation uses the MurmurHash3 family of hash functions,
which yields 32-bit hash values. Thus, the length of each counter array
is expected to be less than or equal to 2^{32} - 1, since
elements with indexes above this value cannot be addressed.
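
To illustrate why the counter length is bounded, the sketch below maps an
element to a position in a counter array. It uses ``zlib.crc32`` from the
standard library purely as a stand-in for a seeded 32-bit hash (it is not
MurmurHash3), and the helper name ``counter_index`` is hypothetical rather
than part of pdsa.

.. code:: python

    import zlib


    def counter_index(element: bytes, seed: int, length_of_counter: int) -> int:
        """Map an element to a slot in a counter array of the given length."""
        # zlib.crc32 returns an unsigned 32-bit value in [0, 2**32 - 1];
        # the second argument is the starting CRC and acts as a seed here.
        hash_value = zlib.crc32(element, seed)
        # Reducing modulo the length keeps the index inside the array.
        # Slots beyond the 32-bit range could never be reached by such a hash.
        return hash_value % length_of_counter


    print(counter_index(b"hello", seed=0, length_of_counter=2000))

Because the hash value never exceeds 2^{32} - 1, a longer counter array
would contain positions that no element can ever be mapped to.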


.. code:: python

    from pdsa.frequency.count_min_sketch import CountMinSketch

    # 5 counter arrays, each of length 2000
    cms = CountMinSketch(5, 2000)
    cms.add("hello")
    cms.frequency("hello")

Build a sketch
----------------

You can build a new sketch either by specifying its dimensions
(number of counter arrays and their length), or from the expected
overestimation deviation and the error probability.


Build sketch from its dimensions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: python

    from pdsa.frequency.count_min_sketch import CountMinSketch

    cms = CountMinSketch(num_of_counters=5, length_of_counter=2000)

Build sketch from the expected errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this case, the number of counter arrays and their length
are calculated from the expected overestimation (deviation)
and the requested error probability.


.. code:: python

    from pdsa.frequency.count_min_sketch import CountMinSketch

    cms = CountMinSketch.create_from_expected_error(deviation=0.000001, error=0.01)

.. note::

    The `deviation` is the error ε in answering a particular query.
    For example, if we expect 10^7 elements and allow a fixed
    overestimate of 10, the deviation is 10/10^7 = 10^{-6}.

    The `error` is the error probability δ (0 < error < 1).


.. note::

    The Count–Min Sketch is approximate and probabilistic at the same
    time; therefore two parameters, the error ε in answering a particular
    query and the error probability δ, affect the space and time
    requirements. In fact, it guarantees that the estimation
    error for frequencies will not exceed ε × n, where n is the total
    number of indexed elements, with probability at least 1 – δ.
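
The relation between the two error parameters and the dimensions can be
sketched with the formulas from the Count–Min Sketch paper [2]: a counter
length of ⌈e/ε⌉ and ⌈ln(1/δ)⌉ counter arrays. The helper ``cms_dimensions``
below is hypothetical, and whether ``create_from_expected_error`` applies
exactly these formulas or rounds the same way is an assumption.

.. code:: python

    import math


    def cms_dimensions(deviation: float, error: float) -> tuple:
        """Return (num_of_counters, length_of_counter) for the given ε and δ."""
        if deviation <= 0:
            raise ValueError("deviation must be positive")
        if not 0 < error < 1:
            raise ValueError("error must be in the open interval (0, 1)")
        # Textbook sizing; the library itself may use different constants.
        length_of_counter = math.ceil(math.e / deviation)
        num_of_counters = math.ceil(math.log(1 / error))
        return num_of_counters, length_of_counter


    print(cms_dimensions(deviation=0.000001, error=0.01))

With ``deviation=0.000001`` and ``error=0.01`` these formulas give 5 counter
arrays of length 2718282. Halving the deviation doubles the length, while
tightening the error probability only adds counter arrays logarithmically.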


Index element into the sketch
------------------------------


.. code:: python

    cms.add("hello")

.. note::

    It is possible to index any element into the sketch (internally,
    *repr()* of the Python object is used to calculate hash values for
    elements that are not integers, strings, or bytes).
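
The fallback described in the note can be pictured as follows. The helper
``to_hashable_bytes`` is hypothetical and only mirrors the described
behaviour; how pdsa actually serializes integers, strings, and bytes before
hashing is an assumption here.

.. code:: python

    def to_hashable_bytes(element) -> bytes:
        """Turn an arbitrary Python object into bytes that can be hashed."""
        if isinstance(element, bytes):
            return element
        if isinstance(element, str):
            return element.encode("utf-8")
        if isinstance(element, int):
            # Assumed encoding for illustration: the decimal text of the integer.
            return str(element).encode("utf-8")
        # Any other object falls back to its repr().
        return repr(element).encode("utf-8")


    print(to_hashable_bytes(("user", 42)))

With a *repr()*-based fallback, two objects that share the same *repr()*
would be counted as the same element, and objects whose default *repr()*
contains a memory address would not be counted consistently across runs.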


Estimate frequency of the element
---------------------------------------

.. code:: python

    print(cms.frequency("hello"))

.. warning::

    It is only an approximation of the exact frequency.
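
Why the estimate is approximate, and why a Count–Min Sketch can only
overestimate, can be seen in a toy pure-Python version. This is a minimal
sketch for illustration, not the pdsa implementation: the class name
``TinyCountMinSketch`` is hypothetical, and salted ``hashlib.blake2b`` stands
in for the MurmurHash3 family.

.. code:: python

    import hashlib
    from collections import Counter


    class TinyCountMinSketch:
        """Toy Count-Min Sketch (illustration only, not the pdsa class)."""

        def __init__(self, num_of_counters: int, length_of_counter: int):
            self.num_of_counters = num_of_counters
            self.length_of_counter = length_of_counter
            self.counters = [[0] * length_of_counter for _ in range(num_of_counters)]

        def _index(self, element: str, row: int) -> int:
            # Salting with the row number gives one hash function per row.
            digest = hashlib.blake2b(
                element.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            return int.from_bytes(digest, "little") % self.length_of_counter

        def add(self, element: str) -> None:
            # Increment one counter in every row.
            for row in range(self.num_of_counters):
                self.counters[row][self._index(element, row)] += 1

        def frequency(self, element: str) -> int:
            # Collisions only ever add to a counter, so the minimum over
            # the rows is the least inflated estimate.
            return min(
                self.counters[row][self._index(element, row)]
                for row in range(self.num_of_counters)
            )


    stream = ["a"] * 50 + ["b"] * 20 + [f"item-{i}" for i in range(200)]
    sketch = TinyCountMinSketch(num_of_counters=4, length_of_counter=32)
    for element in stream:
        sketch.add(element)

    exact = Counter(stream)
    for element in ("a", "b", "item-7"):
        print(element, exact[element], sketch.frequency(element))

The sketch is deliberately small so that collisions occur. Every estimate is
at least the exact count, and the gap shrinks as the counter arrays get longer.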



Size of the sketch in bytes
----------------------------

.. code:: python

    print(cms.sizeof())

Length of the sketch
---------------------

.. code:: python

    print(len(cms))
128 changes: 128 additions & 0 deletions docs/frequency/count_sketch.rst
@@ -0,0 +1,128 @@
Count Sketch
================

Count Sketch is a simple space-efficient probabilistic data structure
that is used to estimate frequencies of elements in data streams and can
address the Heavy hitters problem. It was proposed by Moses Charikar,
Kevin Chen, and Martin Farach-Colton in 2002 [1].

References
----------
[1] Charikar, M., Chen, K., Farach-Colton, M.
    Finding Frequent Items in Data Streams
    Proceedings of the 29th International Colloquium on Automata, Languages and
    Programming, pp. 693–703, Springer, Heidelberg.
    https://www.cs.rutgers.edu/~farach/pubs/FrequentStream.pdf


This implementation uses the MurmurHash3 family of hash functions,
which yields 32-bit hash values. Thus, the length of each counter array
is expected to be less than or equal to 2^{32} - 1, since
elements with indexes above this value cannot be addressed.


.. code:: python

    from pdsa.frequency.count_sketch import CountSketch

    # 5 counter arrays, each of length 2000
    cs = CountSketch(5, 2000)
    cs.add("hello")
    cs.frequency("hello")

Build a sketch
----------------

You can build a new sketch either by specifying its dimensions
(number of counter arrays and their length), or from the expected
overestimation deviation and the error probability.


Build sketch from its dimensions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: python

    from pdsa.frequency.count_sketch import CountSketch

    cs = CountSketch(num_of_counters=5, length_of_counter=2000)

Build sketch from the expected errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this case, the number of counter arrays and their length
are calculated from the expected overestimation (deviation)
and the requested error probability.


.. code:: python

    from pdsa.frequency.count_sketch import CountSketch

    cs = CountSketch.create_from_expected_error(deviation=0.000001, error=0.01)

.. note::

    The `deviation` is the error ε in answering a particular query.
    For example, if we expect 10^7 elements and allow a fixed
    overestimate of 10, the deviation is 10/10^7 = 10^{-6}.

    The `error` is the error probability δ (0 < error < 1).


.. note::

    The Count Sketch is approximate and probabilistic at the same
    time; therefore two parameters, the error ε in answering a particular
    query and the error probability δ, affect the space and time
    requirements. In fact, it guarantees that the estimation
    error for frequencies will not exceed ε × n, where n is the total
    number of indexed elements, with probability at least 1 – δ.


Index element into the sketch
------------------------------


.. code:: python

    cs.add("hello")

.. note::

    It is possible to index any element into the sketch (internally,
    *repr()* of the Python object is used to calculate hash values for
    elements that are not integers, strings, or bytes).


Estimate frequency of the element
---------------------------------------

.. code:: python

    print(cs.frequency("hello"))

.. warning::

    It is only an approximation of the exact frequency.
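
How a Count Sketch arrives at its estimate can be seen in a toy pure-Python
version: every row adds or subtracts 1 depending on a sign hash, and the
estimate is the median over the rows. This is a minimal sketch for
illustration, not the pdsa implementation: the class name ``TinyCountSketch``
is hypothetical, and salted ``hashlib.blake2b`` stands in for the MurmurHash3
family.

.. code:: python

    import hashlib
    import statistics


    class TinyCountSketch:
        """Toy Count Sketch (illustration only, not the pdsa class)."""

        def __init__(self, num_of_counters: int, length_of_counter: int):
            self.num_of_counters = num_of_counters
            self.length_of_counter = length_of_counter
            self.counters = [[0] * length_of_counter for _ in range(num_of_counters)]

        def _hash(self, element: str, row: int):
            # One salted digest per row supplies both the index and the sign.
            digest = hashlib.blake2b(
                element.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            value = int.from_bytes(digest, "little")
            sign = 1 if value & 1 else -1
            index = (value >> 1) % self.length_of_counter
            return index, sign

        def add(self, element: str) -> None:
            for row in range(self.num_of_counters):
                index, sign = self._hash(element, row)
                self.counters[row][index] += sign

        def frequency(self, element: str):
            # Multiplying by the sign again recovers the element's own
            # contribution; colliding elements push it up or down.
            estimates = []
            for row in range(self.num_of_counters):
                index, sign = self._hash(element, row)
                estimates.append(sign * self.counters[row][index])
            return statistics.median(estimates)


    sketch = TinyCountSketch(num_of_counters=5, length_of_counter=64)
    for element in ["a"] * 50 + ["b"] * 20 + [f"item-{i}" for i in range(200)]:
        sketch.add(element)

    print(sketch.frequency("a"), sketch.frequency("b"))

Because colliding elements carry random signs, their contributions tend to
cancel out, so the estimate may fall on either side of the exact count.
An odd number of rows keeps the median an integer.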



Size of the sketch in bytes
----------------------------

.. code:: python

    print(cs.sizeof())

Length of the sketch
---------------------

.. code:: python

    print(len(cs))
15 changes: 15 additions & 0 deletions docs/frequency/index.rst
@@ -0,0 +1,15 @@
Frequency
============

Many important problems in streaming applications that operate on large
data streams are related to estimating the frequencies of elements,
including determining the most frequent elements or detecting the trending
ones over some period of time.



.. toctree::
    :maxdepth: 2

    count_sketch
    count_min_sketch
1 change: 1 addition & 0 deletions docs/index.rst
@@ -39,5 +39,6 @@ GitHub repository: `<https://github.com/gakhov/pdsa>`_

quickstart
cardinality/index
frequency/index
membership/index
rank/index
