diff --git a/README.rst b/README.rst
index c7e55cc..607e49e 100644
--- a/README.rst
+++ b/README.rst
@@ -76,6 +76,11 @@ The latest documentation can be found at ``_
 - `HyperLogLog `_
 
+**Frequency problem**
+
+- `Count Sketch `_
+- `Count-Min Sketch `_
+
 **Rank problem**
 
 - `q-digest `_
diff --git a/docs/frequency/count_min_sketch.rst b/docs/frequency/count_min_sketch.rst
new file mode 100644
index 0000000..19ea3b4
--- /dev/null
+++ b/docs/frequency/count_min_sketch.rst
@@ -0,0 +1,134 @@
+Count-Min Sketch
+================
+
+Count–Min Sketch is a simple space-efficient probabilistic data structure
+that is used to estimate frequencies of elements in data streams and can
+address the heavy hitters problem. It was presented in 2003 [1] by
+Graham Cormode and Shan Muthukrishnan and published in 2005 [2].
+
+References
+----------
+[1] Cormode, G., Muthukrishnan, S.
+    What's hot and what's not: Tracking most frequent items dynamically
+    Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles
+    of Database Systems, San Diego, California - June 09-11, 2003,
+    pp. 296–306, ACM New York, NY.
+    http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CormodeM-hot.pdf
+[2] Cormode, G., Muthukrishnan, S.
+    An Improved Data Stream Summary: The Count–Min Sketch and its Applications
+    Journal of Algorithms, Vol. 55 (1), pp. 58–75.
+    http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
+
+
+This implementation uses the MurmurHash3 family of hash functions,
+which yield 32-bit hash values. Thus, the length of the counters
+must be less than or equal to 2^{32} - 1, since
+we cannot access elements with indexes above this value.
+
+
+.. code:: python
+
+    from pdsa.frequency.count_min_sketch import CountMinSketch
+
+    cms = CountMinSketch(5, 2000)
+    cms.add("hello")
+    cms.frequency("hello")
+
+
+
+Build a sketch
+----------------
+
+You can build a new sketch either by specifying its dimensions
+(number of counter arrays and their length), or from the expected
+overestimation deviation and standard error probability.
+
+
+Build sketch from its dimensions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code:: python
+
+    from pdsa.frequency.count_min_sketch import CountMinSketch
+
+    cms = CountMinSketch(num_of_counters=5, length_of_counter=2000)
+
+
+Build sketch from the expected errors
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In this case the number of counter arrays and their length
+will be calculated according to the expected overestimation
+and the requested error.
+
+
+.. code:: python
+
+    from pdsa.frequency.count_min_sketch import CountMinSketch
+
+    cms = CountMinSketch.create_from_expected_error(deviation=0.000001, error=0.01)
+
+
+.. note::
+
+    The `deviation` is the error ε in answering a particular query.
+    For example, if we expect 10^7 elements and allow a fixed
+    overestimate of 10, the deviation is 10/10^7 = 10^{-6}.
+
+    The `error` is the standard error δ (0 < error < 1).
+
+
+.. note::
+
+    The Count–Min Sketch is approximate and probabilistic at the same
+    time, therefore two parameters, the error ε in answering a particular
+    query and the error probability δ, affect the space and time
+    requirements. In fact, it provides the guarantee that the estimation
+    error for frequencies will not exceed ε × n (n is the number of
+    indexed elements) with probability at least 1 – δ.
+
+
+Index element into the sketch
+------------------------------
+
+
+.. code:: python
+
+    cms.add("hello")
+
+
+.. note::
+
+    Any element can be indexed into the sketch (internally,
+    *repr()* of the Python object is used to calculate hash values for
+    elements that are not integers, strings or bytes).
+
+
+Estimate frequency of the element
+---------------------------------------
+
+.. code:: python
+
+    print(cms.frequency("hello"))
+
+
+.. warning::
+
+    It is only an approximation of the exact frequency.
+
+
+
+Size of the sketch in bytes
+----------------------------
+
+.. code:: python
+
+    print(cms.sizeof())
+
+
+Length of the sketch
+---------------------
+
+.. code:: python
+
+    print(len(cms))
\ No newline at end of file
diff --git a/docs/frequency/count_sketch.rst b/docs/frequency/count_sketch.rst
new file mode 100644
index 0000000..4f077d7
--- /dev/null
+++ b/docs/frequency/count_sketch.rst
@@ -0,0 +1,128 @@
+Count Sketch
+================
+
+Count Sketch is a simple space-efficient probabilistic data structure
+that is used to estimate frequencies of elements in data streams and can
+address the heavy hitters problem. It was proposed by Moses Charikar, Kevin Chen, and Martin Farach-Colton in 2002 [1].
+
+References
+----------
+[1] Charikar, M., Chen, K., Farach-Colton, M.
+    Finding Frequent Items in Data Streams
+    Proceedings of the 29th International Colloquium on Automata, Languages and
+    Programming, pp. 693–703, Springer, Heidelberg.
+    https://www.cs.rutgers.edu/~farach/pubs/FrequentStream.pdf
+
+
+This implementation uses the MurmurHash3 family of hash functions,
+which yield 32-bit hash values. Thus, the length of the counters
+must be less than or equal to 2^{32} - 1, since
+we cannot access elements with indexes above this value.
+
+
+.. code:: python
+
+    from pdsa.frequency.count_sketch import CountSketch
+
+    cs = CountSketch(5, 2000)
+    cs.add("hello")
+    cs.frequency("hello")
+
+
+
+Build a sketch
+----------------
+
+You can build a new sketch either by specifying its dimensions
+(number of counter arrays and their length), or from the expected
+overestimation deviation and standard error probability.
+
+
+Build sketch from its dimensions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code:: python
+
+    from pdsa.frequency.count_sketch import CountSketch
+
+    cs = CountSketch(num_of_counters=5, length_of_counter=2000)
+
+
+Build sketch from the expected errors
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In this case the number of counter arrays and their length
+will be calculated according to the expected overestimation
+and the requested error.
+
+
+.. code:: python
+
+    from pdsa.frequency.count_sketch import CountSketch
+
+    cs = CountSketch.create_from_expected_error(deviation=0.000001, error=0.01)
+
+
+.. note::
+
+    The `deviation` is the error ε in answering a particular query.
+    For example, if we expect 10^7 elements and allow a fixed
+    overestimate of 10, the deviation is 10/10^7 = 10^{-6}.
+
+    The `error` is the standard error δ (0 < error < 1).
+
+
+.. note::
+
+    The Count Sketch is approximate and probabilistic at the same
+    time, therefore two parameters, the error ε in answering a particular
+    query and the error probability δ, affect the space and time
+    requirements. In fact, it provides the guarantee that the estimation
+    error for frequencies will not exceed ε × n (n is the number of
+    indexed elements) with probability at least 1 – δ.
+
+
+Index element into the sketch
+------------------------------
+
+
+.. code:: python
+
+    cs.add("hello")
+
+
+.. note::
+
+    Any element can be indexed into the sketch (internally,
+    *repr()* of the Python object is used to calculate hash values for
+    elements that are not integers, strings or bytes).
+
+
+Estimate frequency of the element
+---------------------------------------
+
+.. code:: python
+
+    print(cs.frequency("hello"))
+
+
+.. warning::
+
+    It is only an approximation of the exact frequency.
+
+
+
+Size of the sketch in bytes
+----------------------------
+
+.. code:: python
+
+    print(cs.sizeof())
+
+
+Length of the sketch
+---------------------
+
+.. code:: python
+
+    print(len(cs))
\ No newline at end of file
diff --git a/docs/frequency/index.rst b/docs/frequency/index.rst
new file mode 100644
index 0000000..7496fa8
--- /dev/null
+++ b/docs/frequency/index.rst
@@ -0,0 +1,15 @@
+Frequency
+============
+
+Many important problems in streaming applications that operate on large
+data streams are related to the estimation of the frequencies of elements,
+including determining the most frequent element or detecting the trending
+ones over some period of time.
+
+
+
+.. toctree::
+    :maxdepth: 2
+
+    count_sketch
+    count_min_sketch
diff --git a/docs/index.rst b/docs/index.rst
index e5783cb..d7106ab 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -39,5 +39,6 @@ GitHub repository: ``_
 
     quickstart
     cardinality/index
+    frequency/index
     membership/index
     rank/index
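
For readers of this patch, the ε/δ guarantee described in the Count-Min Sketch notes can be illustrated with a small pure-Python sketch. This is only an illustration under stated assumptions, not the pdsa implementation: the class name `TinyCountMinSketch` and its methods are hypothetical, `hashlib.blake2b` with a per-row salt stands in for the MurmurHash3 family, and the sizing rule (width = ⌈e/ε⌉, depth = ⌈ln(1/δ)⌉) is the textbook one from the Cormode and Muthukrishnan analysis, which may differ from what `create_from_expected_error` actually computes.

```python
import hashlib
import math


class TinyCountMinSketch:
    """Illustrative Count-Min Sketch (hypothetical, not the pdsa class)."""

    def __init__(self, num_of_counters, length_of_counter):
        self.depth = num_of_counters    # number of counter arrays
        self.width = length_of_counter  # length of each counter array
        self.rows = [[0] * self.width for _ in range(self.depth)]

    @classmethod
    def from_expected_error(cls, deviation, error):
        # Textbook sizing (an assumption about what the real library does):
        # width = ceil(e / epsilon), depth = ceil(ln(1 / delta)).
        width = math.ceil(math.e / deviation)
        depth = math.ceil(math.log(1.0 / error))
        return cls(depth, width)

    def _index(self, row, element):
        # A salted hash per row stands in for a family of hash functions.
        if isinstance(element, bytes):
            data = element
        else:
            data = repr(element).encode("utf-8")
        salt = row.to_bytes(8, "little")
        digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, element):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.rows[row][self._index(row, element)] += 1

    def frequency(self, element):
        # Collisions can only inflate a counter, so the minimum over
        # all rows is the tightest available estimate.
        return min(
            self.rows[row][self._index(row, element)]
            for row in range(self.depth)
        )


sketch = TinyCountMinSketch.from_expected_error(deviation=0.001, error=0.01)
for word in ["hello", "hello", "hello", "world"]:
    sketch.add(word)

print(sketch.depth, sketch.width)
print(sketch.frequency("hello"))
```

Because every counter touched by an element is incremented on each `add`, the estimate returned by `frequency` can never fall below the true count; it can only exceed it when other elements collide in every row.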