Misc doc changes (#268)
* wip

* wip

* wip

* wip

* wip
ronanstokes-db authored May 23, 2024
1 parent 02d529e commit b28602d
Showing 3 changed files with 132 additions and 25 deletions.
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,11 @@
## Change History
All notable changes to the Databricks Labs Data Generator will be documented in this file.

### Unreleased

#### Changed
* Updated documentation for generating text data.


### Version 0.3.6 Post 1

@@ -25,6 +30,7 @@ All notable changes to the Databricks Labs Data Generator will be documented in
* This version marks the change of the minimum supported Databricks runtime version to 10.4 LTS and later releases.
* While there are no known incompatibilities with Databricks 9.1 LTS, we will not test against this release.


### Version 0.3.5

#### Changed
26 changes: 26 additions & 0 deletions dbldatagen/column_spec_options.py
@@ -47,6 +47,9 @@ class ColumnSpecOptions(object):
:param baseColumn: Either the string name of the base column, or a list of columns to use to
control data generation. The option ``baseColumns`` is an alias for ``baseColumn``.
:param baseColumnType: Determines how the value is derived from the base column. Possible values are 'auto',
'hash', 'raw_values', 'values'
:param values: List of discrete values for the column. Discrete values for the column can be strings, numbers
    or constants conforming to the type of the column
@@ -105,6 +108,29 @@ class ColumnSpecOptions(object):
:param escapeSpecialChars: if True, require escape for all special chars in template
When a column's value is derived from the value of another column, the `baseColumn` and `baseColumnType` options
control how the value is derived. The `baseColumn` option specifies the name of the base column (or a list of
base columns); if it is not specified, the value of the new column is derived from the seed or `id` column.

The following values are permitted for the `baseColumnType` option:

- 'auto': Automatically determine the derivation based on the column type of the base column.
- 'hash': Use a hash of the base column(s) value to derive the value of the new column.
- 'raw_values': Use the raw values of the base column to derive the value of the new column.
- 'values': Use the values of the base column, scaled to the range or implied range of the new column,
  to derive the value of the new column.

The `baseColumnType` option is optional. If it is not specified, the derivation is determined automatically
based on the column type of the base column.

The derivation from `raw_values` differs from `values` in that `raw_values` uses the raw values of the base
column directly, while `values` first scales them to the range or implied range of the new column.
For example, a column with four categorical values 'A', 'B', 'C', 'D' has an implied range of 0 .. 3.
.. note::
If the `dataRange` parameter is specified as well as the `minValue`, `maxValue` or `step`,
the results are undetermined.
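A short usage sketch may clarify these options. This is a hedged illustration, not text from the library's docstring: it assumes an active `spark` session, and the column names are made up for the example.

```python
import dbldatagen as dg

df_spec = (
    dg.DataGenerator(sparkSession=spark, name="base_column_example", rows=1000)
    .withIdOutput()
    # no baseColumn given, so this is derived from the seed / `id` column
    .withColumn("code", "integer", minValue=0, maxValue=3)
    # derived from a hash of the `code` column's value
    .withColumn("code_hash", "long", baseColumn="code", baseColumnType="hash")
    # base values 0..3 map onto the implied range of the four categorical values
    .withColumn("category", "string", values=['A', 'B', 'C', 'D'],
                baseColumn="code", baseColumnType="values")
)
df = df_spec.build()
```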
125 changes: 100 additions & 25 deletions docs/source/textdata.rst
@@ -28,9 +28,9 @@ The following example illustrates generating data for specific ranges of values:
dg.DataGenerator(sparkSession=spark, name="test_data_set1", rows=100000,
partitions=4, randomSeedMethod="hash_fieldname")
.withIdOutput()
.withColumn("code3", "string", values=['online', 'offline', 'unknown'])
.withColumn("code4", "string", values=['a', 'b', 'c'], random=True, percentNulls=0.05)
.withColumn("code5", "string", values=['a', 'b', 'c'], random=True, weights=[9, 1, 1])
)
Generating text from existing values
@@ -84,7 +84,7 @@ The following example illustrates its use:
dg.DataGenerator(sparkSession=spark, name="test_data_set1", rows=100000,
partitions=4, randomSeedMethod="hash_fieldname")
.withIdOutput()
.withColumn("sample_text", "string", text=dg.ILText(paragraphs=(1, 4),
sentences=(2, 6)))
)
@@ -96,7 +96,12 @@ Using the general purpose text generator

The ``template`` attribute allows specification of templated text generation.

Here are some examples of its use to generate dummy email addresses, IP addresses, and phone numbers:
.. note::
   The ``template`` option is shorthand for ``text=dg.TemplateGenerator(template=...)``.
   The ``TemplateGenerator`` can be configured with different options covering how escapes are handled
   and how the word list is customized - see the `TemplateGenerator` documentation for more details.
.. code-block:: python
@@ -105,27 +110,25 @@ Here are some examples of its use to generate dummy email addresses, ip addresse
dg.DataGenerator(sparkSession=spark, name="test_data_set1", rows=100000,
partitions=4, randomSeedMethod="hash_fieldname")
.withIdOutput()
.withColumn("email", "string",
template=r'\w.\w@\w.com|\w@\w.co.u\k')
.withColumn("ip_addr", "string",
template=r'\n.\n.\n.\n')
.withColumn("phone", "string",
template=r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd')
# the following implements the same pattern as for `phone` but using the `TemplateGenerator` class
.withColumn("phone2", "string",
text=dg.TemplateGenerator(r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd'))
)
df = df_spec.build()
num_rows = df.count()
The implementation of the template expansion uses the underlying `TemplateGenerator` class.

TemplateGenerator options
-------------------------

The template generator generates text from a template to allow for generation of synthetic credit card numbers,
VINs, IBANs and many other structured codes.
@@ -154,9 +157,27 @@ It uses the following special chars:
W Insert a random uppercase word from the ipsum lorem word set. Always escaped
======== ======================================

In all other cases, the char itself is used.

The setting of ``escapeSpecialChars`` determines how the template generator interprets the special chars.

If set to False, which is the default, a special char does not need to be escaped to have its special
meaning, but must be escaped to be treated as a literal char.

So the template ``r"\dr_\v"`` will generate the values ``"dr_0"`` ... ``"dr_999"`` when used via the template option
and applied to the values zero to 999.
Here the character `d` is escaped to avoid interpretation as a special character.

If set to True, then the special char only has its special meaning when preceded by an escape.

So the option `text=dg.TemplateGenerator(r'dr_\v', escapeSpecialChars=True)` will generate the values
``"dr_0"`` ... ``"dr_999"`` when applied to the values zero to 999.

This conforms to earlier implementations for backwards compatibility.
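The two modes can be contrasted side by side. The following is a hedged sketch (it assumes an active `spark` session; the column names are illustrative):

```python
import dbldatagen as dg

df_spec = (
    dg.DataGenerator(sparkSession=spark, name="escape_modes", rows=1000)
    .withIdOutput()
    # default mode (escapeSpecialChars=False): `d` is special unless escaped,
    # so it is escaped here to produce the literal prefix "dr_"
    .withColumn("code1", "string", template=r'\dr_\v')
    # escapeSpecialChars=True: only escaped chars are special, so "dr_" is
    # written literally and only `\v` is interpreted
    .withColumn("code2", "string",
                text=dg.TemplateGenerator(r'dr_\v', escapeSpecialChars=True))
)
df = df_spec.build()
```

Both columns should produce the same shape of output; only the escaping convention differs.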

.. note::
If escape is used and ``escapeSpecialChars`` is False, then the following
char is assumed to have no special meaning.
@@ -165,20 +186,74 @@ It uses the following special chars:

A special case exists for ``\\v`` - if immediately followed by a digit 0 - 9, the underlying base value
is interpreted as an array of values and the nth element is retrieved where `n` is the digit specified.

The ``escapeSpecialChars`` is set to False by default for backwards compatibility.

To use the ``escapeSpecialChars`` option, use the variant
``text=dg.TemplateGenerator(template=..., escapeSpecialChars=True)``

Using a custom word list
^^^^^^^^^^^^^^^^^^^^^^^^

The template generator also allows specification of a custom word list - a list of words that can be
used in template generation. The default word list is the `ipsum lorem` word list.

While the `values` option allows for the specification of a list of categorical values, this is transmitted as part of
the generated SQL. The use of the `TemplateGenerator` object with a custom word list allows for specification of much
larger lists of possible values without the need to transmit them as part of the generated SQL.

For example, the following code snippet illustrates the use of a custom word list:

.. code-block:: python
import dbldatagen as dg
names = ['alpha', 'beta', 'gamma', 'lambda', 'theta']
df_spec = (
dg.DataGenerator(sparkSession=spark, name="test_data_set1", rows=100000,
partitions=4, randomSeedMethod="hash_fieldname")
.withIdOutput()
.withColumn("email", "string",
template=r'\w.\w@\w.com|\w@\w.co.u\k')
.withColumn("ip_addr", "string",
template=r'\n.\n.\n.\n')
.withColumn("phone", "string",
template=r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd')
# implements the same pattern as for `phone` but using the `TemplateGenerator` class
.withColumn("phone2", "string",
text=dg.TemplateGenerator(r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd'))
# uses a custom word list
.withColumn("name", "string",
text=dg.TemplateGenerator(r'\w \w|\w \w \w|\w \a. \w',
escapeSpecialChars=True,
extendedWordList=names))
)
df = df_spec.build()
display(df)
Here the `names` variable is a list of names that can be used in the template generation.

While this is a short list in this case, it could be a much larger list of names, either
specified as a literal, read from another dataframe, file or table, or produced from another source.

As this is not transmitted as part of the generated SQL, it allows for much larger lists of possible values.

Other forms of text value lookup
--------------------------------

The use of the `values` option, or the `template` option with a `TemplateGenerator` instance, allows for generation of
data when the range of possible values is known.

But what about scenarios where the list of data is read from a different table or some other form of lookup?

As the output of the data generation `build()` method is a regular PySpark DataFrame, it is possible to join the
generated data with other data sources to generate the required data.

In these cases, the generator can be specified to produce lookup keys that can be used to join with the
other data sources.
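A sketch of this pattern follows. This is a hedged illustration: it assumes an active `spark` session and a pre-existing `product_lookup` DataFrame with columns `product_id` and `product_name`, both of which are hypothetical names introduced for the example.

```python
import dbldatagen as dg

# generate transactional rows carrying only a lookup key
df_keys = (
    dg.DataGenerator(sparkSession=spark, name="lookup_keys", rows=100000)
    .withIdOutput()
    .withColumn("product_id", "integer", minValue=1, maxValue=1000, random=True)
    .withColumn("qty", "integer", minValue=1, maxValue=10, random=True)
    .build()
)

# resolve the generated keys against the externally sourced lookup table
df_final = df_keys.join(product_lookup, on="product_id", how="left")
```

Since `build()` returns a regular PySpark DataFrame, any standard join strategy (broadcast, shuffle) can be applied here.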
