Importing
Diagram accurate as of commit d9cbb1a299affed4d543eb06704b65679ab34d96, Synth version 0.6.4, January 11th 2022
Synth can take existing data in a supported format and produce from it an appropriate namespace that would generate such data. Users invoke this functionality with the synth import command.
As explained in the generating page, the structopt crate is used to parse command-line arguments. Once the command has been identified as an 'import' command, an instance of DataSourceParams is created and matched to an instance of a structure implementing ImportStrategy, based on the --from argument specified by the user.
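The matching step can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not Synth's actual code: the single describe method on ImportStrategy, the JsonStdinImportStrategy name, the strategy_for function, and the assumed scheme:rest form of the --from value are all hypothetical.

```rust
use std::path::PathBuf;

/// Stand-in for Synth's ImportStrategy trait. The single method here is an
/// assumption made for illustration, not the real trait definition.
trait ImportStrategy {
    /// Describes where this strategy would read its data from.
    fn describe(&self) -> String;
}

// Hypothetical strategy structures. Only the JSON file and Postgres names
// appear in the text; their fields are assumptions.
struct JsonFileImportStrategy {
    path: PathBuf,
}
struct JsonStdinImportStrategy;
struct PostgresImportStrategy {
    uri: String,
}

impl ImportStrategy for JsonFileImportStrategy {
    fn describe(&self) -> String {
        format!("JSON file at {}", self.path.display())
    }
}

impl ImportStrategy for JsonStdinImportStrategy {
    fn describe(&self) -> String {
        "JSON from STDIN".to_string()
    }
}

impl ImportStrategy for PostgresImportStrategy {
    fn describe(&self) -> String {
        format!("PostgreSQL database at {}", self.uri)
    }
}

/// Picks a strategy from the value passed to --from.
/// Assumed input form: "scheme:rest", with an empty "rest" meaning STDIN.
fn strategy_for(from: &str) -> Result<Box<dyn ImportStrategy>, String> {
    let (scheme, rest) = from
        .split_once(':')
        .ok_or_else(|| format!("missing scheme in '{}'", from))?;

    let strategy: Box<dyn ImportStrategy> = match scheme {
        "json" if rest.is_empty() => Box::new(JsonStdinImportStrategy),
        "json" => Box::new(JsonFileImportStrategy {
            path: PathBuf::from(rest),
        }),
        // Database strategies keep the whole URI for the driver to interpret.
        "postgres" => Box::new(PostgresImportStrategy {
            uri: from.to_string(),
        }),
        other => return Err(format!("unsupported data source '{}'", other)),
    };
    Ok(strategy)
}
```

Under these assumptions, strategy_for("json:data.json") yields the file-based strategy, while an unrecognised scheme yields an error rather than a default strategy.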
The import strategy types each take a different approach to reading data. For example, the JsonFileImportStrategy type simply reads text data from a file and parses it using serde_json, while the PostgresImportStrategy uses sqlx to interface with and read data from a PostgreSQL database. There is a unique ...ImportStrategy structure for each database integration and two for each text-based format (one for reading from a file and one for reading from STDIN).
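One way the file and STDIN variants of a text-based format could share their reading logic is sketched below. This is an assumption about structure, not a description of Synth's implementation: the function names are hypothetical, and records are kept as raw lines instead of being parsed with serde_json so that the example needs only the standard library.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

/// Shared reading logic: collects every non-blank line as one raw record,
/// as would suit a line-oriented format such as JSON Lines.
fn read_records<R: BufRead>(reader: R) -> io::Result<Vec<String>> {
    let mut records = Vec::new();
    for line in reader.lines() {
        let line = line?;
        let trimmed = line.trim();
        if !trimmed.is_empty() {
            records.push(trimmed.to_string());
        }
    }
    Ok(records)
}

/// File-backed variant (hypothetical helper).
fn read_records_from_file(path: &Path) -> io::Result<Vec<String>> {
    let file = File::open(path)?;
    read_records(BufReader::new(file))
}

/// STDIN-backed variant (hypothetical helper).
fn read_records_from_stdin() -> io::Result<Vec<String>> {
    let stdin = io::stdin();
    let records = read_records(stdin.lock());
    records
}
```

Both variants differ only in how the reader is obtained, which is why a pair of thin strategies per text format is enough in this sketch.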
Regardless of how the data (as Content instances) is obtained, the individual pieces of data next need to be merged. This is because a more accurate schema can be generated by taking many samples of the original data. Consider the following JSON Lines data:
{ "a": 42 }
{ "a": 21 }
{ "a": 1 }
{ "a": 76 }
{ "a": 100 }
Looking at the first record alone, we might assume that "a" is a constant 42. However, by considering all the records we can determine that "a" is more likely a random integer in the range of 1 to 100. Synth imports multiple records like this by first creating a collection/schema that describes the first record, and then iteratively improving that schema by considering subsequent records.
Merging is done through the MergeStrategy trait and the OptionalMergeStrategy structure. The trait describes an interface by which two types may be merged, while OptionalMergeStrategy holds the actual implementation of this process. In short, the existing schema and the value being merged are traversed in tandem, with any misalignment between them causing the schema to be updated accordingly. Considering the example data above, the initial schema would describe a constant integer 42. After merging with the second record, however, the schema would be updated to describe an integer between 21 and 42. After merging with the third record, this range would change again, to between 1 and 42. Once all input records have been considered, the resulting Synth collection schema should model the input data relatively accurately.
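The range-widening behaviour described above can be illustrated with a heavily simplified sketch. It models only a single integer field; the IntSchema type, the WidenMergeStrategy structure, the infer_schema function, and the try_merge signature are assumptions made for illustration and do not reproduce Synth's Content type or OptionalMergeStrategy.

```rust
/// Simplified schema for one integer field (not Synth's real Content type).
#[derive(Debug, Clone, PartialEq)]
enum IntSchema {
    Constant(i64),
    Range { low: i64, high: i64 },
}

/// Stand-in for the MergeStrategy trait; this signature is an assumption.
trait MergeStrategy<Schema, Value> {
    /// Updates `schema` in place so that it also describes `value`.
    fn try_merge(&self, schema: &mut Schema, value: &Value) -> Result<(), String>;
}

/// Hypothetical strategy that widens the schema to cover each new value.
struct WidenMergeStrategy;

impl MergeStrategy<IntSchema, i64> for WidenMergeStrategy {
    fn try_merge(&self, schema: &mut IntSchema, value: &i64) -> Result<(), String> {
        // Treat a constant as a range whose bounds coincide.
        let (low, high) = match *schema {
            IntSchema::Constant(c) => (c, c),
            IntSchema::Range { low, high } => (low, high),
        };
        // A value outside the current bounds is a misalignment: widen to fit.
        let (low, high) = (low.min(*value), high.max(*value));
        *schema = if low == high {
            IntSchema::Constant(low)
        } else {
            IntSchema::Range { low, high }
        };
        Ok(())
    }
}

/// Builds a schema from the first record, then merges in the remaining ones.
/// Returns None when there are no records to learn from.
fn infer_schema(records: &[i64]) -> Option<IntSchema> {
    let (first, rest) = records.split_first()?;
    let mut schema = IntSchema::Constant(*first);
    for value in rest {
        WidenMergeStrategy.try_merge(&mut schema, value).ok()?;
    }
    Some(schema)
}
```

Calling infer_schema with the values 42, 21, 1, 76 and 100 from the example data walks through the same intermediate states as the text: a constant 42, then the range 21 to 42, then 1 to 42, and finally 1 to 100.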