Constructed a spell checker in Golang powered by the very efficient bloom filters.
A Bloom filter is a probabilistic data structure that provides a space-efficient way to test whether an element is a member of a set. Bloom filters are particularly useful for applications where the amount of data is large, and memory efficiency is crucial. More about bloom filters and the math behind them can be found here. A few useful features of bloom filters are as follows:
- Bloom filters are very space efficient compared to other data structures like hash tables.
- Bloom filters are probabilistic in nature. Bloom filters can yield false positives but never false negatives. This means they can tell us if an element is definitely not in the set or if it is possibly in the set.
- We will use this feature for the spell checker. All the common english words are inserted into the bloom filter. For a new word, if it is not present in the bloom filter, we conclude that it is misspelt. Otherwise the word may be correctly spelt ('maybe' because of possible hashing colisions).
- Clone this repository and navigate to it.
git clone https://github.com/manavgondalia/Bloom-Filter-Golang.git
cd Bloom-Filter-Golang
-
Get the required third-party packages using the command
go mod tidy
-
The program is stored in
main.go
. The accepted flags are load (default 0, use 1 for loading bloom filter from disk), build (Path to dictionary file to build Bloom filter), bf (Path to Bloom filter file), fp (False probability rate), n (Number of elements to be inserted). -
Using in non-loading mode: The load flag is by default 0. Specify any other requirements as stated above and hit enter to run. A new bloom filter will be saved in words.bf file This bloom filter can later be used without the need of making a new bloom filter for existing data
-
Using in loading mode: Run the program using the command
go run main.go -load=1
This will load the bloom filter from the file and apply it on the words present intesting.txt
The number of hash functions required is calculated from the bitset size and the number of elements and is dynamic. More hash functions are generated by chaning the seed with respect to the i'th hash function.
The ability to add/load the bloom filter to/from disk saves a lot of calculation time.