Adding your own Regex
Need help? Visit our discord! discord.skerritt.blog
-
✍️ Writing and testing regex
Search for a few examples of what you want to match and head over to a tool like regex101.com, where you can write and test the regex to make sure it works as it should.
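Once the regex works in an online tester, it can help to repeat the check locally, since PyWhat itself is written in Python. The following is a minimal sketch, not part of PyWhat: it uses the HackTheBox pattern from the entry example in this guide, and the sample strings and the matches helper are hypothetical.

```python
import re

# Pattern copied from the HackTheBox entry example in this guide.
PATTERN = re.compile(r"(?i)^(hackthebox{.*}|htb{.*})$")

# Hypothetical sample strings; replace them with your own examples.
SHOULD_MATCH = ["htb{just_a_test}", "HackTheBox{another_flag}"]
SHOULD_NOT_MATCH = ["htb_just_a_test", "flag{not_htb}"]


def matches(text: str) -> bool:
    """Return True if the whole string matches the pattern."""
    return PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    for text in SHOULD_MATCH + SHOULD_NOT_MATCH:
        print(f"{text!r}: {matches(text)}")
```

Strings that look similar but must not match are worth keeping, because they are good candidates for the Invalid examples in the database entry.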
-
🗄️ Adding it to PyWhat database
When you have the regex, you can add it to our regex database, regex.json. Entry example:
{
  "Name": "HackTheBox Flag Format",
  "Regex": "(?i)^(hackthebox{.*}|htb{.*})$",
  "plural_name": false,
  "Description": "Used for Capture The Flags at https://hackthebox.eu",
  "Rarity": 1,
  "URL": null,
  "Tags": [
    "CTF Flag"
  ],
  "Examples": {
    "Valid": [
      "htb{just_a_test}"
    ],
    "Invalid": []
  }
},
Some notes on what each key means:
- Name (string): The name of the regex, e.g. IPv4 Address.
- Regex (string): The regex itself. It should be in the ^(your regex)$ format.
- plural_name (boolean): If the name has to be plural (e.g. latitude & longitude coordinates), please set this to true.
- Description (string): If the name is confusing, please add a description. For "Social Security Number", for example, the description would be a Wikipedia link explaining that it is an American thing. Please do not include a description if the name is not confusing.
- URL (string): If there is a way to analyse the text online, put it here. When PyWhat finds a match, it appends that match to the end of the URL. This way the URL opens directly on that match and users can analyse it.
For example, when there is a Bitcoin match, a direct link to see transactions to and from that address is created and shown to users:
https://www.blockchain.com/btc/address/
+ 1KFHE7w8BhaENAswwryaoccDb6qcT6DbYY
= https://www.blockchain.com/btc/address/1KFHE7w8BhaENAswwryaoccDb6qcT6DbYY
- Rarity (float between 0.0 and 1.0): How unlikely is it to be a false positive? Think about how likely it is that something completely different matches your regex. Choose 1 for very unlikely, 0 for very likely.
❗ Please place your regex in the file in the order of rarity. Rarity of 1 will go at the top, rarity of 0 will go at the bottom.
Some tips on how to pick rarity:
- 1 - contains a word that is unique to it
- 0.7 - matches to a specific pattern and characters
- 0.5 - mostly matching to specific characters
- 0.3 - pretty broad, has only a few specific characters
- 0.2 - broad, almost no specific characters
- 0 - matches to almost everything
- Tags (list of strings): A list of tags for the regex, like a group name shared by multiple regexes. Users can then filter by tag and run only the matching regexes. If we already have a similar regex, please use its tags.
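To illustrate the URL key described above, here is a minimal sketch of how a match can be appended to the URL. The function name build_analysis_url is hypothetical and this is not PyWhat's actual code; it only mirrors the Bitcoin example from this guide.

```python
from typing import Optional


def build_analysis_url(base_url: Optional[str], match: str) -> Optional[str]:
    """Append the matched text to the entry's URL, if the entry has one."""
    if base_url is None:
        # Entries with "URL": null produce no link.
        return None
    return base_url + match


if __name__ == "__main__":
    print(
        build_analysis_url(
            "https://www.blockchain.com/btc/address/",
            "1KFHE7w8BhaENAswwryaoccDb6qcT6DbYY",
        )
    )
```

Because the match is simply appended, the URL in the database entry should end exactly where the matched value belongs, as the Bitcoin URL does with its trailing slash.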
-
🐛 Testing
Every regex has to be tested, so that future modifications cannot break it unnoticed. There is another key in the database called Examples. We use it for our automated testing.
- Examples: Valid (list of strings): Contains examples that should match your regex.
- Examples: Invalid (list of strings): Contains examples that look similar but are wrong and should not be matched. Maybe you encountered some wrong examples and had to modify the regex so that it does not match them. Add those examples here!
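The idea behind the Examples key can be sketched in a few lines of Python. This is an illustration under assumptions, not PyWhat's real test suite: the check_examples helper is hypothetical, and the entry is a trimmed copy of the HackTheBox example from this guide.

```python
import re
from typing import Any, Dict, List

# Trimmed copy of the HackTheBox entry from this guide.
ENTRY: Dict[str, Any] = {
    "Name": "HackTheBox Flag Format",
    "Regex": "(?i)^(hackthebox{.*}|htb{.*})$",
    "Examples": {"Valid": ["htb{just_a_test}"], "Invalid": []},
}


def check_examples(entry: Dict[str, Any]) -> List[str]:
    """Return human-readable failures for one database entry."""
    pattern = re.compile(entry["Regex"])
    failures = []
    for text in entry["Examples"]["Valid"]:
        if pattern.search(text) is None:
            failures.append(f"{entry['Name']}: {text!r} should match")
    for text in entry["Examples"]["Invalid"]:
        if pattern.search(text) is not None:
            failures.append(f"{entry['Name']}: {text!r} should not match")
    return failures


if __name__ == "__main__":
    print(check_examples(ENTRY) or "all examples behave as expected")
```

An empty result means every Valid example matched and no Invalid example did.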
-
✅ Making sure CI checks all pass:
- Regex tests - Make sure your added tests work.
- Sorted rarity - As noted above, regexes have to be sorted by rarity in regex.json.
- Regex format - Your regex has to have the ^(your regex)$ format.
- Missing keys - All database keys noted above have to be specified for every regex.
Run the pytest command in the PyWhat root directory to test. We also use other tools, such as isort for sorting imports, mypy for type checking, and black for correct formatting. You may have to install and run some of them if a CI check fails.
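Before pushing, the database checks listed above can be approximated locally. The sketch below is an assumption about what such checks might look like, not the code CI actually runs: check_database is a hypothetical helper, the required keys are taken from the entry example in this guide, and allowing a leading inline flag such as (?i) before the ^ is a guess based on the HackTheBox entry.

```python
import re
from typing import Any, Dict, List

# Keys taken from the entry example in this guide.
REQUIRED_KEYS = {
    "Name", "Regex", "plural_name", "Description",
    "Rarity", "URL", "Tags", "Examples",
}


def check_database(entries: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    for entry in entries:
        name = entry.get("Name", "<unnamed>")

        # Missing keys: every entry must define all required keys.
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append(f"{name}: missing keys {sorted(missing)}")

        # Regex format: strip one leading inline flag group such as (?i)
        # (an assumption), then require the ^( ... )$ wrapper.
        body = re.sub(r"^\(\?[a-zA-Z]+\)", "", entry.get("Regex", ""))
        if not (body.startswith("^(") and body.endswith(")$")):
            problems.append(f"{name}: regex is not in ^(your regex)$ format")

    # Sorted rarity: rarity 1 at the top, rarity 0 at the bottom.
    rarities = [entry["Rarity"] for entry in entries if "Rarity" in entry]
    if rarities != sorted(rarities, reverse=True):
        problems.append("entries are not sorted by descending rarity")
    return problems
```

To try it on the real database, load regex.json with json.load and pass the resulting list to check_database. Running pytest remains the authoritative check.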
-
Now it is time to submit your pull request (PR)!
CI check status will be shown on the PR. Maintainers will review your PR, make suggestions if needed and merge it into PyWhat when it is ready. All CI checks need to pass for merge to happen.
PyWhat can not only identify information but also identify something within the already identified text. Let us use the international phone number as an example. Given a number like +1-202-555-0156, PyWhat identifies this text as a phone number, and it also identifies the country of that number.
This is done by using subcategories.
Here is what the phone number's regex entry looks like:
{
"Name": "Phone Number",
"Regex": "^(\\s*(?:\\+?(\\d{1,3}))?[-. (]*(\\d{3})[-. )]*(\\d{3})[-. ]*(\\d{4})(?: *x(\\d+))?\\s*)$",
"plural_name": false,
"Description": null,
"Rarity": 0.5,
"URL": null,
"Tags": [
"Identifiers",
"Credentials",
"Phone Number"
],
"Children": {
"path": "phone_codes.json",
"entry": "Location(s): ",
"method": "hashmap"
}
},
Note the "Children" part of it - it has all the information required:
- "entry" - specifies what string will appear in the description.
- "path" - path to a subcategories file that is in the Data directory. The file should look like this:
{
"+93": "Afghanistan",
"+358": "Finland",
...
}
This file contains elements in a first:second format.
- "deletion_pattern" - it is optional and specifies the regex pattern that will be applied to the identified information before searching for subcategories.
For example, the MasterCard number can be without spaces (like 5409010000000004) or with (like 5409 0100 0000 0004). By using "deletion_pattern": "\\s" the whitespace is removed, so the subcategory checks will run on a string like 5409010000000004 even if it was 5409 0100 0000 0004 before.
- "method" - can be "regex", "startswith" or "hashmap":
  - "regex" - first is a regex. If the regex matches on a string, second will be included in the description.
  - "startswith" - if the string starts with first, second will be included in the description.
  - "hashmap" - optimized version of startswith that uses hashmaps.
Take a look at regex.json for more examples.