datapatch
A Python library for defining rule-based overrides on messy data. Imagine, for example,
trying to import a dataset in which each row is associated with a country, and the
country names have been entered by humans. You might find country names like Northkorea
or Greet Britain that you want to normalise. datapatch creates a mechanism to build a
flexible lookup table (usually stored as a YAML file) to catch and repair these data issues.
Installation
You can install datapatch from the Python Package Index:
pip install datapatch
Example
Given a YAML file like this:
countries:
  normalize: true
  lowercase: true
  options:
    - match: Frankreich
      value: France
    - match:
        - Northkorea
        - Nordkorea
        - Northern Korea
        - NKorea
        - DPRK
      value: North Korea
    - contains: Britain
      value: Great Britain
The file can be used to apply the data patches against raw input:
from datapatch import read_lookups, LookupException
lookups = read_lookups("countries.yml")
countries = lookups.get("countries")
# This will apply the patch or default to the original string if none exists.
# iter_data() stands in for your own source of rows; it is not part of datapatch.
for row in iter_data():
    raw = row.get("Country")
    row["Country"] = countries.get_value(raw, default=raw)
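To make the matching rules concrete, here is a minimal stand-alone sketch of how `match` and `contains` options could be evaluated. It is not datapatch's implementation: `OPTIONS`, `clean` and this `get_value` are hypothetical names, and the first-option-wins order and the whitespace and case handling are assumptions made for illustration.

```python
from typing import Optional

# Hypothetical in-memory form of the `countries` options from the YAML file.
OPTIONS = [
    {"match": ["Frankreich"], "value": "France"},
    {
        "match": ["Northkorea", "Nordkorea", "Northern Korea", "NKorea", "DPRK"],
        "value": "North Korea",
    },
    {"contains": ["Britain"], "value": "Great Britain"},
]


def clean(text: str) -> str:
    """Assumed stand-in for `lowercase: true`: squash spaces and lowercase."""
    return " ".join(text.split()).lower()


def get_value(raw: Optional[str], default: Optional[str] = None) -> Optional[str]:
    """Return the value of the first option that applies, else `default`."""
    if raw is None:
        return default
    needle = clean(raw)
    for option in OPTIONS:
        # `match` requires the whole (cleaned) string to be equal.
        if any(clean(m) == needle for m in option.get("match", [])):
            return option["value"]
        # `contains` only requires a substring hit.
        if any(clean(c) in needle for c in option.get("contains", [])):
            return option["value"]
    return default
```

With these rules, `get_value("Greet Britain")` is repaired through the `contains` option, while an input with no applicable option falls back to `default`.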
Extended options
There's a host of options available to configure the application of the data patches:
countries:
  # If you mark a lookup as required, a value that matches no options will
  # throw a `datapatch.exc:LookupException`.
  required: true
  # Normalisation will remove many special characters, remove multiple spaces
  # and perform some basic matching across alphabets (Путин -> Putin).
  normalize: false
  options:
    - match: Francois
      value: France
  # This is a shorthand for defining options that have just one `match` and
  # one `value` defined:
  map:
    Luxemborg: Luxembourg
    Lux: Luxembourg
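The `map` shorthand and the `required` flag can be pictured with a small stand-alone sketch. This is an illustration under assumptions, not the library's code: `expand_map`, `lookup`, `CONFIG` and `NoMatchError` are hypothetical names, with `NoMatchError` standing in for `LookupException`.

```python
class NoMatchError(Exception):
    """Hypothetical stand-in for datapatch's LookupException."""


def expand_map(config: dict) -> list:
    """Combine explicit `options` with the `map` shorthand into one list."""
    options = list(config.get("options", []))
    for match, value in config.get("map", {}).items():
        # Each map entry becomes an option with one `match` and one `value`.
        options.append({"match": match, "value": value})
    return options


def lookup(config: dict, raw: str):
    """Return the matching value; raise if the lookup is required and fails."""
    for option in expand_map(config):
        if option["match"] == raw:
            return option["value"]
    if config.get("required", False):
        raise NoMatchError(f"No option matches {raw!r}")
    return None


# Hypothetical parsed form of the YAML configuration above.
CONFIG = {
    "required": True,
    "normalize": False,
    "options": [{"match": "Francois", "value": "France"}],
    "map": {"Luxemborg": "Luxembourg", "Lux": "Luxembourg"},
}
```

Here `lookup(CONFIG, "Lux")` resolves through the expanded `map` entry, and an unknown value raises because `required` is set.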
Result objects
You can also have more details associated with a result and access them:
countries:
  options:
    - match: Frankreich
      # These can be arbitrary attributes:
      label: France
      code: FR
This can be accessed as a result object with attributes:
from datapatch import read_lookups, LookupException
lookups = read_lookups("countries.yml")
countries = lookups.get("countries")
result = countries.match("Frankreich")
print(result.label, result.code)
assert result.capital is None, result.capital
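The last assertion shows that an attribute not defined on the option reads as `None` rather than raising. One way such behaviour can be achieved in plain Python is sketched below; the `Result` class here is a hypothetical illustration, not datapatch's own class.

```python
class Result:
    """Hypothetical result wrapper: option keys become readable attributes."""

    def __init__(self, **attributes):
        self._attributes = dict(attributes)

    def __getattr__(self, name):
        # __getattr__ only runs when normal attribute lookup fails.
        # Private names still raise, which avoids recursion on `_attributes`.
        if name.startswith("_"):
            raise AttributeError(name)
        # Unknown public names fall back to None instead of raising.
        return self._attributes.get(name)


result = Result(label="France", code="FR")
```

Reading `result.label` or `result.code` returns the stored values, and `result.capital` evaluates to `None`.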
License
datapatch is licensed under the terms of the MIT license, which is included as LICENSE.