Python bindings to marisa-trie (unofficial)
Project description
marisa-trie
MARISA-Trie structure for Python (2.x and 3.x). Uses marisa-trie C++ library.
MARISA-Trie is a read-only trie that is very memory efficient.
There are official SWIG-based Python bindings included in C++ library distribution; this package provides an alternative unofficial Cython-based pip-installable Python bindings.
Installation
pip install marisa-trie
Usage
Create a new trie:
>>> import marisa_trie >>> trie = marisa_trie.Trie()
Build a trie:
>>> trie.build([u'key1', u'key2', u'key12']) <marisa_trie.Trie at ...>
Check if key is in trie:
>>> u'key1' in trie True >>> u'key20' in trie False
Each key is assigned an unique ID from 0 to (n - 1), where n is the number of keys; you can use this ID to store a value in a separate structure (e.g. python list):
>>> trie.key_id(u'key2') 1
Key can be reconstructed from the ID:
>>> trie.restore_key(1) u'key2'
Find all prefixes of a given key:
>>> trie.prefixes(u'key12') [u'key1', u'key12']
There is also a generator version of .prefixes method called .iter_prefixes.
Find all keys from this trie that starts with a given prefix:
>> trie.keys(u'key1') [u'key1', u'key12']
(iterator version .iterkeys(prefix) is also available).
It is possible to save a trie to a file:
>>> with open('my_trie.marisa', 'w') as f: ... trie.write(f)
or:
>>> trie.save('my_trie_copy.marisa')
Load a trie:
>>> trie2 = marisa.Trie() >>> with open('my_trie.marisa', 'r') as f: ... trie.load(f)
or:
>>> trie2.load('my_trie.marisa')
Alternatively, you could build a trie using marisa-build command-line utility (provided by underlying C library; it should be downloaded and compiled separately) and then load it from a file.
Benchmarks
There are no dedicated benchmarks for this package yet.
My quick tests show that memory usage is quite decent. For a list of 3000000 (3 million) Russian words memory consumption with different data structures (under Python 2.7):
list(unicode words) : about 300M
Trie from datrie library: about 70M
marisa_trie.Trie: 7M
Lookup speed seems to be about 2x slower than with datrie, but I haven’t checked this with a good benchmark suite.
Contributing
Development happens at github and bitbucket:
The main issue tracker is at github: https://github.com/kmike/marisa-trie/issues
Feel free to submit ideas, bugs, pull requests (git or hg) or regular patches.
If you found a bug in a C++ part please report it to the original bug tracker.
Running tests and benchmarks
Make sure tox is installed and run
$ tox
from the source checkout. Tests should pass under python 2.6, 2.7, 3.2 and 3.3.
$ tox -c bench.ini
runs benchmarks.
License
Wrapper code is licensed under MIT License. Bundled marisa-trie C++ library is licensed under BSD license.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.