Static memory-efficient & fast Trie-like structures for Python (based on marisa-trie C++ library)
Project description
marisa-trie
Static memory-efficient Trie-like structures for Python (2.x and 3.x).
String data in a MARISA-trie may take up to 50x-100x less memory than in a standard Python dict; the raw lookup speed is comparable; trie also provides fast advanced methods like prefix search.
Based on marisa-trie C++ library.
Installation
pip install marisa-trie
Usage
There are several Trie classes in this package:
marisa_trie.Trie - read-only trie-based data structure that maps unicode keys to auto-generated unique IDs;
marisa_trie.RecordTrie - read-only trie-based data structure that maps unicode keys to lists of data tuples. All tuples must be of the same format (the data is packed using python struct module).
marisa_trie.BytesTrie - read-only Trie that maps unicode keys to lists of bytes objects.
marisa_trie.Trie
Create a new trie:
>>> import marisa_trie >>> trie = marisa_trie.Trie([u'key1', u'key2', u'key12'])
Check if key is in trie:
>>> u'key1' in trie True >>> u'key20' in trie False
Each key is assigned an unique ID from 0 to (n - 1), where n is the number of keys; you can use this ID to store a value in a separate structure (e.g. python list):
>>> trie.key_id(u'key2') 1
Key can be reconstructed from the ID:
>>> trie.restore_key(1) u'key2'
Find all prefixes of a given key:
>>> trie.prefixes(u'key12') [u'key1', u'key12']
There is also a generator version of .prefixes method called .iter_prefixes.
Find all keys from this trie that starts with a given prefix:
>> trie.keys(u'key1') [u'key1', u'key12']
(iterator version .iterkeys(prefix) is also available).
marisa_trie.RecordTrie
Create a new trie:
>>> keys = [u'foo', u'bar', u'foobar', u'foo'] >>> values = [(1, 2), (2, 1), (3, 3), (2, 1)] >>> fmt = "<HH" # a tuple with 2 short integers >>> trie = marisa_trie.RecordTrie(fmt, zip(keys, values))
Trie initial data must be an iterable of tuples (unicode_key, data_tuple). Data tuples will be converted to bytes with struct.pack(fmt, *data_tuple).
Take a look at http://docs.python.org/library/struct.html#format-strings for the format string specification.
Duplicate keys are allowed.
Check if key is in trie:
>>> u'foo' in trie True >>> u'spam' in trie False
Get a values list:
>>> trie[u'bar'] [(2, 1)] >>> trie[u'foo'] [(1, 2), (2, 1)] >>> trie.get(u'bar', 123) [(2, 1)] >>> trie.get(u'BAAR', 123) # default value 123
Find all prefixes of a given key:
>>> trie.prefixes(u'foobarz') [u'foo', u'foobar']
Find all keys from this trie that starts with a given prefix:
>> trie.keys(u'fo') [u'foo', u'foo', u'foobar']
Find all items from this trie that starts with a given prefix:
>> trie.items(u'fo') [(u'foo', (1, 2)), (u'foo', (2, 1), (u'foobar', (3, 3))]
marisa_trie.BytesTrie
BytesTrie is similar to RecordTrie, but the values are raw bytes, not tuples:
>>> keys = [u'foo', u'bar', u'foobar', u'foo'] >>> values = [b'foo-value', b'bar-value', b'foobar-value', b'foo-value2'] >>> trie = marisa_trie.BytesTrie(zip(keys, values)) >>> trie[u'bar'] [b'bar-value']
Persistence
Trie objects supports saving/loading, pickling/unpickling and memory mapped I/O.
Write trie to a stream:
>>> with open('my_trie.marisa', 'w') as f: ... trie.write(f)
Save trie to a file:
>>> trie.save('my_trie_copy.marisa')
Read trie from stream:
>>> trie2 = marisa_trie.Trie() >>> with open('my_trie.marisa', 'r') as f: ... trie.read(f)
Load trie from file:
>>> trie2.load('my_trie.marisa')
Trie objects are picklable:
>>> import pickle >>> data = pickle.dumps(trie) >>> trie3 = pickle.loads(data)
You may also build a trie using marisa-build command-line utility (provided by underlying C++ library; it should be downloaded and compiled separately) and then load the trie from the resulting file using .load() method.
Memory mapped I/O
It is possible to use memory mapped file as data source:
>>> trie = marisa_trie.RecordTrie(fmt).mmap('my_record_trie.marisa')
This way the whole dictionary won’t be loaded to memory; memory mapped I/O is an easy way to share dictionary data among processes.
Trie storage options
marisa-trie C++ library provides some configuration options for trie storage; check http://marisa-trie.googlecode.com/svn/trunk/docs/readme.en.html page (scroll down to “Enumeration Constants” section) to get an idea.
These options are exposed as order, num_tries, cache_size and binary keyword arguments for trie constructors.
For example, set order to marisa_trie.LABEL_ORDER in order to make trie functions return results in alphabetical oder:
>>> trie = marisa_trie.RecordTrie(fmt, data, order=marisa_trie.LABEL_ORDER)
Benchmarks
My quick tests show that memory usage is quite decent. For a list of 3000000 (3 million) Russian words memory consumption with different data structures (under Python 2.7):
dict(unicode words -> word lenghts): about 600M
list(unicode words) : about 300M
BaseTrie from datrie library: about 70M
marisa_trie.RecordTrie : 11M
marisa_trie.Trie: 7M
Benchmark results (100k unicode words, integer values (lenghts of the words), Python 3.2, macbook air i5 1.8 Ghz):
dict building 2.919M words/sec Trie building 0.394M words/sec BytesTrie building 0.355M words/sec RecordTrie building 0.354M words/sec dict __getitem__ (hits) 8.239M ops/sec Trie __getitem__ (hits) not supported BytesTrie __getitem__ (hits) 0.498M ops/sec RecordTrie __getitem__ (hits) 0.404M ops/sec dict get() (hits) 4.410M ops/sec Trie get() (hits) not supported BytesTrie get() (hits) 0.458M ops/sec RecordTrie get() (hits) 0.364M ops/sec dict get() (misses) 4.869M ops/sec Trie get() (misses) not supported BytesTrie get() (misses) 0.849M ops/sec RecordTrie get() (misses) 0.816M ops/sec dict __contains__ (hits) 8.053M ops/sec Trie __contains__ (hits) 1.018M ops/sec BytesTrie __contains__ (hits) 0.605M ops/sec RecordTrie __contains__ (hits) 0.618M ops/sec dict __contains__ (misses) 6.489M ops/sec Trie __contains__ (misses) 2.047M ops/sec BytesTrie __contains__ (misses) 1.079M ops/sec RecordTrie __contains__ (misses) 1.123M ops/sec dict items() 57.248 ops/sec Trie items() not supported BytesTrie items() 11.691 ops/sec RecordTrie items() 8.369 ops/sec dict keys() 217.920 ops/sec Trie keys() 19.589 ops/sec BytesTrie keys() 14.849 ops/sec RecordTrie keys() 15.369 ops/sec Trie.prefixes (hits) 0.594M ops/sec Trie.prefixes (mixed) 1.874M ops/sec Trie.prefixes (misses) 1.447M ops/sec RecordTrie.prefixes (hits) 0.103M ops/sec RecordTrie.prefixes (mixed) 0.458M ops/sec RecordTrie.prefixes (misses) 0.164M ops/sec Trie.iter_prefixes (hits) 0.588M ops/sec Trie.iter_prefixes (mixed) 1.470M ops/sec Trie.iter_prefixes (misses) 1.170M ops/sec Trie.keys(prefix="xxx"), avg_len(res)==415 5.044K ops/sec Trie.keys(prefix="xxxxx"), avg_len(res)==17 89.363K ops/sec Trie.keys(prefix="xxxxxxxx"), avg_len(res)==3 258.732K ops/sec Trie.keys(prefix="xxxxx..xx"), avg_len(res)==1.4 293.199K ops/sec Trie.keys(prefix="xxx"), NON_EXISTING 1169.524K ops/sec RecordTrie.keys(prefix="xxx"), avg_len(res)==415 3.836K ops/sec RecordTrie.keys(prefix="xxxxx"), avg_len(res)==17 73.591K ops/sec RecordTrie.keys(prefix="xxxxxxxx"), avg_len(res)==3 229.515K ops/sec RecordTrie.keys(prefix="xxxxx..xx"), avg_len(res)==1.4 269.228K ops/sec RecordTrie.keys(prefix="xxx"), NON_EXISTING 1071.433K ops/sec
Tries from marisa_trie are static and uses less memory, tries from datrie are faster and can be updated.
You may also give DAWG a try - it is usually faster than marisa-trie and sometimes can use less memory (depending on data).
Please take this benchmark results with a grain of salt; this is a very simple benchmark on a single data set.
Current limitations
The library is not tested with mingw32 compiler;
.prefixes() method of BytesTrie and RecordTrie is quite slow;
read() and write() methods don’t work with file-like objects (they work only with real files; pickling works fine for file-like objects);
iterator versions of methods are not always implemented;
there are keys() and items() methods but no values() method.
Contributions are welcome!
Contributing
Development happens at github and bitbucket:
The main issue tracker is at github: https://github.com/kmike/marisa-trie/issues
Feel free to submit ideas, bugs, pull requests (git or hg) or regular patches.
If you found a bug in a C++ part please report it to the original bug tracker.
How is source code organized
There are 4 folders in repository:
bench - benchmarks & benchmark data;
lib - original unmodified marisa-trie C++ library which is bundled for easier distribution; if something is have to be fixed in this library consider fixing it in the original repo ;
src - wrapper code; src/marisa_trie.pyx is a wrapper implementation; src/*.pxd files are Cython headers for corresponding C++ headers; src/*.cpp files are the pre-built extension code and shouldn’t be modified directly (they should be updated via update_cpp.sh script).
tests - the test suite.
Running tests and benchmarks
Make sure tox is installed and run
$ tox
from the source checkout. Tests should pass under python 2.6, 2.7, 3.2 and 3.3.
In order to run benchmarks, type
$ tox -c bench.ini
License
Wrapper code is licensed under MIT License. Bundled marisa-trie C++ library is licensed under BSD license.
CHANGES
0.4 (2013-02-28)
improved trie building: weights optional parameter;
improved trie building: unnecessary input sorting is removed;
wrapper is rebuilt with Cython 0.18;
bundled marisa-trie C++ library is updated to svn r133.
0.3.8 (2013-01-03)
Rebuild wrapper with Cython pre-0.18;
update benchmarks.
0.3.7 (2012-09-21)
Update bundled marisa-trie C++ library (this may fix more mingw issues);
Python 3.3 support is back.
0.3.6 (2012-09-05)
much faster (3x-7x) .items() and .keys() methods for all tries; faster (up to 3x) .prefixes() method for Trie.
0.3.5 (2012-08-30)
Pickling of RecordTrie is fixed (thanks lazarou for the report);
error messages should become more useful.
0.3.4 (2012-08-29)
Issues with mingw32 should be resolved (thanks Susumu Yata).
0.3.3 (2012-08-27)
.get(key, default=None) method for BytesTrie and RecordTrie;
small README improvements.
0.3.2 (2012-08-26)
Small code cleanup;
load, read and mmap methods returns ‘self’;
I can’t run tests (via tox) under Python 3.3 so it is removed from supported versions for now.
0.3.1 (2012-08-23)
.prefixes() support for RecordTrie and BytesTrie.
0.3 (2012-08-23)
RecordTrie and BytesTrie are introduced;
IntTrie class is removed (probably temporary?);
dumps/loads methods are renamed to tobytes/frombytes;
benchmark & tests improvements;
support for MARISA-trie config options is added.
0.2 (2012-08-19)
Pickling/unpickling support;
dumps/loads methods;
python 3.3 workaround;
improved tests;
benchmarks.
0.1 (2012-08-17)
Initial release.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.