This lets UserPhrasesLM consumes as much user data as possible before
bailing. This makes it more tolerant to data errors and will not fail
entirely just because the user has one faulty line in a data file.
Also removes FastFM from the benchmarking suite.
This also runs the CMake-based C++ tests as part of the GitHub CI.
We take advantage of the fact that no one is able to modify the phrase
databases shipped with the binary (guranteed by macOS's integrity check
for notarized apps), and we can simply pre-sort the phrases in the
database files.
With this change, we can speed up McBopomofo's language model loading
during the app initialization by about 500-800x on a 2018 Intel MacBook
Pro. The LM loading used to take 300-400 ms, but now it's done within a
sub-millisecond range (0.5-0.6 ms). Microbenchmarking shows that
ParselessLM is about 16000x faster than FastLM. We amortize the latency
during the query time, and even by deferring the parsing, ParselessLM is
only ~1.5x slower than FastLM, and both LM classes serve queries unedr 6
microseconds (that's 0.006 ms), which means the tradeoff only
contributes to neglible overall latency.
This PR requires some small changes to the phrase db cooking scripts.
Python 3 is now used and the (value, reading, score) tuples are
rearranged to (reading, value, score) and sorted by reading ("key"). A
header is added to the phrase databases to call out the fact that these
are pre-sorted.
clang-format is used to apply WebKit C++ style to the new code. This
also applies to KeyValueBlobReader that was added recently.
Microbenchmark result below:
```
---------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------
BM_ParselessLMOpenClose 17710 ns 17199 ns 33422
BM_FastLMOpenClose 376520248 ns 367526500 ns 2
BM_ParselessLMFindUnigrams 5967 ns 5899 ns 113729
BM_FastLMFindUnigrams 2268 ns 2265 ns 307038
```
A generic key-value blob reader, KeyValueBlobReader, is implemented to
allow more flexibility in user-editable files. For example, this allows
comments in the file, as well as tolerating leading or trailing spaces,
tabs, or even Windows CR LF line endings.
Unit tests are supplied for KeyValueBlobReader although they are not
part of the Xcode project. A separate CMakeLists.txt is provided.
UserPhrasesLM is refactored to use KeyValueBlobReader. A small stylistic
change is appiled to reduce "using namespace" uses, but otherwise no
major style changes were applied to UserPhrasesLM.
Please note that McBopomofo's user phrase LM uses the value in a
key-value pair as the reading, and the key as the actual "value". We
don't plan to change that order so that we don't have to migrate data.
std::string_view is used to allow efficient reference to char buffers
and interop with std::string (and so no c_str() is needed). C++17 is now
enabled for the project to enable the use of std::string_view.
Copyright headers are added to McBopomofoLM and UserPhrasesLM.