The "absolute order" is a compact representation, originally a
historical carry-over from VanillaInput (2004). Modern input methods
no longer need such a compact form. It is therefore now removed.
The setting only works when "SelectPhraseAfterCursorAsCandidate" is also
on. When a user uses this mode, it is very likely that he or she already
has something in the input buffer and goes back to choose a candidate.
After completing the selection, the user may want to return to the end
of the buffer and continue typing. The setting is a time saver.
Older Swift compilers do not allow declaring certain variables that
have the same names as those outside of their scope, even though the
scoping rules should allow them. This makes the code buildable with
Xcode 12.4 again.
This removes one overengineered method from BopomofoSyllable and
rewrites a helper using a simpler UTF-8 heuristic.
Also adds the CMake project file and a unit test suite.
McBopomofo allows users to input phrases whose number of characters
differs from the number of Bopomofo readings. For example, users can input ∴ with ㄙㄨㄛˇ-ㄧˇ.
When the cursor is between ㄙㄨㄛˇ and ㄧˇ, users have no clue where
the cursor exactly is. The tooltip tells users that the cursor is
now between ㄙㄨㄛˇ and ㄧˇ.
Since we use states to manage the input flow in McBopomofo, implementing this function is easy. I created a new state, the Associated Phrases state, and let the key handler emit it just after emitting a Committing state.
When the input method controller is in the Associated Phrases state, it shows the candidate window with a tooltip and only accepts candidate keys pressed together with the Shift key. The key handler uses the characters without modifiers in an NSEvent object to find a matching candidate label, so I added a new member, "inputTextIgnoringModifiers", to KeyHandlerInput.
I use KeyValueBlobReader to read the associated phrases. The data comes from the cin file in the OpenVanilla project, with the head and tail of the file removed so that it passes KeyValueBlobReader's validation.
This is so that the Installer will be built with the correct Swift
settings, especially those that instruct Xcode to package the Swift
runtime libraries. This is needed because the Installer now depends on
InputSourceHelper, which is written in Swift. Without this, the app
would not be packaged with the Swift runtime libraries, which makes
the installer unusable on older but supported macOS versions.
This lets UserPhrasesLM consume as much user data as possible before
bailing. This makes it more tolerant of data errors, so it will not fail
entirely just because the user has one faulty line in a data file.
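The skip-and-continue behavior can be sketched as follows. This is a minimal illustration, not the actual UserPhrasesLM code: the `LoadResult` type, the `loadUserPhrases` function, and the rule that a valid line has exactly two whitespace-separated fields are all assumptions made for the example.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result type: the phrases that loaded, plus a count of the
// lines that had to be skipped.
struct LoadResult {
    std::map<std::string, std::vector<std::string>> phrasesByReading;
    int skippedLines = 0;
};

// Reads "value reading" pairs, one per line (assumed format). A line that
// does not have exactly two fields is counted and skipped, and loading
// continues with the next line instead of aborting.
LoadResult loadUserPhrases(const std::string& data)
{
    LoadResult result;
    std::istringstream lines(data);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string value, reading, extra;
        if (!(fields >> value)) {
            continue; // blank line: nothing to report
        }
        if (!(fields >> reading) || (fields >> extra)) {
            ++result.skippedLines; // faulty line: skip it and keep going
            continue;
        }
        result.phrasesByReading[reading].push_back(value);
    }
    return result;
}
```

A caller could log `skippedLines` to tell the user that some entries were ignored while still serving every phrase that parsed.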
Also removes FastFM from the benchmarking suite.
This also runs the CMake-based C++ tests as part of the GitHub CI.
Now that we allow comments in the custom data files, this change writes
localized templates as well as basic instructions. Links to McBopomofo
User's Manual are also provided.
We take advantage of the fact that no one is able to modify the phrase
databases shipped with the binary (guaranteed by macOS's integrity check
for notarized apps), and we can simply pre-sort the phrases in the
database files.
With this change, we can speed up McBopomofo's language model loading
during the app initialization by about 500-800x on a 2018 Intel MacBook
Pro. The LM loading used to take 300-400 ms, but it now finishes in the
sub-millisecond range (0.5-0.6 ms). Microbenchmarking shows that
ParselessLM is about 16000x faster than FastLM. We amortize the latency
over query time, and even with the parsing deferred, ParselessLM is
only ~1.5x slower than FastLM, and both LM classes serve queries under 6
microseconds (that's 0.006 ms), which means the tradeoff adds only
negligible overall latency.
This PR requires some small changes to the phrase db cooking scripts.
Python 3 is now used and the (value, reading, score) tuples are
rearranged to (reading, value, score) and sorted by reading ("key"). A
header is added to the phrase databases to call out the fact that these
are pre-sorted.
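To illustrate why pre-sorting removes the need for an up-front parse, here is a minimal sketch of a binary search performed directly on the raw text buffer. It is not the ParselessLM implementation. The function names are hypothetical, and the sketch assumes the buffer holds only data lines (no header), each of the form "reading value score", sorted by reading in byte order.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Returns the key (the text before the first space) of a line.
std::string_view keyOf(std::string_view line)
{
    return line.substr(0, line.find(' '));
}

// Finds all lines whose key equals `key` in a buffer of newline-separated
// lines sorted by key. No index is built; the search works on byte offsets.
std::vector<std::string_view> findLines(std::string_view data, std::string_view key)
{
    std::size_t lo = 0;           // always the start of a line
    std::size_t hi = data.size(); // a line start, or the end of the buffer
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        // Back up to the start of the line that contains `mid`.
        while (mid > lo && data[mid - 1] != '\n') {
            --mid;
        }
        std::size_t end = data.find('\n', mid);
        if (end == std::string_view::npos) {
            end = data.size();
        }
        if (keyOf(data.substr(mid, end - mid)) < key) {
            lo = end < data.size() ? end + 1 : end; // continue after this line
        } else {
            hi = mid; // this line, or an earlier one, is the first match
        }
    }

    // `lo` is now the first line whose key is >= `key`; collect the matches.
    std::vector<std::string_view> result;
    while (lo < data.size()) {
        std::size_t end = data.find('\n', lo);
        if (end == std::string_view::npos) {
            end = data.size();
        }
        std::string_view line = data.substr(lo, end - lo);
        if (keyOf(line) != key) {
            break;
        }
        result.push_back(line);
        lo = end < data.size() ? end + 1 : end;
    }
    return result;
}
```

Opening the model then costs little more than obtaining the buffer, and each query pays a small amount of scanning, which matches the tradeoff described above.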
clang-format is used to apply WebKit C++ style to the new code. This
also applies to KeyValueBlobReader that was added recently.
Microbenchmark result below:
```
------------------------------------------------------------------------
Benchmark                            Time             CPU     Iterations
------------------------------------------------------------------------
BM_ParselessLMOpenClose          17710 ns        17199 ns          33422
BM_FastLMOpenClose           376520248 ns    367526500 ns              2
BM_ParselessLMFindUnigrams        5967 ns         5899 ns         113729
BM_FastLMFindUnigrams             2268 ns         2265 ns         307038
```
Since we have implemented the functions to add and exclude phrases, this
commit allows users to use a table to change the output of a phrase
without changing its BPMF reading and score, when the "phrase replacement"
mode is on.
It helps users switch between a specific input scenario and the ordinary
one. For example, a user who works with financial Chinese numerals
like 壹、貳、參 may want those characters to have a higher score
than the normal numerals 一、二、三. This commit lets users
temporarily replace 一、二、三 with 壹、貳、參 by turning on "phrase
replacement" mode and preparing a custom table.
The conversion is not done in the output phase, the way
Traditional/Simplified Chinese conversion is. Instead, the phrase
replacement table slightly modifies the language model: the replacement
takes place while walking the nodes and the candidate list.
A user can enable the mode and edit the table from the input menu. Since
the function is quite advanced, the menu items are hidden until the user
holds the option key.
The table is a plain text file. Each line contains a "from" and a "to"
entry. For example:
```
一 壹
```
However, if the user also wants all other phrases that contain 一 to use 壹,
every such phrase has to be added to the table:
```
一百 壹佰
一千 壹仟
一萬 壹萬
一百萬 壹百萬
```
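The whole-phrase matching described above can be sketched like this. It is an illustration under assumptions, not the code in this commit: the `Unigram` struct, the `ReplacementTable` alias, and `applyReplacements` are hypothetical names, and the sketch does not deduplicate values after replacement.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical unigram record: a reading, its output value, and a score.
struct Unigram {
    std::string reading;
    std::string value;
    double score;
};

// Maps a "from" phrase to its "to" phrase, one entry per table line.
using ReplacementTable = std::unordered_map<std::string, std::string>;

// Replaces the value of each unigram that exactly matches a "from" entry.
// Readings and scores are left untouched. Matching is on the whole phrase,
// so an entry for a single character does not affect longer phrases that
// merely contain that character.
std::vector<Unigram> applyReplacements(std::vector<Unigram> unigrams,
                                       const ReplacementTable& table)
{
    for (Unigram& unigram : unigrams) {
        auto it = table.find(unigram.value);
        if (it != table.end()) {
            unigram.value = it->second;
        }
    }
    return unigrams;
}
```

Because the lookup is by exact value, 一百 stays unchanged unless the table also lists it, which is why every affected phrase has to appear in the table.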
A generic key-value blob reader, KeyValueBlobReader, is implemented to
allow more flexibility in user-editable files. For example, this allows
comments in the file, as well as tolerating leading or trailing spaces,
tabs, or even Windows CR LF line endings.
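As a rough sketch of the kind of tolerance described here, the function below extracts one key-value pair from a single line. It is not the KeyValueBlobReader implementation. The `parseLine` name and the use of `#` as the comment marker are assumptions for the example, and anything after the value is simply ignored.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <utility>

// Parses one line into a (key, value) pair. Returns std::nullopt for blank
// lines, comment lines (assumed to start with '#'), and lines with a key but
// no value. Leading and trailing spaces, tabs, and a trailing '\r' from
// Windows CR LF line endings are tolerated.
std::optional<std::pair<std::string_view, std::string_view>>
parseLine(std::string_view line)
{
    constexpr std::string_view kBlank = " \t\r";

    std::size_t start = line.find_first_not_of(kBlank);
    if (start == std::string_view::npos || line[start] == '#') {
        return std::nullopt; // blank or comment
    }
    line.remove_prefix(start);

    std::size_t keyEnd = line.find_first_of(kBlank);
    if (keyEnd == std::string_view::npos) {
        return std::nullopt; // key without a value
    }
    std::string_view key = line.substr(0, keyEnd);

    std::size_t valueStart = line.find_first_not_of(kBlank, keyEnd);
    if (valueStart == std::string_view::npos) {
        return std::nullopt; // only whitespace after the key
    }
    std::string_view rest = line.substr(valueStart);
    std::string_view value = rest.substr(0, rest.find_first_of(kBlank));

    return std::make_pair(key, value);
}
```

The returned views point into the caller's buffer, so no strings are copied while reading.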
Unit tests are supplied for KeyValueBlobReader although they are not
part of the Xcode project. A separate CMakeLists.txt is provided.
UserPhrasesLM is refactored to use KeyValueBlobReader. A small stylistic
change is applied to reduce "using namespace" uses, but otherwise no
major style changes were applied to UserPhrasesLM.
Please note that McBopomofo's user phrase LM uses the value in a
key-value pair as the reading, and the key as the actual "value". We
don't plan to change that order so that we don't have to migrate data.
std::string_view is used to allow efficient reference to char buffers
and interop with std::string (and so no c_str() is needed). C++17 is now
enabled for the project to enable the use of std::string_view.
Copyright headers are added to McBopomofoLM and UserPhrasesLM.
Since there is no probability information for users' custom phrases,
they should be stored in a format that differs from data.txt. Using the
same format and FastLM to parse user phrases was done only out of
laziness, and it is not the right way.
The pull request adds a new language model class to parse user phrases.
It also updates the input method controller to adopt the new user phrase
format.
References to the global language models were stored in the class
InputMethodController. However, the global models are global and not
part of the input method controller, and the input method controller
only uses one of the models (McBopomofo/Plain Bopomofo). I guess this
somehow violates SRP, and there should be a better place for the global
models.
There was a legacy user override model that created a folder and a
plist file. If a user has used McBopomofo for years, the folder already
exists. However, when the old override model was removed, I forgot
to create the folder for the new user phrase file.
The bug left users with a new installation of McBopomofo unable
to add user phrases.