Nobody can modify the phrase databases shipped with the binary (macOS's
integrity check for notarized apps guarantees this), so we can simply
pre-sort the phrases in the database files.
With this change, McBopomofo's language model loading during app
initialization is about 500-800x faster on a 2018 Intel MacBook Pro. The
LM loading used to take 300-400 ms; it now completes in the
sub-millisecond range (0.5-0.6 ms). Microbenchmarking shows that
ParselessLM opens and closes a database about 16000x faster than FastLM.
The parsing cost is deferred and amortized at query time; even so,
ParselessLM is only ~1.5x slower than FastLM, and both LM classes serve
queries in under 6 microseconds (that's 0.006 ms), so the tradeoff adds
only negligible overall latency.
This PR requires some small changes to the phrase db cooking scripts.
Python 3 is now used and the (value, reading, score) tuples are
rearranged to (reading, value, score) and sorted by reading ("key"). A
header is added to the phrase databases to call out the fact that these
are pre-sorted.
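To illustrate why pre-sorting lets us skip parsing at load time, here is
a minimal sketch (not ParselessLM's actual code) that binary-searches a
text buffer directly. It assumes newline-terminated lines of the form
"reading value score", separated by a space and sorted by reading in
byte order; the name findFirstLine and the line layout are hypothetical.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns the first line whose key (the text before
// the first space) equals `key`, or an empty view if there is none.
// Assumes `blob` holds newline-terminated lines sorted by key in byte order.
std::string_view findFirstLine(std::string_view blob, std::string_view key)
{
    std::size_t lo = 0;           // always the start of a line
    std::size_t hi = blob.size(); // the start of a line, or the end of blob
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        // Back up to the start of the line that contains `mid`.
        std::size_t start = mid;
        while (start > lo && blob[start - 1] != '\n') {
            --start;
        }
        std::size_t end = blob.find('\n', start);
        if (end == std::string_view::npos) {
            end = blob.size();
        }
        std::string_view line = blob.substr(start, end - start);
        std::string_view lineKey = line.substr(0, line.find(' '));
        if (lineKey < key) {
            // Everything up to and including this line is too small.
            lo = (end < blob.size()) ? end + 1 : blob.size();
        } else {
            hi = start;
        }
    }
    if (lo >= blob.size()) {
        return {};
    }
    std::size_t end = blob.find('\n', lo);
    if (end == std::string_view::npos) {
        end = blob.size();
    }
    std::string_view line = blob.substr(lo, end - lo);
    return line.substr(0, line.find(' ')) == key ? line : std::string_view {};
}
```

Nothing is copied or tokenized up front: opening the database only needs
the buffer, and each query pays for the few lines it touches.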
clang-format is used to apply WebKit C++ style to the new code. This
also applies to KeyValueBlobReader that was added recently.
Microbenchmark result below:
```
---------------------------------------------------------------------
Benchmark                             Time             CPU  Iterations
---------------------------------------------------------------------
BM_ParselessLMOpenClose           17710 ns        17199 ns      33422
BM_FastLMOpenClose            376520248 ns    367526500 ns          2
BM_ParselessLMFindUnigrams         5967 ns         5899 ns     113729
BM_FastLMFindUnigrams              2268 ns         2265 ns     307038
```
Now that we have implemented the functions to add and exclude phrases,
this commit lets users use a table to change the output of a phrase
without changing its BPMF reading and score when the "phrase
replacement" mode is on.
This helps users switch between a specific input scenario and the
ordinary one. For example, a user who works with financial Chinese
numerals like 壹、貳、參 may want those characters to score as high as
the ordinary numerals 一、二、三. With this commit, the user can
temporarily replace 一、二、三 with 壹、貳、參 by preparing a custom
table and turning on "phrase replacement" mode.
Unlike our Traditional/Simplified Chinese conversion, the replacement is
not done in the output phase. Instead, the phrase replacement table
slightly modifies the language model, so the replacement takes place
when walking the nodes and building the candidate list.
A user can enable the mode and edit the table from the input menu. Since
the function is quite advanced, the menu items are hidden until the user
holds the option key.
The table is a plain text file. Each line contains a "from" phrase and a
"to" phrase. For example:
```
一 壹
```
However, if the user also wants every other phrase containing 一 to use
壹, each of those phrases has to be added to the table:
```
一百 壹佰
一千 壹仟
一萬 壹萬
一百萬 壹百萬
```
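The whole-phrase matching described above can be sketched as follows.
This is an illustration only, not the code in this commit: the Unigram
struct, the replacePhrases function, and the use of std::unordered_map
are assumptions made for the example.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical unigram record; the field names are assumed for this sketch.
struct Unigram {
    std::string reading;
    std::string value;
    double score;
};

// Replaces the value of every unigram whose whole value appears as a
// "from" entry in the table. Readings and scores are left untouched, and
// a phrase that merely contains a "from" entry is not affected.
std::vector<Unigram> replacePhrases(std::vector<Unigram> unigrams,
    const std::unordered_map<std::string, std::string>& table)
{
    for (auto& unigram : unigrams) {
        auto it = table.find(unigram.value);
        if (it != table.end()) {
            unigram.value = it->second;
        }
    }
    return unigrams;
}
```

With a table that only maps 一 to 壹, a unigram whose value is 一百 passes
through unchanged, which is why longer phrases need their own entries.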
A generic key-value blob reader, KeyValueBlobReader, is implemented to
allow more flexibility in user-editable files. For example, this allows
comments in the file, as well as tolerating leading or trailing spaces,
tabs, or even Windows CR LF line endings.
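As a rough sketch of what this kind of tolerance involves (not the actual
KeyValueBlobReader implementation), the function below splits a buffer
into key-value pairs. The name parseKeyValues, the '#' comment marker,
and the rule that the value is the rest of the line are assumptions made
for this example.

```cpp
#include <cstddef>
#include <string_view>
#include <utility>
#include <vector>

using KeyValue = std::pair<std::string_view, std::string_view>;

// Hypothetical parser: skips blank lines and lines starting with '#',
// ignores leading and trailing spaces, tabs, and CR, and treats the first
// token as the key and the rest of the line as the value. The returned
// views point into `blob`, so `blob` must outlive the result.
std::vector<KeyValue> parseKeyValues(std::string_view blob)
{
    constexpr std::string_view kBlank = " \t\r";
    std::vector<KeyValue> result;
    while (!blob.empty()) {
        std::size_t eol = blob.find('\n');
        std::string_view line = blob.substr(0, eol);
        blob.remove_prefix(eol == std::string_view::npos ? blob.size() : eol + 1);

        std::size_t keyStart = line.find_first_not_of(kBlank);
        if (keyStart == std::string_view::npos || line[keyStart] == '#') {
            continue; // blank line or comment
        }
        line.remove_prefix(keyStart);
        // The line now starts with a non-blank, so a last non-blank exists.
        line.remove_suffix(line.size() - 1 - line.find_last_not_of(kBlank));

        std::size_t keyEnd = line.find_first_of(kBlank);
        if (keyEnd == std::string_view::npos) {
            continue; // a key without a value is ignored in this sketch
        }
        std::size_t valueStart = line.find_first_not_of(kBlank, keyEnd);
        result.emplace_back(line.substr(0, keyEnd), line.substr(valueStart));
    }
    return result;
}
```

Because '\r' is treated as blank, a CR LF ending is trimmed the same way
as a trailing space.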
Unit tests are supplied for KeyValueBlobReader although they are not
part of the Xcode project. A separate CMakeLists.txt is provided.
UserPhrasesLM is refactored to use KeyValueBlobReader. A small stylistic
change is applied to reduce "using namespace" uses, but otherwise no
major style changes were applied to UserPhrasesLM.
Please note that McBopomofo's user phrase LM uses the value in a
key-value pair as the reading, and the key as the actual "value". We
don't plan to change that order so that we don't have to migrate data.
std::string_view is used to allow efficient reference to char buffers
and interop with std::string (and so no c_str() is needed). C++17 is now
enabled for the project to enable the use of std::string_view.
Copyright headers are added to McBopomofoLM and UserPhrasesLM.
Since users' custom phrases carry no probability information, they
should be stored in a format that differs from data.txt. We used the
same format and FastLM to parse user phrases only out of laziness, and
it is not the right way.
The pull request adds a new language model class to parse user phrases.
It also updates the input method controller to adopt the new user phrase
format.
The references to the global language models were stored in the
InputMethodController class. However, the models are global rather than
part of the input method controller, and the controller only uses one of
them (McBopomofo/Plain Bopomofo). I guess this somewhat violates the
single-responsibility principle (SRP), and there should be a better
place for the global models.
The legacy user override model created a folder and a plist file, so the
folder already exists for anyone who has used McBopomofo for years.
However, when the old override model was removed, I forgot to create the
folder for the new user phrase file. As a result, users with a new
installation of McBopomofo were unable to add user phrases.
Previously only the x value was used to determine the screen to which a
candidate panel should belong. That was incorrect. The entire point
(both x and y) needs to be considered.
This fixes the same issue that affected OpenVanilla:
https://github.com/openvanilla/openvanilla/issues/49
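A minimal sketch of the difference, using plain structs instead of the
AppKit types the real code works with (Point, Rect, and
screenIndexForPoint are names invented for this example):

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-ins for a point and a screen frame.
struct Point {
    double x, y;
};
struct Rect {
    double x, y, width, height;
};

// True only if both coordinates of p fall inside r.
bool contains(const Rect& r, const Point& p)
{
    return p.x >= r.x && p.x < r.x + r.width
        && p.y >= r.y && p.y < r.y + r.height;
}

// Returns the index of the first frame that contains p, or -1 if none does.
int screenIndexForPoint(const std::vector<Rect>& frames, const Point& p)
{
    for (std::size_t i = 0; i < frames.size(); ++i) {
        if (contains(frames[i], p)) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

With two screens stacked vertically, both frames cover the same x range,
so a check on x alone cannot tell them apart; the y comparison is what
picks the right one.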
We now let the Installer call the TextInputSources API. Since macOS
12, users are prompted to allow enabling of third-party IMEs in
Preferences.app the moment TISRegisterInputSource or
TISEnableInputSource is called. By moving the activation to the
Installer, a user will clearly see that it's the Installer that wants to
enable the IME.
In addition, we had to make the necessary changes so that on macOS 12
and later, the Installer always enables the default input source. This
is because kTISPropertyInputSourceIsEnabled has been observed to be
unreliable on macOS 12--it may be true even if the user has removed the
input mode from their active input mode list in Preferences.app.
This ensures that, after the Installer has killed the current input method
process, the Installer can tell if the translocated input method bundle is no
longer mounted. It turns out that getfsstat() may return cached results and a
call to statfs() is necessary.
This fixes the bug that the Installer did not always correctly report that a
new version of the input method has been installed over a previous version.
The bug only manifests when getfsstat() returns cached results. That seems to
be the case on newer versions of macOS.
This fixes a bug that, when a span covers several nodes and a long node
has already been candidate-fixed, fixing a short node does not cause
the walk to reflect the result.
A concrete example:
1. type 高中生.
2. move the cursor to 中 and change to 鐘聲: 高鐘聲.
3. with cursor position unchanged, select the candidate 忠.
4. the expected result should be 高忠生 but instead it is stuck with
高鐘聲 because the node representing "鐘聲" is still fixed.
Fixes #54
Soon notarization will be required for Developer ID apps. This change allows
the Installer to run in two modes. The "dev mode" still builds the IME as
the prerequisite of the Installer and places the IME app bundle inside the
Installer's resources folder. That has been so since the beginning of this
project, and this continues to allow IME developers to test the input method.
On the other hand, if "McBopomofo-r$rev.zip" is placed in the NotarizedArchives
folder and McBopomofo is not built as a dependency of the Installer and the
app bundle is not copied to the resources folder, the Installer then can be
built as a notarizable app (otherwise Xcode wouldn't even let you submit it
for notarization).
To build the distributable Installer, notarize the IME app first, then zip the
app as McBopomofo-r$rev.zip and place it in the NotarizedArchives folder
under Source/Installer. Then build and submit the Installer for notarization.
This is in line with Apple's guideline in
https://developer.apple.com/documentation/xcode/notarizing_your_app_before_distribution/customizing_the_notarization_workflow
("If you distribute your software via a custom third-party installer, you need
two rounds of notarization.")
We don't expect to make new Installers often, and therefore we don't intend
to automate this process via scripting.
Recent versions of Chrome started to rely on whether the composing
buffer gets updated after an arrow key event to determine whether to
dismiss (force commit) the composing buffer and handle the arrow key
event for the omnibox URL suggestions.
When Caps Lock is on and the character code is not printable, we should
simply decline to handle the character instead of absorbing it and
inserting it into the client buffer--not all apps handle those
insertions.
Using numerous NSLog's led to the discovery that when McBopomofo lost
function (as described in #86), -setValue:forTag:client: was often called
not just on the context of the foreground app, but also on the contexts
of the background apps. This led to the theory that calling keyboard
layout override in that method (not a documented way of doing things
anyway) might corrupt the input method context. That we swapped out the
language model and the builder when the method got called didn't help.
In this commit, we put back the keyboard layout override code to where
it belongs -- in -activateServer: -- and we now only swap the language
model and re-create the builder if the input method really changes (e.g.
from Bopomofo to Plain Bopomofo, or vice versa).
Similar defensive coding is also used in the function key handler in the
-handleEvent:client: method.
-tableView:objectValueForTableColumn:row: may call -layoutCandidateView,
which in turn may force the table view to reload. To break this
potential cycle, the layout code should only run after all cell values
have been provided.
This is caused by a missing method. Our implementation for
-[NSObject(IMKServerInput) inputText:key:modifiers:client:] was changed
to handleEvent: in 0.9.5.
Interestingly, calling -inputText:key:modifiers:client: somehow worked
when we linked against OS X 10.7 SDK (that was the SDK the 0.9.5
distribution used). This is no longer true with OS X 10.8 SDK.
Also fix two subtle issues:
1. Enter (not Return) key now works in candidate list
2. Cursor index should be compared against builder's length, *not*
composed string's length, because the former is counted in
code points but the latter in UTF-16 units. The composed string's
length might therefore be longer if the string contains
code points > U+FFFF, which would cause the cursor mechanism to
be off.
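The mismatch in item 2 is easy to reproduce. The sketch below is not code
from this commit; codePointCount is a name made up for the example. It
counts code points in a UTF-16 string, so the result can be compared
against the number of UTF-16 units.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-16 string. A high surrogate that is
// followed by a low surrogate counts as one code point; every other unit
// (including an unpaired surrogate) counts as one on its own.
std::size_t codePointCount(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        bool isHigh = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        bool nextIsLow = i + 1 < s.size()
            && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (isHigh && nextIsLow) {
            ++i; // skip the second half of the surrogate pair
        }
        ++count;
    }
    return count;
}
```

For a string made of U+4E2D followed by U+20000, size() reports 3 UTF-16
units while codePointCount reports 2, so a cursor index bounded by the
former could point past the end of the latter.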
This commit:
* Creates a new top-level Xcode project file
* Renames remaining Lettuce (the original codename) uses to McBopomofo
* Renames English.lproj (the old style locale name) to en.lproj
Both Left and Right can now be used as the choose-candidate key. The
candidate window also no longer obscures vertical text being typed,
because it now moves to the right of the vertical text. Because of this,
the Left key feels strange; adding the Right key should give a better
mental model.
1. bugfixes with software like Excel and TextWrangler
2. bugfixes with keystrokes like Caps Lock and Shift
3. huge revision of phrase list
4. revision to frequency calculation