Commit Graph

46 Commits

Author SHA1 Message Date
zonble 8ba4b9dfdf Prevents loading data models repeatedly. 2022-01-30 20:27:33 +08:00
zonble 5ba7365cd3 Fixes typos. 2022-01-30 08:26:32 +08:00
zonble c3d953c618 Converts input mode into a typed enum. 2022-01-30 08:06:22 +08:00
zonble 56c393cefa Avoids using global state where possible. 2022-01-27 23:19:27 +08:00
zonble 1ad9e23918 Refactors the input controller. 2022-01-27 22:54:53 +08:00
zonble 177cba5d56 [WIP] Starts to extract input states from the input controller. 2022-01-24 02:13:18 +08:00
Lukhnos Liu 202b1fa058 Also make PhraseReplacementMap more tolerant
This also clarifies the test expectations and how parsing errors are
handled.
2022-01-18 22:46:26 -08:00
Weizhong Yang a.k.a zonble 9bc3536630 Merge branch 'master' into more-tolerant-userphraseslm 2022-01-19 14:01:23 +08:00
Lukhnos Liu c8f65580bb Make UserPhrasesLM more tolerant
This lets UserPhrasesLM consume as much user data as possible before
bailing. This makes it more tolerant to data errors and will not fail
entirely just because the user has one faulty line in a data file.

Also removes FastFM from the benchmarking suite.

This also runs the CMake-based C++ tests as part of the GitHub CI.
2022-01-18 16:20:25 -08:00
Lukhnos Liu 75f321f088 Update copyright headers (fixes #213) 2022-01-18 14:21:55 -08:00
zonble a75c7b7086 Allows users to type Latin letters while using shift + letter keys.
Fixes issue #162.
2022-01-17 00:48:29 +08:00
zonble 4ec4eed562 Removes unused files. 2022-01-16 15:15:41 +08:00
zonble c4259c4c4e Updates comments and fixes a typo. 2022-01-16 15:04:20 +08:00
zonble 5c0a14deeb Refactors the function to filter and transform unigrams in McBopomofoLM. 2022-01-16 15:04:20 +08:00
zonble b627e8e3b6 Adds an option to let users choose the Chinese conversion style.
Option 0: converts the output.
Option 1: converts the models.
2022-01-16 15:04:20 +08:00
zonble b348a05735 Filters duplicated unigram values properly. 2022-01-16 15:04:18 +08:00
Lukhnos Liu d064f420e4 Use a parseless phrase db to speed up LM loading
We take advantage of the fact that no one is able to modify the phrase
databases shipped with the binary (guaranteed by macOS's integrity check
for notarized apps), and we can simply pre-sort the phrases in the
database files.

With this change, we can speed up McBopomofo's language model loading
during the app initialization by about 500-800x on a 2018 Intel MacBook
Pro. The LM loading used to take 300-400 ms, but now it's done within a
sub-millisecond range (0.5-0.6 ms). Microbenchmarking shows that
ParselessLM is about 16000x faster than FastLM. We amortize the latency
during the query time, and even by deferring the parsing, ParselessLM is
only ~1.5x slower than FastLM, and both LM classes serve queries under 6
microseconds (that's 0.006 ms), which means the tradeoff only
adds negligible overall latency.

This PR requires some small changes to the phrase db cooking scripts.
Python 3 is now used and the (value, reading, score) tuples are
rearranged to (reading, value, score) and sorted by reading ("key"). A
header is added to the phrase databases to call out the fact that these
are pre-sorted.

clang-format is used to apply WebKit C++ style to the new code. This
also applies to KeyValueBlobReader that was added recently.

Microbenchmark result below:

```
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
BM_ParselessLMOpenClose         17710 ns        17199 ns        33422
BM_FastLMOpenClose          376520248 ns    367526500 ns            2
BM_ParselessLMFindUnigrams       5967 ns         5899 ns       113729
BM_FastLMFindUnigrams            2268 ns         2265 ns       307038
```
2022-01-15 16:15:02 -08:00
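
Note on d064f420e4: the query side of a pre-sorted phrase database can be sketched as a binary search over the raw text, with no parsing pass at load time. This is only an illustration, not the actual ParselessLM code. The function names, the single-space field separator, and the use of an in-memory `std::string_view` (standing in for a memory-mapped file) are assumptions, and the header line mentioned in the commit is ignored here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical layout: one "reading value score" entry per line, with the
// lines sorted by reading in byte order.

// Returns the start offset of the line that contains position pos.
static size_t lineStart(std::string_view data, size_t pos)
{
    assert(pos <= data.size());
    while (pos > 0 && data[pos - 1] != '\n') {
        --pos;
    }
    return pos;
}

// Returns the line beginning at offset start, without its newline.
static std::string_view lineAt(std::string_view data, size_t start)
{
    size_t end = data.find('\n', start);
    if (end == std::string_view::npos) {
        end = data.size();
    }
    return data.substr(start, end - start);
}

// Returns the key (reading) of a line: everything before the first space.
static std::string_view keyOf(std::string_view line)
{
    return line.substr(0, line.find(' '));
}

// Returns every line whose key equals reading.
std::vector<std::string_view> findLines(std::string_view data, std::string_view reading)
{
    size_t lo = 0; // always the start of a line
    size_t hi = data.size(); // end of buffer, or start of a line with key >= reading
    while (lo < hi) {
        size_t mid = lineStart(data, lo + (hi - lo) / 2);
        std::string_view line = lineAt(data, mid);
        if (keyOf(line) < reading) {
            lo = std::min(mid + line.size() + 1, data.size());
        } else {
            hi = mid;
        }
    }

    // lo now points at the first candidate; collect the run of equal keys.
    std::vector<std::string_view> result;
    while (lo < data.size()) {
        std::string_view line = lineAt(data, lo);
        if (keyOf(line) != reading) {
            break;
        }
        result.push_back(line);
        lo += line.size() + 1;
    }
    return result;
}
```

Opening such a database costs almost nothing because the work is deferred to each query, which is the tradeoff the commit message describes.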
zonble 136ac34f22 Introduces in-place phrase replacement.
Since we have implemented the functions to add and exclude phrases, this
commit allows users to use a table to change the output of a phrase
without changing its BPMF reading and score when the "phrase replacement"
mode is on.

It can help users switch between a specific input scenario and the
ordinary one. For example, a user who works with financial Chinese
numerals like 壹、貳、參 may want those characters to score higher than
the normal numerals like 一、二、三. This commit lets users temporarily
replace 一、二、三 with 壹、貳、參 by turning on "phrase replacement"
mode and preparing a custom table.

The conversion is not done in the output phase, unlike the
Traditional/Simplified Chinese conversion. Instead, the phrase replacement
table slightly modifies the language model. The replacement takes place
when walking the nodes and the candidate list.

A user can enable the mode and edit the table from the input menu. Since
the function is quite advanced, the menu items are hidden until the user
holds the option key.

The table is a plain text file. Each line contains a "from" phrase and a
"to" phrase. For example:

```
一 壹
```

However, if the user also wants every other phrase containing 一 to use
壹, each of those phrases has to be added to the table:

```
一百 壹佰
一千 壹仟
一萬 壹萬
一百萬 壹百萬
```
2022-01-15 06:23:09 +08:00
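
Note on 136ac34f22: the replacement step described above can be sketched as an exact-match lookup that swaps only the value of a unigram. This is a minimal sketch under assumptions, not the code from the commit: the `Unigram` struct, `loadReplacementTable`, and `applyReplacements` are hypothetical names, and the table is assumed to use whitespace between the two fields.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the engine's unigram type.
struct Unigram {
    std::string reading;
    std::string value;
    double score;
};

using ReplacementTable = std::unordered_map<std::string, std::string>;

// Parses "from to" pairs, one per line. Lines without both fields are skipped.
ReplacementTable loadReplacementTable(const std::string& text)
{
    ReplacementTable table;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string from;
        std::string to;
        if (fields >> from >> to) {
            // A successful extraction never yields an empty token.
            assert(!from.empty() && !to.empty());
            table[from] = to;
        }
    }
    return table;
}

// Swaps the value of every unigram that has an exact entry in the table.
// Reading and score are left untouched, so the ranking does not change.
std::vector<Unigram> applyReplacements(std::vector<Unigram> unigrams, const ReplacementTable& table)
{
    for (Unigram& unigram : unigrams) {
        auto it = table.find(unigram.value);
        if (it != table.end()) {
            unigram.value = it->second;
        }
    }
    return unigrams;
}
```

Because the lookup is an exact match on the whole value, a table that lists only 一 leaves 一百 unchanged, which is why the commit message says longer phrases have to be listed separately.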
Lukhnos Liu d6cc5479f6 Use a more tolerant parser for user phrases
A generic key-value blob reader, KeyValueBlobReader, is implemented to
allow more flexibility in user-editable files. For example, this allows
comments in the file, as well as tolerating leading or trailing spaces,
tabs, or even Windows CR LF line endings.

Unit tests are supplied for KeyValueBlobReader although they are not
part of the Xcode project. A separate CMakeLists.txt is provided.

UserPhrasesLM is refactored to use KeyValueBlobReader. A small stylistic
change is applied to reduce "using namespace" uses, but otherwise no
major style changes were applied to UserPhrasesLM.

Please note that McBopomofo's user phrase LM uses the value in a
key-value pair as the reading, and the key as the actual "value". We
don't plan to change that order so that we don't have to migrate data.

std::string_view is used to allow efficient reference to char buffers
and interop with std::string (and so no c_str() is needed). C++17 is now
enabled for the project to enable the use of std::string_view.

Copyright headers are added to McBopomofoLM and UserPhrasesLM.
2022-01-13 23:27:31 -08:00
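
Note on d6cc5479f6: the kind of tolerance described above can be sketched with `std::string_view` as follows. This is not the KeyValueBlobReader implementation. `parseLine` and `trim` are hypothetical helpers, and treating `#` as the comment marker is an assumption, since the commit message does not name one.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>
#include <utility>

using KeyValue = std::pair<std::string_view, std::string_view>;

// Strips leading and trailing spaces, tabs, and carriage returns, so a
// Windows CR LF line ending leaves no stray '\r' behind.
static std::string_view trim(std::string_view s)
{
    constexpr std::string_view blanks = " \t\r";
    size_t first = s.find_first_not_of(blanks);
    if (first == std::string_view::npos) {
        return {};
    }
    size_t last = s.find_last_not_of(blanks);
    return s.substr(first, last - first + 1);
}

// Parses one line into (key, value) without copying. Returns std::nullopt
// for blank lines, comment lines (assumed to start with '#'), and lines
// that have a key but no value, so a caller can skip them and continue.
std::optional<KeyValue> parseLine(std::string_view line)
{
    line = trim(line);
    if (line.empty() || line.front() == '#') {
        return std::nullopt;
    }
    size_t keyEnd = line.find_first_of(" \t");
    if (keyEnd == std::string_view::npos) {
        return std::nullopt;
    }
    std::string_view key = line.substr(0, keyEnd);
    std::string_view value = trim(line.substr(keyEnd));
    // The line was trimmed, so a non-blank character follows the separator.
    assert(!value.empty());
    return KeyValue { key, value };
}
```

The returned views point into the caller's buffer, so the buffer has to outlive them. In this sketch everything after the first separator is kept as the value, including any inner spaces.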
zonble d590d748f8 Adds UserPhrasesLM for user phrases.
Since there is no probability information for users' custom phrases,
they should be stored in a format that differs from data.txt. Reusing the
same format and FastLM to parse user phrases was done out of laziness,
and it is not the right way.

The pull request adds a new language model class to parse user phrases.
It also updates the input method controller to adopt the new user phrase
format.
2022-01-12 16:53:51 +08:00
zonble f1e56a7e01 Lets McBopomofoLM accept NULL as the parameter in loadUserPhrases. 2022-01-12 13:17:41 +08:00
zonble 84fc2f068b Removes unused code and fixes a typo. 2022-01-12 13:16:10 +08:00
zonble abdf97f652 Adds McBopomofoLM as the facade of three language models.
- main language model
- user phrases
- user excluded phrases
2022-01-12 12:26:24 +08:00
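
Note on abdf97f652: a facade over the three sources listed above might look like the sketch below. It is an illustration only. `FacadeLM`, the map-based `Model` and `Exclusions` types, and the choice to list user phrases ahead of the main model are assumptions, not details taken from McBopomofoLM.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Unigram {
    std::string value;
    double score;
};

// Hypothetical stand-ins: reading -> unigrams, and reading -> excluded values.
using Model = std::map<std::string, std::vector<Unigram>>;
using Exclusions = std::map<std::string, std::set<std::string>>;

class FacadeLM {
public:
    FacadeLM(Model mainModel, Model userPhrases, Exclusions excluded)
        : m_mainModel(std::move(mainModel))
        , m_userPhrases(std::move(userPhrases))
        , m_excluded(std::move(excluded))
    {
    }

    // Merges user phrases and the main model for one reading, dropping
    // excluded values and values that were already emitted.
    std::vector<Unigram> unigramsForKey(const std::string& reading) const
    {
        assert(!reading.empty()); // an empty reading is a caller bug

        std::set<std::string> blocked;
        auto excluded = m_excluded.find(reading);
        if (excluded != m_excluded.end()) {
            blocked = excluded->second;
        }

        std::vector<Unigram> result;
        append(m_userPhrases, reading, blocked, result);
        append(m_mainModel, reading, blocked, result);
        return result;
    }

private:
    static void append(const Model& model, const std::string& reading,
        std::set<std::string>& blocked, std::vector<Unigram>& out)
    {
        auto it = model.find(reading);
        if (it == model.end()) {
            return;
        }
        for (const Unigram& unigram : it->second) {
            // insert() reports whether the value was new.
            if (blocked.insert(unigram.value).second) {
                out.push_back(unigram);
            }
        }
    }

    Model m_mainModel;
    Model m_userPhrases;
    Exclusions m_excluded;
};
```

Callers query a single object and never need to know which of the three sources a unigram came from.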
zonble 9b485b799c Implements excluding phrases. 2022-01-12 00:16:55 +08:00
zonble 84849bdb3d Converts the preference and non modal view controller to Swift. 2022-01-10 22:01:40 +08:00
zonble 6bdd2aab44 Fixes a bug in building the unigrams. 2022-01-09 13:00:19 -08:00
zonble b4276f0488 Fixes a bug in building the vector of unigrams from both the global language model and user phrases. 2022-01-09 13:00:19 -08:00
zonble e909dc20b5 Uses user phrases in the block builder. 2022-01-09 08:38:32 -08:00
zonble 6f761ecbcd Implements adding phrase from shift and arrow keys. 2022-01-09 08:38:32 -08:00
zonble 358462dff1 [WIP] Starts to work on the user phrases. 2022-01-09 08:38:32 -08:00
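ovadmin aeb774a8ed Slightly refactors duplicated code 2022-01-06 18:28:37 -08:00
ovadmin 3e0e859feb Integrates the user candidate-selection memory mechanism into InputMethodController 2022-01-06 18:28:37 -08:00
ovadmin a17438b67a Fixes incomplete #include directives in some candidate-selection C++ files 2022-01-06 18:28:37 -08:00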
Lukhnos Liu fa224c2657 Reset other nodes' fixed state when fixing a node
This fixes a bug that, when a span covers several nodes and a long node
has already been candidate-fixed, fixing a short node does not cause
the walk to reflect the result.

A concrete example:

1. type 高中生.
2. move the cursor to 中 and change to 鐘聲: 高鐘聲.
3. with cursor position unchanged, select the candidate to 忠.
4. the expected result should be 高忠生 but instead it is stuck with
   高鐘聲 because the node representing "鐘聲" is still fixed.

Fixes #54
2020-10-09 22:16:06 -07:00
Lukhnos Liu 71b97f82b3 Simplify candidate fixing by moving code to Grid 2020-10-09 22:16:06 -07:00
Lukhnos Liu 8058f37fff Modernize project and bump min version to 10.10
32-bit architecture support is removed as a result.
2018-11-24 21:47:15 -08:00
Lukhnos Liu b4eea515c3 Fix Span removal bug when linked against libc++ 2013-06-14 23:54:37 -07:00
Mengjuei beee34b96c Enable IBM Keyboard Layout, no update to xib yet 2012-11-13 00:40:26 -08:00
Lukhnos Liu c300e9cc10 Detab source code. 2012-10-31 22:12:50 -07:00
Lukhnos Liu e68845381c Revise DFA for parsing language models. 2012-10-31 21:55:13 -07:00
Lukhnos Liu 362801eb6c Remove SimpleLM. 2012-09-10 23:27:00 -07:00
Lukhnos Liu 67775e3ccf Implement an mmap-based LM parser. 2012-09-10 22:55:40 -07:00
Lukhnos Liu 71921b848a Use stable sort in the engine.
So that unigram nodes with the same log probability are sorted
according to the order in which they were added to the language
model.
2012-09-10 19:02:24 -07:00
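
Note on 71921b848a: the effect of a stable sort on ties can be shown with a small sketch. The `Node` struct and `sortNodes` are hypothetical, and sorting by descending log probability is an assumption about the direction. The commit message only states that ties keep insertion order.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a unigram node in the engine.
struct Node {
    std::string value;
    double logProbability;
};

// Sorts nodes so that higher log probabilities come first. Nodes with
// equal log probability keep the order in which they were added, which
// std::sort would not guarantee.
void sortNodes(std::vector<Node>& nodes)
{
    auto higherScoreFirst = [](const Node& a, const Node& b) {
        return a.logProbability > b.logProbability;
    };
    std::stable_sort(nodes.begin(), nodes.end(), higherScoreFirst);
    assert(std::is_sorted(nodes.begin(), nodes.end(), higherScoreFirst));
}
```

With plain `std::sort`, the relative order of equally scored candidates could change between runs or standard library versions, so the candidate list shown to the user might not be stable.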
Mengjuei 7476edf12a Use at most six characters to form a phrase 2011-10-18 16:06:51 -07:00
Mengjuei Hsieh 8549045ef5 Accepting 5-char phrases 2011-10-01 10:20:18 -07:00
Mengjuei Hsieh 5f976e4642 first commit 2011-09-01 23:56:26 -07:00