A while ago I wrote a program to analyse the vocabulary of my Chinese ebooks. It was very simple: it extracted the text, segmented it into words by lookup in the cedict dictionary, and ranked the words against the HSK word lists. But it was crude: it took the longest matching word at each position in the text, then moved on. It found words, but often not the intended words, or even common or sensible ones.
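In rough outline, that first greedy pass looked something like this (a sketch from memory with made-up names, not the actual code):

```javascript
// Sketch of the original greedy longest-match pass (illustrative, not the real code).
// `dict` is a Set of cedict headwords; `maxLen` is the length of the longest entry.
function greedyCut(text, dict, maxLen = 8) {
  const words = [];
  let i = 0;
  while (i < text.length) {
    let match = text[i]; // fall back to a single character
    // Try the longest candidate starting at position i, then shorter ones.
    for (let len = Math.min(maxLen, text.length - i); len > 1; len--) {
      const candidate = text.slice(i, i + len);
      if (dict.has(candidate)) {
        match = candidate;
        break;
      }
    }
    words.push(match);
    i += match.length;
  }
  return words;
}
```

Taking the longest match and moving on is exactly what makes it crude: it never reconsiders, so a long but rare word can swallow the characters that the intended words needed.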
I thought about getting word frequency data and choosing the more common words. But while searching for word frequency data (cedict doesn't have it) I came across articles on Chinese word segmentation and discovered that there are many existing programs for it.
One I kept seeing was jieba. There are many implementations. My program was written in JavaScript on node, so I looked for JavaScript implementations on npmjs.com. There were plenty, but most had not been updated for years. I tried a few and most of them didn't even install, let alone work. I assume they did when they were published, but node/npm and their other dependencies have changed so much in the meantime that they no longer install or no longer work with current versions.
One difficulty was that most of them had very little documentation, and what there was was mostly in Chinese. Not surprising for software that does Chinese word segmentation, but my Chinese isn't good enough to read it.
I tried a few packages based on cppjieba, which had been updated more recently than most. It was still a challenge to get them to work, but eventually I got one going and it seemed quite nice: good performance and good results.
But I wanted to understand the algorithm, and when I started to look through the C++ code I got lost. Too many layers of objects and abstractions, and no documentation or comments describing the data structures or algorithms.
So I looked some more and found a pure JavaScript implementation. It hadn't been updated for 7 years and didn't work, but at least it was in a familiar language. It had been left halfway through being converted from CommonJS JavaScript to ESM TypeScript, not in a working state.
I had a go at finishing the TypeScript conversion, but after a few days of a seemingly endless stream of errors I finally got it to compile, only to get runtime errors pointing into a monolithic blob of transpiled, compacted code. Life is too short to waste on such nonsense.
So I stripped out all the TypeScript and reverted to simple ESM JavaScript, and in a few minutes I had it working well enough to run the included demos.
I then was able to add logging and trace the code to understand what it was doing.
It is quite simple, actually, like most good algorithms. The implementation I had picked up was based on the Python Jieba module as it was about 12 years ago: quite different from the current version. And the JavaScript implementation wasn't ideal, but it worked, I was able to understand it, and I could then quickly rewrite the various wonky bits.
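For anyone curious, the core of that old algorithm, as I traced it, goes roughly like this: build a DAG recording, for every position in the sentence, which dictionary words start there, then walk backwards choosing at each position the word that maximizes the summed log frequency of the rest of the sentence. A sketch with my own names (`freq` is the word-count table, `logTotal` the log of the total count), not the code verbatim:

```javascript
// Build the DAG: for each index i, the list of end indices j such that
// sentence.slice(i, j) is a dictionary word. Positions with no dictionary
// word fall back to a single-character "word".
function buildDAG(sentence, freq) {
  const dag = [];
  for (let i = 0; i < sentence.length; i++) {
    const ends = [];
    for (let j = i + 1; j <= sentence.length; j++) {
      if (freq[sentence.slice(i, j)] > 0) ends.push(j);
    }
    if (ends.length === 0) ends.push(i + 1);
    dag.push(ends);
  }
  return dag;
}

// Dynamic programming over the DAG, right to left: at each position pick the
// word whose log probability plus the best score of the remaining text is largest.
function calcRoute(sentence, dag, freq, logTotal) {
  const n = sentence.length;
  const route = new Array(n + 1);
  route[n] = { score: 0, end: n };
  for (let i = n - 1; i >= 0; i--) {
    let best = { score: -Infinity, end: i + 1 };
    for (const j of dag[i]) {
      const word = sentence.slice(i, j);
      const score = Math.log(freq[word] || 1) - logTotal + route[j].score;
      if (score > best.score) best = { score, end: j };
    }
    route[i] = best;
  }
  return route; // follow route[i].end from i = 0 to read off the segmentation
}
```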
So now I have an all-JavaScript implementation that works with current node/npm, and I have updated my analysis script to use it. Speed is good: as fast as, or maybe a bit faster than, my previous simple implementation of word lookup in cedict.
I don't know where the dictionary and weights come from: whatever Python Jieba was using a decade or so ago. From manual segmentation of some corpus of books and articles, if I understand correctly.
I found a nice article by people who had used movie subtitles to develop a lexicon and word-frequency statistics for more modern Chinese usage. They have done the same for several other languages. It was actually this paper that led me to Jieba in the first place, since they had used it (cppjieba) in their research.
My revised implementation omits the DAG: it builds the segmentation route directly from a 'prefix dictionary'. The dictionary is like current Python Jieba's, but the algorithm for building the route isn't: no DAG. Much quicker.
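Roughly what that looks like (again my own names, a sketch of the idea rather than the code itself; `prefixDict` maps every word and every prefix of a word to a count, with 0 for prefixes that aren't themselves words, and `logTotal` is the log of the total count):

```javascript
// Build the route directly: no DAG array, the DP consults the prefix dictionary
// as it scans. The `break` on an unknown prefix is what keeps it quick.
function calcRoute(sentence, prefixDict, logTotal) {
  const n = sentence.length;
  const route = new Array(n + 1);
  route[n] = { score: 0, end: n };
  for (let i = n - 1; i >= 0; i--) {
    // Default: cut a single character (also the fallback for unknown words).
    let best = {
      score: Math.log(prefixDict.get(sentence[i]) || 1) - logTotal + route[i + 1].score,
      end: i + 1,
    };
    // Extend the candidate while it is still a known prefix.
    for (let j = i + 2; j <= n; j++) {
      const count = prefixDict.get(sentence.slice(i, j));
      if (count === undefined) break; // not even a prefix: no longer words start here
      if (count > 0) {
        const score = Math.log(count) - logTotal + route[j].score;
        if (score > best.score) best = { score, end: j };
      }
    }
    route[i] = best;
  }
  return route;
}

// Reading the words off the route:
function cut(sentence, prefixDict, logTotal) {
  const route = calcRoute(sentence, prefixDict, logTotal);
  const words = [];
  for (let i = 0; i < sentence.length; i = route[i].end) {
    words.push(sentence.slice(i, route[i].end));
  }
  return words;
}
```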
I only have basic cut functionality, but that's all I need for the moment.