自动机的思维导图
自动机有很多用途,不光是编译器会用到,众所周知的还有正则表达式。其他,简单的如字符串搜索,存储等,复杂点的如模型验证,定理证明……
最简单的自动机应该算是DFA了,再往简单了说,一个DFA就是一个 map<int,map<char,int>>, 其中 int 是 state id,char是转移标号;如果有人还嫌不够简单,最最简单的那就是 int [][256] 了,就是一个256列的矩阵。就这么简单的一个东西,却蕴含着非常深刻的思想,有着非常广泛的用途。
几个月前因为项目需要,再加上自己的兴趣,实现了一系列自动机相关的算法和数据结构(用C++11写的),其结果和成就感远超我当初一开始的想象和预期,因为有好几个算法我的实现远远胜于任何已有的实现,不光运行速度快得多,内存占用也小得多。今天有点闲暇,整理一个思维导图,以后有时间再慢慢逐个描述。
mind map 最好打开新窗口,再缩放着看,在这里只能看到很小的一部分。
DFA1 DFA Minimization1.1 General DFA Minimization1.1.1 Hopcroft Algorithm1.1.1.1 Partition Refinement1.1.1.2 A mutation to the splitter generator, for performance1.1.1.3 Waiting set strategy1.1.1.3.1 Stack is faster1.1.1.3.2 Queue is slower1.1.1.4 Intersections of Partitions are all empty1.1.1.5 * Efficient implementation needs double linked list * Use array index as the link1.2 Path compression This is not DFA minimization, it's just an optimization. A sequence of states which all have just 1 transition. First state of the sequence couldn't be a final state. Last state of the sequence couldn't be a confluence state.2 ADFA Acyclic DFA2.1 Acyclic DFA Minimization General Minimization Algorithm works for this case, but there are dedicated algorithm, which needs less memory, sometimes faster2.1.1 Sorted input2.1.2 Random input2.1.3 MinADFA or DAWG2.2 Don't support the mapping Use it as a SET This is a dynamic SET2.3 Identify Each Word Could be used as a MAP2.3.1 The word id is the Dictionary ordinal of the word in the DAWG The word<=>id mapping is a bidirectional mapping2.3.2 CLTK Implementaion2.3.2.1 Save a number in the State. 2.3.2.2 The number is the total pathes from the state to all reachable final state.2.3.2.3 Pros: Number update for random inserting word could be efficient enough2.3.2.4 Cons: Mapping a word to its (dictinary) ordinal is not fast enough: O(|Σ|*strlen(word))2.3.3 Febird Implementation2.3.3.1 Save a number in the Transition2.3.3.2 Let s be the source of the transition. Let c be the label char of the transition. The number saved in the transition is the total paths from s which label char is less than c.2.3.3.2.1 The unique single transition of a state was saved in the state's self, and the number is always 0, thus it could be omited2.3.3.2.2 In applications, there are about 50% states just have a single transition2.3.3.2.3 For states which has multiple transitions, the number of its first transtion is always 0, this 0 is not omited, to omit this 0, implementation code would be much complex and nasty2.3.3.3 Pros: Mapping a word to its dictionary ordinal is fast: O(strlen(word))2.3.3.4 Cons: Number update for dynamic inserting word couldn't be efficient3 Aho-Corasick Multiple Pattern Matching3.1 Combine With Numbered ADFA3.2 Word set storage in Adjacent Difference Use less memory4 Febird Implementation: * Many states have 1 single transition * Few states have a few transitions * Very few states have many transitions * Almost all states have a small range of label char4.1 struct State32 { // There are other impl uint32_t spos; char_t minch; char_t maxch; // inclusive unsigned term_bit : 1; unsigned pzip_bit : 1; }; // if minch==maxch, spos is the target state_id // else spos is the offset to targets map in mempool // the targets map consist a bitmap and targets id4.2 All states are saved in a vector * State could be deleted * Deleted states are put into freelist * When state vector is full, double its capacity4.3 Written in C++11 * Templatized, for performance & memory utilization * For concise, use lambda extensively * Use class enum, just for clear5 Abastrac Concept: map<state,map<char,state>>