Installing SRILM and Basic Usage of ngram-count

xitonga

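SRILM is a toolkit for building and analyzing statistical language models. It provides command-line tools such as ngram and ngram-count, which make it easy to train N-gram language models.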

1. Download

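I first tried downloading from http://www.speech.sri.com/projects/srilm/download.html, but it was very slow, so I switched to another site and downloaded a 1.5 release directly:

wget ftp://ftp.speech.sri.com/pub/people/stolcke/srilm/srilm-1.5.7.tar.gz

This version is not especially old; the latest version at the time of writing is 1.7.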

2. Installation

My machine is 64-bit.

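Building SRILM depends on Tcl, which can be downloaded from http://www.tcl.tk/software/tcltk/download.html (installing it is routine: unpack the archive and enter the unix directory, which contains the configure script).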

Steps to build SRILM (run from the top-level directory of the unpacked source):

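export SRILM=`pwd`
make MACHINE_TYPE=i686-m64
If the build fails with errors such as the Tcl library not being found, edit the Makefile: it has two variables, TCL_INCLUDE and TCL_LIBRARY, which can be set to, for example, -I/usr/local/include and -L/usr/local/tcl8.5 respectively.
To try out the build, go into the test directory: cd test ; make all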

That completes the build. The command-line programs are now in ./bin/i686-m64/, and I simply added that directory to PATH.
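
Adding the directory to PATH could look roughly like the sketch below. The fallback location $HOME/srilm is a hypothetical placeholder, not something from the build above; if SRILM is already exported as in the build step, that value is used instead.

```shell
# Assumption: SRILM points at the top-level source directory used for the build.
# The fallback $HOME/srilm is only a placeholder for illustration.
SRILM="${SRILM:-$HOME/srilm}"
SRILM_BIN="$SRILM/bin/i686-m64"

# Prepend the directory only if it is not already on PATH.
case ":$PATH:" in
  *":$SRILM_BIN:"*) ;;
  *) PATH="$SRILM_BIN:$PATH" ;;
esac
export PATH
```

Putting these lines in a shell startup file such as ~/.bashrc makes the change persist across sessions.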

3. Testing

Create a text file, e.g. source.txt, with any content, for example:

[root@localhost lm]# cat source.txt 
If you do want to use SRILM or are generally interested in it, please consider joining the SRILM user mailing list.
[root@localhost lm]#

Then run:

ngram-count -text source.txt -lm source.lm

This builds a statistical language model from source.txt and stores it in source.lm, as shown below:

[root@localhost lm]# cat source.lm 

\data\
ngram 1=22
ngram 2=22
ngram 3=0

\1-grams:
-1.341524	</s>
-99	<s>	-99
-1.341524	If	-99
-1.050479	SRILM	-7.440329
-1.341524	are	-99
-1.341524	consider	-99
-1.341524	do	-99
-1.341524	generally	-99
-1.341524	in	-99
-1.341524	interested	-99
-1.341524	it,	-99
-1.341524	joining	-99
-1.341524	list.	-99
-1.341524	mailing	-99
-1.341524	or	-99
-1.341524	please	-99
-1.341524	the	-99
-1.341524	to	-99
-1.341524	use	-99
-1.341524	user	-99
-1.341524	want	-99
-1.341524	you	-99

\2-grams:
0	<s> If
0	If you
-0.30103	SRILM or
-0.30103	SRILM user
0	are generally
0	consider joining
0	do want
0	generally interested
0	in it,
0	interested in
0	it, please
0	joining the
0	list. </s>
0	mailing list.
0	or are
0	please consider
0	the SRILM
0	to use
0	use SRILM
0	user mailing
0	want to
0	you do

\3-grams:

\end\
[root@localhost lm]#
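
A note on reading this file: it is in ARPA format, where the first column of each entry is a base-10 log probability. The small sketch below converts two of the 2-gram values back into plain probabilities; to_prob is a helper name made up for this illustration and is not part of SRILM.

```shell
# Convert a base-10 log probability back into a plain probability.
# to_prob is a hypothetical helper for illustration only.
to_prob() {
  awk -v lp="$1" 'BEGIN { printf "%.3f\n", exp(lp * log(10)) }'
}

p_srilm_or=$(to_prob -0.30103)   # value listed for "SRILM or"
p_if_you=$(to_prob 0)            # value listed for "If you"

echo "P(or | SRILM) = $p_srilm_or"
echo "P(you | If)   = $p_if_you"
```

This is consistent with the text: SRILM occurs twice, followed once by or and once by user, so each continuation gets probability 1/2, while If is only ever followed by you.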

To restrict the statistics to a specific set of words, create a word-list file, e.g. source.dict:

[root@localhost lm]# cat source.dict 
you
are
list
please
[root@localhost lm]#

With this file, only these four words are counted. Run:

ngram-count -text source.txt -lm source.lm -vocab source.dict

The result:

[root@localhost lm]# cat source.lm 

\data\
ngram 1=6
ngram 2=0
ngram 3=0

\1-grams:
-0.60206	</s>
-99	<s>
-0.60206	are
-7.180781	list
-0.60206	please
-0.60206	you

\2-grams:

\3-grams:

\end\
[root@localhost lm]#
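
In this output, list has a far lower value than the other words. A likely explanation, assuming the text is split on whitespace only, is that the token in source.txt is "list." (with the period), so the vocabulary word list never actually occurs. The sketch below checks which vocabulary words appear as whitespace-separated tokens; it embeds the sentence directly rather than reading source.txt so that it is self-contained.

```shell
# The sentence from source.txt, embedded so the sketch is self-contained.
text='If you do want to use SRILM or are generally interested in it, please consider joining the SRILM user mailing list.'
vocab='you are list please'

seen=''
unseen=''
for w in $vocab; do
  # One token per line, then look for an exact whole-line match.
  if printf '%s\n' "$text" | tr ' ' '\n' | grep -qxF -- "$w"; then
    seen="$seen $w"
  else
    unseen="$unseen $w"
  fi
done

echo "seen:$seen"
echo "unseen:$unseen"
```

Only list ends up in the unseen group, which matches the one outlier in the 1-gram section.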

There are no 2-grams. Modify source.dict so that a 2-gram can appear:

[root@localhost lm]# cat source.dict 
you
do
mailing
are
list
please
[root@localhost lm]#

Run ngram-count again. The result:

[root@localhost lm]# cat source.lm 

\data\
ngram 1=8
ngram 2=1
ngram 3=0

\1-grams:
-0.7781513	</s>
-99	<s>
-0.7781513	are
-0.7781513	do
-7.269613	list
-0.7781513	mailing
-0.7781513	please
-0.7781513	you	-99

\2-grams:
0	you do

\3-grams:

\end\
[root@localhost lm]#

Now "you do" appears as a 2-gram; its entry gives the (base-10 log) probability of do occurring after you.
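
Why the first word list produced no 2-grams and the second produced exactly one can be checked with a rough sketch like the following. It lists adjacent token pairs in which both words are in the vocabulary. vocab_pairs is a hypothetical helper, the sentence is embedded instead of read from source.txt, and the sentence-boundary markers <s> and </s> are ignored for simplicity.

```shell
# The sentence from source.txt, embedded so the sketch is self-contained.
text='If you do want to use SRILM or are generally interested in it, please consider joining the SRILM user mailing list.'

# Print every adjacent pair of tokens where both tokens are in the vocabulary.
# usage: vocab_pairs "word1 word2 ..."   (hypothetical helper, not part of SRILM)
vocab_pairs() {
  printf '%s\n' "$text" | awk -v vocab="$1" '
    BEGIN {
      n = split(vocab, v, " ")
      for (i = 1; i <= n; i++) in_vocab[v[i]] = 1
    }
    {
      for (i = 1; i < NF; i++)
        if (($i in in_vocab) && ($(i + 1) in in_vocab))
          print $i, $(i + 1)
    }'
}

first=$(vocab_pairs 'you are list please')
second=$(vocab_pairs 'you do mailing are list please')

echo "first vocabulary:  ${first:-none}"
echo "second vocabulary: ${second:-none}"
```

With the first list no two vocabulary words are adjacent in the text. With the second, "you do" is the only such pair ("mailing list." does not qualify because the token is "list." rather than "list"). Since you occurs once and is followed by do, the relative frequency is 1, which is consistent with the log probability of 0 in the model file.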
