A Software Engineer Blog: A Sample Using the tm Package [R]

Wednesday, September 3, 2014

A Sample Using the tm Package [R]

A sample that uses the tm package in R.

First, create a corpus, then build a TermDocumentMatrix from it.

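R> library('tm')
R> # Read source.txt; each line becomes one element of the character vector
R> con <- file("source.txt", open = "rt")
R> text <- readLines(con)
R> close(con)
R> # Each element of the vector becomes one document in the corpus
R> corpus <- Corpus(VectorSource(text))
R> tdm <- TermDocumentMatrix(corpus)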

The source.txt file being read contains the following sample sentences:

This is a first line.
This is a second line.
This is a third line.
The forth line is being written.
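To reproduce the example without preparing the file by hand, the sample sentences can be written out from R first. This is only a sketch: the temporary directory and the names `sample_lines` and `path` are assumptions for illustration, whereas the post itself reads source.txt from the working directory.

```r
# The four sample sentences from the post (the spelling "forth" is kept
# so that the terms match the output shown in the post)
sample_lines <- c(
  "This is a first line.",
  "This is a second line.",
  "This is a third line.",
  "The forth line is being written."
)

# Assumed location: a file in the session's temporary directory
path <- file.path(tempdir(), "source.txt")
writeLines(sample_lines, path)

# Read it back; readLines() opens and closes the file when given a path
text <- readLines(path)
print(text)
```

The resulting `text` vector can then be passed to `VectorSource()` in the same way as in the post.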

Let's try a few operations on the resulting TermDocumentMatrix.

inspect()

Displays every term that appears in each document (in this example, one line corresponds to one document).

R> inspect(tdm)
A term-document matrix (10 terms, 4 documents)

Non-/sparse entries: 14/26
Sparsity           : 65%
Maximal term length: 8
Weighting          : term frequency (tf)

          Docs
Terms      1 2 3 4
  being    0 0 0 1
  first    1 0 0 0
  forth    0 0 0 1
  line     0 0 0 1
  line.    1 1 1 0
  second   0 1 0 0
  the      0 0 0 1
  third    0 0 1 0
  this     1 1 1 0
  written. 0 0 0 1

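For example, the first line was "This is a first line.", so under Docs 1 the terms first, line., and this are each counted once. Punctuation is not removed here, so "line." and "line" are separate terms. "is" and "a" do not appear because, by default, tm drops words shorter than three characters.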
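The counts above can be approximated in base R, which makes it easier to see where each number comes from. This is a rough sketch and not tm's actual implementation: it assumes lowercasing, splitting on whitespace, and a minimum word length of three characters, and the names `tokens`, `terms`, and `m` are made up for illustration.

```r
lines <- c(
  "This is a first line.",
  "This is a second line.",
  "This is a third line.",
  "The forth line is being written."
)

# Lowercase each line and split it on whitespace (punctuation is kept)
tokens <- strsplit(tolower(lines), "[[:space:]]+")

# Drop words shorter than three characters, such as "is" and "a"
tokens <- lapply(tokens, function(w) w[nchar(w) >= 3])

# Vocabulary: every distinct remaining term
terms <- sort(unique(unlist(tokens)))

# One column per document, one row per term, each cell a count
m <- sapply(tokens, function(w) tabulate(match(w, terms), nbins = length(terms)))
dimnames(m) <- list(Terms = terms, Docs = as.character(seq_along(lines)))

print(m)
```

With the sample sentences this gives 10 terms across 4 documents and 14 non-zero cells, the same figures reported by inspect().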

findFreqTerms

Extracts the terms that appear at least 3 times across all documents.

R> findFreqTerms(tdm,3)
[1] "line." "this"
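The same result can be checked by hand by summing each term's row and keeping the rows that reach the threshold. The sketch below uses a plain base R matrix typed in from the inspect() output rather than the tm object itself; the variable names are assumptions for illustration.

```r
terms <- c("being", "first", "forth", "line", "line.",
           "second", "the", "third", "this", "written.")

# Counts copied from the inspect() output: one row per term, one column per document
m <- matrix(
  c(0, 0, 0, 1,
    1, 0, 0, 0,
    0, 0, 0, 1,
    0, 0, 0, 1,
    1, 1, 1, 0,
    0, 1, 0, 0,
    0, 0, 0, 1,
    0, 0, 1, 0,
    1, 1, 1, 0,
    0, 0, 0, 1),
  nrow = length(terms), byrow = TRUE,
  dimnames = list(Terms = terms, Docs = as.character(1:4))
)

# Total occurrences of each term over all documents
freq <- rowSums(m)

# Keep the terms whose total is at least 3
frequent <- names(freq[freq >= 3])
print(frequent)
```

Only "line." and "this" occur in three documents, so only those two pass the threshold.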

findAssocs

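Extracts the terms that co-occur with the specified word. The third argument is the lower limit on the correlation: only terms whose correlation with the word is at least this value are returned.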

R> findAssocs(tdm,"this",0.1)
 line.  first second  third
  1.00   0.33   0.33   0.33
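These values are consistent with the Pearson correlation between the rows of the matrix. The sketch below recomputes them with base R's cor() on a matrix typed in from the inspect() output. It is an illustration of the calculation, not tm's own code, and the variable names are assumptions.

```r
terms <- c("being", "first", "forth", "line", "line.",
           "second", "the", "third", "this", "written.")

# Counts copied from the inspect() output: one row per term, one column per document
m <- matrix(
  c(0, 0, 0, 1,
    1, 0, 0, 0,
    0, 0, 0, 1,
    0, 0, 0, 1,
    1, 1, 1, 0,
    0, 1, 0, 0,
    0, 0, 0, 1,
    0, 0, 1, 0,
    1, 1, 1, 0,
    0, 0, 0, 1),
  nrow = length(terms), byrow = TRUE,
  dimnames = list(Terms = terms, Docs = as.character(1:4))
)

# cor() correlates columns, so transpose to put terms in the columns,
# then take the row of correlations against "this"
cors <- cor(t(m))["this", ]

# Drop the word itself, keep correlations of at least 0.1, round, and sort
cors <- cors[names(cors) != "this"]
assoc <- sort(round(cors[cors >= 0.1], 2), decreasing = TRUE)
print(assoc)
```

"line." appears in exactly the same documents as "this", so its correlation is 1. The terms that appear only in the fourth document are negatively correlated with "this" and fall below the limit.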
