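This is the second article in the series "Interest analysis using Hatena Bookmark articles".
Here I run topic analysis with LDA*1 on every web page I have registered in Hatena Bookmark,
and analyze the topics (subjects, objects of interest) of what I bookmark.
The implementation assumes that data preparation (fetching the bookmarked articles from Hatena Bookmark and morphological analysis) is already complete.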

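What is topic analysis?

Topic analysis estimates the topics of the input data (subjects, fields, and other rough "meanings"). It can also be seen as a form of data abstraction. It closely resembles estimating classes in clustering*2 or estimating bases in dimensionality reduction*3.
In this article, topic analysis means estimating the genres present in the input set of bookmarked articles.

For details on topic analysis, the following page (PDF) is easy to follow and recommended.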

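What is LDA?

LDA is one of the topic analysis methods, and the best known.
It is a fully Bayesian version of pLSI, which is itself a probabilistic version of latent semantic indexing (LSI)*6.
It is a language model that assumes each document is generated probabilistically from multiple topics.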

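As a graphical model, it looks like the figure below.
I annotated the meaning of each variable and constant in red.

What we want to compute are the hyperparameters α (topic distribution)*7 and β (per-topic word distribution)*8 in the figure above;
the other symbols are either observed constants or variables used only during the computation.
There are many symbols, and it is easy to get confused about what is known and what is unknown, so take care.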
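
For reference, the standard generative process of LDA (as formulated by Blei et al.) can be sketched as follows. This is written in notation I am assuming here rather than transcribed from the figure: θ_d (topic proportions of document d), z_{d,n} (topic of the n-th word), w_{d,n} (the n-th word), D (number of documents), N_d (number of words in document d), and V (vocabulary size, i.e. the length of each β_k) are my own symbols.

```latex
% Generative process of LDA for D documents and K topics.
% alpha is a K-dimensional vector; beta_k is a V-dimensional word distribution for topic k.
\begin{align*}
\theta_d &\sim \mathrm{Dir}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Mult}(\theta_d), & n &= 1,\dots,N_d \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Mult}(\beta_{z_{d,n}})
\end{align*}
% Joint distribution of the latent and observed variables of one document.
\begin{equation*}
p(\theta_d, \mathbf{z}_d, \mathbf{w}_d \mid \alpha, \beta)
  = p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid z_{d,n}, \beta)
\end{equation*}
```

In this notation only w is observed, θ and z are latent variables used during the computation, and α and β are the quantities to estimate.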

For details on LDA, the following pages and references are easy to follow and recommended.

I also summarized it in the following article.
ni66ling.hatenadiary.jp

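Topic analysis with LDA and its results

Here I use Mochihashi's "lda, a Latent Dirichlet Allocation package."*10 to run LDA on the bookmarked articles.*11
See GitHub for the implementation details.
As a result, it outputs a word cloud like the one in the figure below.

Read the word cloud as follows.

It makes what I am interested in obvious at a glance. Interesting...
Next time I will write about topic analysis with HDP-LDA*12, which can determine the number of topics automatically.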

*1:Latent Dirichlet Allocation

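*2: e.g., k-means clustering, or probabilistic clustering with Gaussian mixture models

*3: e.g., principal component analysis, independent component analysis, sparse coding

*4: Among Japanese-language references on topic models, Daichi Mochihashi's materials are very easy to follow.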

*5: Material by D. M. Blei, who proposed LDA, the best-known topic model.

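*6: Related formulations include latent semantic analysis (LSA), principal component analysis (PCA), singular value decomposition (SVD), and the KL expansion. They differ in detail, but in effect they do the same thing.

*7: A vector with K entries, where K is the number of topics

*8: For each topic, a vector with N entries, where N is the number of words (vocabulary size)

*9: Note that it does not explain LDA itself. It gives a detailed account of the Bayesian treatment of Gaussian mixture models (GMM), and LDA is easier to understand by analogy with it.

*10: An implementation based on variational Bayes.

*11: Note that there are about 4,600 bookmarked articles, from 2008 through 2014.

*12: Hierarchical Dirichlet Process Latent Dirichlet Allocation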

2014-11-30

Stop words including HTML special characters

Data analysis, natural language processing


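Introduction

While doing natural language processing on documents collected from the web, HTML special characters (character entities) got in the way, so I created a stop list that includes them.*1

Stop list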

a
a's
aacute
able
about
above
according
accordingly
acirc
across
actually
acute
aelig
after
afterwards
again
against
agrave
ain't
alefsym
all
allow
allows
almost
alone
along
alpha
already
also
although
always
am
among
amongst
amp
an
and
ang
another
any
anybody
anyhow
anyone
anything
anyway
anyways
anywhere
apart
appear
appreciate
appropriate
are
aren't
aring
around
as
aside
ask
asking
associated
asymp
at
atilde
auml
available
away
awfully
b
bdquo
be
became
because
become
becomes
becoming
been
before
beforehand
behind
being
believe
below
beside
besides
best
beta
better
between
beyond
both
brief
brvbar
bull
but
by
c
c'mon
c's
came
can
can't
cannot
cant
cap
cause
causes
ccedil
cedil
cent
certain
certainly
changes
chi
circ
clearly
clubs
co
com
come
comes
concerning
cong
consequently
consider
considering
contain
containing
contains
copy
corresponding
could
couldn't
course
crarr
cup
curren
currently
d
dagger
darr
definitely
deg
delta
described
despite
diams
did
didn't
different
divide
do
does
doesn't
doing
don't
done
down
downwards
during
e
each
eacute
ecirc
edu
eg
egrave
eight
either
else
elsewhere
empty
emsp
enough
ensp
entirely
epsilon
equiv
especially
et
eta
etc
eth
euml
even
ever
every
everybody
everyone
everything
everywhere
ex
exactly
example
except
exist
f
far
few
fifth
first
five
fnof
followed
following
follows
for
forall
former
formerly
forth
four
frasl
from
further
furthermore
g
gamma
ge
get
gets
getting
given
gives
go
goes
going
gone
got
gotten
greetings
gt
h
had
hadn't
happens
hardly
harr
has
hasn't
have
haven't
having
he
he's
hearts
hellip
hello
help
hence
her
here
here's
hereafter
hereby
herein
hereupon
hers
herself
hi
him
himself
his
hither
hopefully
how
howbeit
however
i
i'd
i'll
i'm
i've
iacute
icirc
ie
iexcl
if
ignored
igrave
image
immediate
in
inasmuch
inc
indeed
indicate
indicated
indicates
infin
inner
insofar
instead
int
into
inward
iota
iquest
is
isin
isn't
it
it'd
it'll
it's
its
itself
iuml
j
just
k
kappa
keep
keeps
kept
know
known
knows
l
lambda
lang
laquo
larr
last
lately
later
latter
latterly
lceil
ldquo
le
least
less
lest
let
let's
lfloor
like
liked
likely
little
look
looking
looks
lowast
loz
lrm
lsaquo
lsquo
lt
ltd
m
macr
mainly
many
may
maybe
mdash
me
mean
meanwhile
merely
micro
middot
might
minus
more
moreover
most
mostly
mu
much
must
my
myself
n
nabla
name
namely
nbsp
nd
ndash
ne
near
nearly
necessary
need
needs
neither
never
nevertheless
new
next
ni
nine
no
nobody
non
none
noone
nor
normally
not
nothing
notin
novel
now
nowhere
nsub
ntilde
nu
o
oacute
obviously
ocirc
oelig
of
off
often
ograve
oh
ok
okay
old
oline
omega
omicron
on
once
one
ones
only
onto
oplus
or
ordf
ordm
oslash
other
others
otherwise
otilde
otimes
ought
ouml
our
ours
ourselves
out
outside
over
overall
own
p
para
part
particular
particularly
per
perhaps
permil
perp
phi
pi
piv
placed
please
plus
plusmn
possible
pound
presumably
prime
probably
prod
prop
provides
psi
q
que
quite
quot
qv
r
radic
rang
raquo
rarr
rather
rceil
rd
rdquo
re
real
really
reasonably
reg
regarding
regardless
regards
relatively
respectively
rfloor
rho
right
rlm
rsquo
s
said
same
saw
say
saying
says
sbquo
scaron
sdot
second
secondly
sect
see
seeing
seem
seemed
seeming
seems
seen
self
selves
sensible
sent
serious
seriously
seven
several
shall
she
should
shouldn't
shy
sigma
sigmaf
sim
since
six
so
some
somebody
somehow
someone
something
sometime
sometimes
somewhat
somewhere
soon
sorry
spades
specified
specify
specifying
still
sub
sube
such
sum
sup
supe
sure
szlig
t
t's
take
taken
tau
tell
tends
th
than
thank
thanks
thanx
that
that's
thats
the
their
theirs
them
themselves
then
thence
there
there's
thereafter
thereby
therefore
therein
theres
thereupon
these
theta
thetasym
they
they'd
they'll
they're
they've
think
thinsp
third
this
thorn
thorough
thoroughly
those
though
three
through
throughout
thru
thus
tilde
times
to
together
too
took
toward
towards
trade
tried
tries
truly
try
trying
twice
two
u
uacute
uarr
ucirc
ugrave
uml
un
under
unfortunately
unless
unlikely
until
unto
up
upon
upsih
upsilon
us
use
used
useful
uses
using
usually
uucp
uuml
v
value
various
very
via
viz
vs
w
want
wants
was
wasn't
way
we
we'd
we'll
we're
we've
weierp
welcome
well
went
were
weren't
what
what's
whatever
when
whence
whenever
where
where's
whereafter
whereas
whereby
wherein
whereupon
wherever
whether
which
while
whither
who
who's
whoever
whole
whom
whose
why
will
willing
wish
with
within
without
won't
wonder
would
wouldn't
x
xi
y
yacute
yen
yes
yet
you
you'd
you'll
you're
you've
your
yours
yourself
yourselves
yuml
z
zero
zeta
zwj
zwnj

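Reference sites

This stop word list combines the lists from the following sites.
1. http://jmlr.org/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
2. http://pst.co.jp/powersoft/html/index.php?f=3401
The concrete steps were as follows.

Notes on the merge procedure

From the HTML special character list page in 2., extract the strings enclosed between & and ;.*2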

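// XPaths of the table rows that list the entities on the reference page.
const xpaths = ["/html/body/div/div/div[3]/table/tbody/tr",
                "/html/body/div/div/div[4]/table/tbody/tr"];
for (const xpath of xpaths) {
  // Take a snapshot of the matching rows in document order
  // (XPathResult.ORDERED_NODE_SNAPSHOT_TYPE is the constant 7).
  const rows = document.evaluate(xpath, document, null,
                                 XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
  for (let i = 0; i < rows.snapshotLength; i++) {
    // innerHTML serializes "&" as "&amp;", so "&nbsp;" shown as text appears as "&amp;nbsp;".
    const entity = rows.snapshotItem(i).childNodes[1].innerHTML.match(/&amp;([a-zA-Z]+);/);
    if (entity && entity[1]) {
      console.log(entity[1]);
    }
  }
}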
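
The same extraction could also be sketched on the command line. This is only an illustration under assumptions: the function name extract_entity_names is hypothetical, the input is assumed to be the raw HTML source of a page that writes each entity as literal text (so "&nbsp;" appears as "&amp;nbsp;" in the source), and grep -o is assumed to be available (GNU and BSD grep provide it).

```shell
# Hypothetical helper: read raw HTML on stdin and print the entity names
# that appear as literal text such as "&amp;nbsp;", one per line, deduplicated.
extract_entity_names() {
  # -o prints only the matched part, -E enables extended regular expressions.
  grep -oE '&amp;[A-Za-z]+;' |
    sed -e 's/^&amp;//' -e 's/;$//' |  # strip the leading "&amp;" and the trailing ";"
    sort -u
}

# Usage (entities.html is a hypothetical local copy of the page):
#   extract_entity_names < entities.html >> stopwords.txt
```

Unlike the XPath version, this does not depend on the table layout of the page, but it also picks up matching strings anywhere in the file.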

Append the HTML special character strings from 2. to the stop words from 1. (stopwords.txt), then merge.

$ tr 'A-Z' 'a-z' < stopwords.txt | sort -u > stopwords_merge.txt
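
As a rough sketch of how the merged list might then be applied, the following filter drops stop words from a stream of tokens. The function name filter_stopwords and the one-token-per-line input format are assumptions made for illustration; they are not part of the procedure above.

```shell
# Hypothetical helper: remove stop words from tokens read on stdin.
# $1: stop list file with one lowercase word per line (e.g. stopwords_merge.txt).
filter_stopwords() {
  # Lowercase the tokens to match the stop list, then drop every line that
  # equals (-x) one of the fixed strings (-F) read from the file (-f).
  tr 'A-Z' 'a-z' | grep -vxF -f "$1"
}

# Usage (tokens.txt is a hypothetical file with one token per line):
#   filter_stopwords stopwords_merge.txt < tokens.txt
```

Matching whole lines with -x keeps words such as "amplifier" even though "amp" is in the stop list.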

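*1: Admittedly, one could object that I should just parse the HTML properly in the first place, but I leave this here in case someone finds it useful.

*2: For example, the amp in &amp;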

2014-10-07

Creating still-image thumbnails of video files with ffmpeg

ffmpeg sh

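Introduction

A memo, because I ran into this situation: "I have a huge number of video files, but the file names are so careless that I can hardly find the one I want."
Roughly, the approach is to create an N x N grid still-image thumbnail*1 for every video file in a given directory.

How to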

#!/bin/sh
dir_path="." #1
tiles_h=8 #4
tiles_v=8 #4
tiles=$(($tiles_h * $tiles_v))
for file in `find $dir_path -type f \( -name "*.flv" -o -name "*.mp4" -o -name "*.avi" \)`
do 
  frames=`ffmpeg -i "$file" 2>&1 | awk '/totalframes/{print $3}'`
  if [ "$frames" = "" ]; then 
    continue;
  else
    fps=$(($frames / $tiles)); #3
    if [ $fps -gt 1000 ]; then #2
      fps=1000;
    fi
    ffmpeg -y -i "$file" -vf thumbnail=${fps},tile=${tiles_h}x${tiles_v},scale=1920:-1 "${file%.*}_thumb.jpg"; #3
  fi
done

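#1: Directory whose video files get thumbnails.
#2: I am not sure whether this is an ffmpeg bug, but setting a very large value for thumbnail in the -vf option (e.g., 5000) makes it crash with an error, so I capped it at 1000.
#3: The still-image thumbnail is created here. The output size is set with scale; a value of -1 lets ffmpeg derive that dimension from the source so the aspect ratio is kept.
#4: Number of tiles in the thumbnail grid.
As a one-liner, it looks like this. Handy for copy and paste.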
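
The value passed to thumbnail in #2 and #3 can be isolated in a small function, sketched below. The name thumb_batch_size is hypothetical, and the lower bound of 1 is my own assumption (it is not in the script above), added so that clips with fewer frames than tiles do not yield 0.

```shell
# Hypothetical helper: how many frames the thumbnail filter examines per tile.
# $1: total number of frames, $2: horizontal tiles, $3: vertical tiles.
thumb_batch_size() {
  tiles=$(( $2 * $3 ))
  batch=$(( $1 / tiles ))   # integer division: frames available per tile
  if [ "$batch" -gt 1000 ]; then
    batch=1000              # upper limit taken from note #2
  fi
  if [ "$batch" -lt 1 ]; then
    batch=1                 # assumed lower limit for very short clips
  fi
  echo "$batch"
}

# Usage:
#   thumb_batch_size 6400 8 8    # prints 100
```
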

$ for file in `find . -type f \( -name "*.flv" -o -name "*.mp4" -o -name "*.avi" \)`; do frames=`ffmpeg -i "$file" 2>&1 | awk '/totalframes/{print $3}'`; if [ "$frames" = "" ]; then continue; else (fps=$(($frames/64)); if [ $fps -gt 1000 ]; then fps=1000; fi; ffmpeg -y -i "$file" -vf thumbnail=$fps,tile=8x8,scale=1920:-1 "${file%.*}_thumb.jpg") fi; done

Supplement

To run it periodically, save the following as something like "/home/ni66ling/local/bin/mk_thumbnail.sh":

#!/bin/sh
dir_path="."
tiles_h=8
tiles_v=8
tiles=$(($tiles_h * $tiles_v))
for file in `find $dir_path -type f \( -name "*.flv" -o -name "*.mp4" -o -name "*.avi" \)`
do 
  frames=`ffmpeg -i "$file" 2>&1 | awk '/totalframes/{print $3}'`
  if [ "$frames" = "" ]; then 
    continue;
  else
    fps=$(($frames / $tiles));
    if [ $fps -gt 1000 ]; then
      fps=1000;
    fi
    ffmpeg -y -i "$file" -vf thumbnail=${fps},tile=${tiles_h}x${tiles_v},scale=1920:-1 "${file%.*}_thumb.jpg";
  fi
done

Then setting cron to run it periodically should do the job.

00  3  *  *  * /home/ni66ling/local/bin/mk_thumbnail.sh

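I have not tried it yet, but plan to later.*2

Addendum 2014/10/12

There were cases where the above did not work*3, so I addressed them.
Instead of reading the total frame count from totalframes, I now compute it from duration and fps.
I also added log output and removal of stale thumbnails*4.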
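
The calculation Frames = Duration * FPS can be sketched in isolation as follows. The function name estimate_frames is hypothetical, and it uses awk for the arithmetic instead of the expr and bc combination in the script; like the script, it drops the fractional part of the seconds and of the result.

```shell
# Hypothetical helper: estimate the total frame count of a video.
# $1: duration as "HH:MM:SS" or "HH:MM:SS.xx", $2: frames per second.
estimate_frames() {
  echo "$1 $2" | awk '{
    split($1, t, ":")                          # t[1]=hours, t[2]=minutes, t[3]=seconds
    sec = t[1] * 3600 + t[2] * 60 + int(t[3])  # whole seconds only
    printf "%d\n", sec * $2                    # truncate to an integer frame count
  }'
}

# Usage:
#   estimate_frames 00:01:40.52 30    # prints 3000
```

Using awk keeps the sketch to a single process and accepts non-integer fps values without a separate bc call.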

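#!/bin/bash
# bash is required: the script uses arrays and $'...' quoting.

input_dir_path="."                                                         # input directory path
output_dir_path="${input_dir_path}/thumbs/"                                # output directory path (ends with "/")
log_file_path="${output_dir_path}mk_thumbnail_`date '+%Y%m%d%H%M%S'`.log"  # log file path

ffmpeg="./ffmpeg"  # path to the ffmpeg binary
find="/bin/find"   # path to the find binary

tiles_h=8          # number of thumbnail tiles horizontally
tiles_v=8          # number of thumbnail tiles vertically

# Create the output directory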
if [ ! -e "$output_dir_path" ]; then
  mkdir -p "$output_dir_path"
fi

# Count the files that need thumbnails
IFS=$'\n';
input_files_cmd='$find $input_dir_path -maxdepth 1 -type f \( -name "*.flv" -o -name "*.mp4" -o -name "*.avi" -o -name "*.ts" -o -name "*.wmv" \)'
nof_files=`eval "$input_files_cmd" | wc -l`
echo "#input files: $nof_files" | tee $log_file_path

counter=1
for file in `eval "$input_files_cmd"`; do
  IFS=$' \t\n'
    
  printf "processing:$file ($counter/$nof_files) is started... " | tee -a $log_file_path
  counter=$(($counter + 1))

  # Skip if the thumbnail already exists
  input_filename="${file##*/}"
  output_filepath="${output_dir_path}${input_filename%.*}_thumb.jpg"
  if [ -e  "$output_filepath" ]; then
    printf "already existing.\n" | tee -a $log_file_path
    continue
  fi
  
  # Compute the number of video frames (Frames = Duration * FPS)
  hms=(`echo \`$ffmpeg -i "$file" 2>&1 | awk '/Duration/{print $2}' | sed 's/\..*//g' | tr -s ':' ' ' \` `)
  if [ ${#hms[*]} -eq 0 ]; then
    printf "error (duration is not defined).\n" | tee -a $log_file_path
    continue
  fi
  duration_sec=`expr ${hms[0]} \* 3600 + ${hms[1]} \* 60 + ${hms[2]} \* 1`
  src_fps=`$ffmpeg -i "$file" 2>&1 | awk '/fps/{print $0}' | sed 's/\ fps.*//' | sed 's/.*\ //'`
  if [ "$src_fps" = "" ]; then
    printf "error (fps is not defined).\n" | tee -a $log_file_path
    continue
  fi
  frames=`echo "$duration_sec * $src_fps" | bc | sed 's/\..*//g'`

  # Create the thumbnail
  tiles=$(($tiles_h * $tiles_v))
  thumb_fps=`expr $frames \/ $tiles` 
  if [ $thumb_fps -gt 1000 ]; then 
    thumb_fps=1000
  fi
  ffmpeg_msg=`$ffmpeg -y -i "$file" -vf thumbnail=${thumb_fps},tile=${tiles_h}x${tiles_v},scale=1920:-1 "$output_filepath" \
    2>&1 > /dev/null | grep -E 'error|failed' | tr -d '\n'`
  if [ "$ffmpeg_msg" = "" ]; then
    printf "succeeded.\n" | tee -a $log_file_path
  else
    printf "error (ffmpeg:$ffmpeg_msg)\n" | tee -a $log_file_path
  fi
done
printf "making thumbnail is done !!\n\n"
printf "================================================================================\n\n"

# Remove thumbnails in the output directory whose source video no longer exists
echo "checking existing thumbnails."
IFS=$'\n';
input_files_cmd='$find $output_dir_path -maxdepth 1 -mindepth 1 -regex ".*_thumb\.jpg$"'
nof_files=`eval "$input_files_cmd" | wc -l`
echo "#thumbnail files: $nof_files" | tee -a $log_file_path

counter=1
for file in `eval "$input_files_cmd"`; do
  counter=$(($counter + 1))
  video_file_name=`echo $file | sed 's/.*\/\([^\/]*\)_thumb\.jpg/\1/' | sed 's/\(\[\|\]\|\*\|\?\)/\\\\\1/g'`
  find_video_cmd_result=`$find $input_dir_path -maxdepth 1 -mindepth 1 -name "${video_file_name}.*"`
  if [ "" = "$find_video_cmd_result" ]; then
    rm -f "$file"
    printf "source video file of thumbnail:${file} is missing. therefore, this thumbnail is removed.\n" | tee -a $log_file_path
  fi
done
printf "complete !!"