Stopwords including HTML special characters
Stop list
a a's aacute able about above according accordingly acirc across actually acute aelig after afterwards again against agrave ain't alefsym all allow allows almost alone along alpha already also although always am among amongst amp an and ang another any anybody anyhow anyone anything anyway anyways anywhere apart appear appreciate appropriate are aren't aring around as aside ask asking associated asymp at atilde auml available away awfully b bdquo be became because become becomes becoming been before beforehand behind being believe below beside besides best beta better between beyond both brief brvbar bull but by c c'mon c's came can can't cannot cant cap cause causes ccedil cedil cent certain certainly changes chi circ clearly clubs co com come comes concerning cong consequently consider considering contain containing contains copy corresponding could couldn't course crarr cup curren currently d dagger darr definitely deg delta described despite diams did didn't different divide do does doesn't doing don't done down downwards during e each eacute ecirc edu eg egrave eight either else elsewhere empty emsp enough ensp entirely epsilon equiv especially et eta etc eth euml even ever every everybody everyone everything everywhere ex exactly example except exist f far few fifth first five fnof followed following follows for forall former formerly forth four frasl from further furthermore g gamma ge get gets getting given gives go goes going gone got gotten greetings gt h had hadn't happens hardly harr has hasn't have haven't having he he's hearts hellip hello help hence her here here's hereafter hereby herein hereupon hers herself hi him himself his hither hopefully how howbeit however i i'd i'll i'm i've iacute icirc ie iexcl if ignored igrave image immediate in inasmuch inc indeed indicate indicated indicates infin inner insofar instead int into inward iota iquest is isin isn't it it'd it'll it's its itself iuml j just k kappa keep keeps kept know known knows l lambda 
lang laquo larr last lately later latter latterly lceil ldquo le least less lest let let's lfloor like liked likely little look looking looks lowast loz lrm lsaquo lsquo lt ltd m macr mainly many may maybe mdash me mean meanwhile merely micro middot might minus more moreover most mostly mu much must my myself n nabla name namely nbsp nd ndash ne near nearly necessary need needs neither never nevertheless new next ni nine no nobody non none noone nor normally not nothing notin novel now nowhere nsub ntilde nu o oacute obviously ocirc oelig of off often ograve oh ok okay old oline omega omicron on once one ones only onto oplus or ordf ordm oslash other others otherwise otilde otimes ought ouml our ours ourselves out outside over overall own p para part particular particularly per perhaps permil perp phi pi piv placed please plus plusmn possible pound presumably prime probably prod prop provides psi q que quite quot qv r radic rang raquo rarr rather rceil rd rdquo re real really reasonably reg regarding regardless regards relatively respectively rfloor rho right rlm rsquo s said same saw say saying says sbquo scaron sdot second secondly sect see seeing seem seemed seeming seems seen self selves sensible sent serious seriously seven several shall she should shouldn't shy sigma sigmaf sim since six so some somebody somehow someone something sometime sometimes somewhat somewhere soon sorry spades specified specify specifying still sub sube such sum sup supe sure szlig t t's take taken tau tell tends th than thank thanks thanx that that's thats the their theirs them themselves then thence there there's thereafter thereby therefore therein theres thereupon these theta thetasym they they'd they'll they're they've think thinsp third this thorn thorough thoroughly those though three through throughout thru thus tilde times to together too took toward towards trade tried tries truly try trying twice two u uacute uarr ucirc ugrave uml un under unfortunately unless unlikely 
until unto up upon upsih upsilon us use used useful uses using usually uucp uuml v value various very via viz vs w want wants was wasn't way we we'd we'll we're we've weierp welcome well went were weren't what what's whatever when whence whenever where where's whereafter whereas whereby wherein whereupon wherever whether which while whither who who's whoever whole whom whose why will willing wish with within without won't wonder would wouldn't x xi y yacute yen yes yet you you'd you'll you're you've your yours yourself yourselves yuml z zero zeta zwj zwnj
Reference sites
This stopword list combines the lists from the following sites.
1. http://jmlr.org/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
2. http://pst.co.jp/powersoft/html/index.php?f=3401
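The merge was done as follows.
Merge notes
From the HTML special character list page (site 2), extract the strings enclosed between & and ; by running the following in the browser console.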
// XPaths of the table rows that list the entities on the reference page.
const xpaths = [
  "/html/body/div/div/div[3]/table/tbody/tr",
  "/html/body/div/div/div[4]/table/tbody/tr",
];
for (const xpath of xpaths) {
  const nodes = document.evaluate(
    xpath, document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
  for (let i = 0; i < nodes.snapshotLength; i++) {
    // innerHTML escapes a literal "&" as "&amp;", so match "&amp;name;".
    const entity = nodes.snapshotItem(i).childNodes[1].innerHTML.match(/&amp;([a-zA-Z]+);/);
    if (entity && entity[1]) {
      console.log(entity[1]);
    }
  }
}
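The matching step can also be checked outside the browser. The following is a minimal sketch, assuming each table cell's innerHTML holds the escaped form of the entity (for example `&amp;nbsp;` for a cell that displays `&nbsp;`). The helper name `entityName` and the sample cells are hypothetical and are not taken from the reference page.

```javascript
// Hypothetical helper: return the entity name found in a cell's innerHTML,
// or null when the cell does not contain one.
function entityName(cellHtml) {
  // A literal "&" is serialized as "&amp;" in innerHTML,
  // so the name sits between "&amp;" and ";".
  const match = cellHtml.match(/&amp;([a-zA-Z]+);/);
  return match ? match[1] : null;
}

// Made-up sample cells for illustration only.
const sampleCells = ["&amp;nbsp;", "&amp;Aacute;", "no entity here"];
const names = sampleCells.map(entityName).filter((name) => name !== null);
console.log(names.join("\n"));
```

Like the console snippet, this takes only the first match per cell and accepts only ASCII letters in the name.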
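Append the HTML special character strings from site 2 to the stopwords from site 1 (saved as stopwords.txt), then merge by lowercasing, sorting, and removing duplicates.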
$ cat stopwords.txt | tr '[A-Z]' '[a-z]' | sort | uniq > stopwords_merge.txt
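For reference, the same merge can be sketched in JavaScript. This is an illustration under stated assumptions, not the procedure actually used: the function name `mergeStopwords` and the sample input are hypothetical, the input is assumed to hold one word per line, and unlike the shell pipeline the sketch trims whitespace and skips blank lines.

```javascript
// Hypothetical helper that mirrors: tr '[A-Z]' '[a-z]' | sort | uniq
function mergeStopwords(text) {
  const words = text
    .split(/\r?\n/)
    .map((line) => line.trim().toLowerCase())
    .filter((line) => line.length > 0); // deviation: blank lines are dropped
  // A Set removes duplicates; the default sort compares UTF-16 code units,
  // which for ASCII input matches byte order (as with sort in the C locale).
  return [...new Set(words)].sort();
}

// Made-up sample input: "The"/"the" collapse into one entry after lowercasing.
const sample = "The\nnbsp\nthe\nAacute\na's\n";
console.log(mergeStopwords(sample).join("\n"));
```

Because everything is lowercased before deduplication, entity names that differ only in case (such as Aacute and aacute) end up as a single entry.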