To execute Mr.LDA


Since Mr.LDA runs on Hadoop, its output lives in HDFS, so the key to getting readable results is converting that output into a proper format. The detailed methods are part of the original Mr.LDA project and can be used by referring to its README.md. The main steps to train on a corpus are as follows:

1. Prepare the corpus

Two points require attention.

  • Firstly, the corpus format is the same as lda-c, so we have to convert the corpus into that format with a bit of coding (see the sketch at the end of this step).
  • Secondly, to be processed on Hadoop, the corpus needs to be parsed once more. The code for this is available in the original Mr.LDA; all we have to do is run a shell command like this:
$ hadoop jar target/mrlda-0.9.0-SNAPSHOT-fatjar.jar cc.mrlda.ParseCorpus \
    -input ap-sample.txt -output ap-sample-parsed

A complete parsed corpus is separated into several parts by property, like this:

$ hadoop fs -ls ap-sample-parsed
ap-sample-parsed/document
ap-sample-parsed/term
ap-sample-parsed/title

The corpus we use to run Mr.LDA then comes from this folder.
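
Regarding the first point above, a minimal Python sketch of the lda-c conversion might look like the following. It is only an illustration, not part of Mr.LDA: the file names raw_corpus.txt, corpus.ldac, and vocab.txt are made up, and the input is assumed to hold one whitespace-tokenized document per line.

# convert_to_ldac.py -- hypothetical helper; file names are placeholders
from collections import Counter

vocab = {}  # word -> integer term id

with open("raw_corpus.txt") as fin, open("corpus.ldac", "w") as fout:
    for line in fin:
        counts = Counter(line.split())
        if not counts:
            continue  # skip empty lines
        pairs = [(vocab.setdefault(w, len(vocab)), c) for w, c in counts.items()]
        # lda-c line: <number of unique terms> <term id>:<count> ...
        fout.write(str(len(pairs)) + " " + " ".join(f"{i}:{c}" for i, c in pairs) + "\n")

# keep the vocabulary so term ids can be mapped back to words later
with open("vocab.txt", "w") as fvocab:
    for word, _ in sorted(vocab.items(), key=lambda kv: kv[1]):
        fvocab.write(word + "\n")

Each output line starts with the number of unique terms in the document, followed by id:count pairs, which is the lda-c convention mentioned above.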

2. Run "vanilla" LDA

This step takes a long time, about one to two hours, so we run it with the nohup command.

Set some parameters and run it like this:

$ nohup hadoop jar target/mrlda-0.9.0-SNAPSHOT-fatjar.jar \
    cc.mrlda.VariationalInference \
    -input ap-sample-parsed/document -output ap-sample-lda \
    -term 10000 -topic 20 -iteration 50 -mapper 50 -reducer 20 >& lda.log &

3. Convert the output to a proper format

The output is stored in HDFS and is not directly readable. If we want to see the data, we must convert it to a proper format.

Note that the conversion method needs the SciPy module in Python, which is used to read MATLAB files and similar data. To add the module we only need to type:

$ sudo apt-get install python-scipy

Then we can view the alpha and beta files in the terminal using the tools from the original Mr.LDA. One open question remains here: how to export the alpha, beta, and other files as the final output.
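
As a rough illustration of what the SciPy step might look like: the sketch below assumes the converted topic-word matrix has been saved in a MATLAB-style file (the file name mrlda_beta.mat and the variable name beta are assumptions for this example, not names produced by Mr.LDA) and simply prints the highest-weighted term ids per topic.

# inspect_beta.py -- illustrative only; file and variable names are assumptions
import numpy as np
from scipy.io import loadmat

data = loadmat("mrlda_beta.mat")           # hypothetical export of the topic-word matrix
beta = np.asarray(data["beta"])            # assumed shape: (num_topics, vocab_size)

for k, topic in enumerate(beta):
    top_ids = np.argsort(topic)[::-1][:10]     # ten highest-weighted term ids for topic k
    print("topic", k, ":", " ".join(str(i) for i in top_ids))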

4. About evaluation in machine learning

The key to evaluating any machine learning algorithm is to split the corpus into three datasets: a training set, a development set, and a test set. The training set is used to fit the model, the development set is used to select parameters, and the test set is used for evaluation. For this task, since we do not focus on tuning parameters, we use only the training set and the test set.
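
For completeness, here is a minimal sketch of such a split (just a training/test split, since the development set is skipped here). The file names documents.txt, train.txt, and test.txt are placeholders, and the input is assumed to hold one document per line.

# split_corpus.py -- simple 90/10 train/test split; file names are placeholders
import random

with open("documents.txt") as f:       # one document per line
    docs = f.readlines()

random.seed(42)                        # fixed seed so the split is reproducible
random.shuffle(docs)

cut = int(0.9 * len(docs))             # 90% for training, 10% for testing
with open("train.txt", "w") as f:
    f.writelines(docs[:cut])
with open("test.txt", "w") as f:
    f.writelines(docs[cut:])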

posted on 2014-12-24 11:26

Reposted from: https://www.cnblogs.com/cyno/p/4182026.html
