Since Mr.LDA runs on Hadoop, its output lives in HDFS, so the key to getting human-readable data is converting the output into a proper format. The methods are documented in detail in the original Mr.LDA, which can be used by referring to its README.md. The main steps to train on a corpus are as follows:
1. Prepare the corpus
Two points need attention.
- First, the corpus format is the same as lda-c's, so we have to convert the raw corpus into that format with a small script (see the sketch after the command below).
- Second, to be processed on Hadoop, the corpus must be parsed again. The code for this is available in the original Mr.LDA, so all we need to do is run a shell command like this:
$ hadoop jar target/mrlda-0.9.0-SNAPSHOT-fatjar.jar cc.mrlda.ParseCorpus \
    -input ap-sample.txt -output ap-sample-parsed
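For the first point, here is a minimal sketch of the lda-c conversion, assuming the raw corpus has one whitespace-tokenized document per line; the file names and tokenization are illustrative assumptions, not part of the original. In the lda-c format, each document becomes one line of the form `N id1:count1 id2:count2 ...`, where `N` is the number of unique terms.

```python
# Minimal sketch: convert one-document-per-line raw text into lda-c format.
# File names and whitespace tokenization are assumptions for illustration.
from collections import Counter

vocab = {}  # term -> integer id, assigned on first appearance

with open("ap-sample-raw.txt") as fin, open("ap-sample.txt", "w") as fout:
    for line in fin:
        counts = Counter(line.split())
        pairs = []
        for term, count in counts.items():
            term_id = vocab.setdefault(term, len(vocab))
            pairs.append("%d:%d" % (term_id, count))
        fout.write("%d %s\n" % (len(pairs), " ".join(pairs)))

# Persist the vocabulary so term ids can be mapped back to words later.
with open("vocab.txt", "w") as f:
    for term, term_id in sorted(vocab.items(), key=lambda kv: kv[1]):
        f.write(term + "\n")
```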
The parsed corpus is separated into several parts by property, like this:
$ hadoop fs -ls ap-sample-parsed
ap-sample-parsed/document
ap-sample-parsed/term
ap-sample-parsed/title
The corpus we use to run Mr.LDA then comes from this folder.
2. Run "vanilla" LDA
This step takes a long time, about 1 or 2 hours, so we run it in the background with the nohup command. Set the parameters and run it like this:
$ nohup hadoop jar target/mrlda-0.9.0-SNAPSHOT-fatjar.jar \
    cc.mrlda.VariationalInference \
    -input ap-sample-parsed/document -output ap-sample-lda \
    -term 10000 -topic 20 -iteration 50 -mapper 50 -reducer 20 >& lda.log &
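Since nohup detaches the job from the terminal and redirects its output to lda.log, progress can be followed by tailing that log file:
$ tail -f lda.log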
3. Convert the output to a proper format
The output is stored in HDFS and is not directly readable. If we want to get human-readable data, we must convert it to a proper format.
Notably, the format-conversion method needs the SciPy module in Python, which can read data from MATLAB and similar formats. To install the module we only need to type:
$ sudo apt-get install python-scipy
Then we can view the alpha and beta files in the terminal using the original Mr.LDA tools. One question remains open here: how to export beta, alpha, and the other files as the final output.
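A minimal sketch of reading such a MATLAB-style file with SciPy follows; the file name `lda-output.mat` and the variable keys are hypothetical, since the exact layout depends on how the Mr.LDA output was exported.

```python
# Minimal sketch: inspect a MATLAB-style data file with SciPy.
# The file name and variable keys are hypothetical.
import scipy.io

data = scipy.io.loadmat("lda-output.mat")

# loadmat returns a dict of variable name -> numpy array;
# keys starting with "__" are file metadata and can be skipped.
for name, value in data.items():
    if not name.startswith("__"):
        print(name, getattr(value, "shape", value))
```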
4. About the evaluation of machine learning
The key to evaluating any machine learning algorithm is to split the corpus into three datasets: a training set, a development set, and a test set. The training set is used to fit the model, the development set is used to select parameters, and the test set is used for evaluation. For this task, since we do not focus on tuning parameters, we use only the training set and the test set.
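A minimal sketch of such a split follows; since we skip the development set here, the corpus is divided into a training set and a test set only, and the file names and the 90/10 ratio are illustrative assumptions.

```python
# Minimal sketch: shuffle the documents and split them into a
# training set and a test set (90% / 10%; ratio is illustrative).
import random

with open("ap-sample.txt") as f:  # hypothetical corpus, one document per line
    docs = f.readlines()

random.seed(42)   # fixed seed so the split is reproducible
random.shuffle(docs)

cut = int(0.9 * len(docs))
with open("ap-sample-train.txt", "w") as f:
    f.writelines(docs[:cut])
with open("ap-sample-test.txt", "w") as f:
    f.writelines(docs[cut:])
```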