Data Mining practicum
This is a practicum for my Master's studies, in which our team builds its own distributed machine learning platform from scratch on the LRZ cloud. Our progress so far:
- Setup (Week 1 - 2)
  - Create virtual machines
  - Set up the Hadoop environment
  - Set up the Ansible automation tool
  - Set up the Spark environment
  - Set up the Dask environment
- PrimeNumber and WordCount experiments (Week 3 - 4): compare the performance of different implementations of the prime-number and word-count problems.
  - Hadoop, written in Java
  - Hadoop, written in Scala
  - PySpark
  - Dask
- K-Medoids and (Bisecting) K-Means (Week 5 - 9): compare the performance of different algorithms on the protein data clustering problem.
Presentation slides: Google Slides
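
For readers unfamiliar with the two benchmark problems in the PrimeNumber and WordCount experiments, here is a minimal single-machine sketch in plain Python. It is not the team's Hadoop, PySpark, or Dask code; the function names `count_primes` and `word_count` are hypothetical and only illustrate what each problem computes, so that a distributed version has a reference result to compare against.

```python
from collections import Counter
from functools import reduce


def count_primes(limit: int) -> int:
    """Count primes strictly below `limit` with a sieve of Eratosthenes."""
    if limit < 3:
        return 0
    is_prime = bytearray([1]) * limit
    is_prime[0] = is_prime[1] = 0
    for n in range(2, int(limit ** 0.5) + 1):
        if is_prime[n]:
            # Cross out every multiple of n, starting at n * n.
            is_prime[n * n :: n] = bytearray(len(range(n * n, limit, n)))
    return sum(is_prime)


def word_count(lines):
    """Map each line to per-line word counts, then reduce by summing them."""
    partial_counts = (Counter(line.lower().split()) for line in lines)
    return reduce(lambda a, b: a + b, partial_counts, Counter())


if __name__ == "__main__":
    print(count_primes(100))
    print(word_count(["to be or", "not to be"]).most_common(2))
```

The word count is deliberately written as a map step followed by a reduce step, since that is the shape the distributed frameworks listed above parallelize across workers.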