Data Mining Practicum

spark

This is a practicum from my Master's studies, in which our team builds a distributed machine learning platform from scratch on the LRZ cloud. Our progress so far:

  • Set up (Week 1 - 2)
    • create virtual machines
    • set up Hadoop environment
    • set up Ansible automation tool
    • set up Spark environment
    • set up Dask environment
  • PrimeNumber and WordCount Experiments (Week 3 - 4): compare the performance of different implementations of the prime number and word count problems.
    • Hadoop - written in Java
    • Hadoop - written in Scala
    • PySpark
    • Dask
  • K-Medoids and (Bisecting) K-Means (Week 5 - 9): compare the performance of different algorithms on the protein data clustering problem.
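
For readers unfamiliar with the WordCount and PrimeNumber experiments listed above, here is a minimal sketch of the map/reduce pattern they rely on. It is plain Python, not the team's Hadoop, PySpark, or Dask code. The function names, the toy input, and the way the data is partitioned are all assumptions made for illustration.

```python
from collections import Counter
from functools import reduce
from math import isqrt


def count_words(partition):
    """Map step: count words within one partition (a list of lines)."""
    return Counter(word for line in partition for word in line.lower().split())


def merge_counts(left, right):
    """Reduce step: merge two partial word counts."""
    return left + right


def is_prime(n):
    """Trial division up to the integer square root of n."""
    if n < 2:
        return False
    return all(n % d for d in range(2, isqrt(n) + 1))


def count_primes(partition):
    """Map step: count the primes within one range of integers."""
    return sum(1 for n in partition if is_prime(n))


# Hypothetical toy input, split into two partitions the way a cluster
# would distribute lines of text across workers.
partitions = [["spark and dask", "dask and hadoop"], ["hadoop and spark"]]
word_counts = reduce(merge_counts, map(count_words, partitions), Counter())

# Integers below 100, split into four partitions (an arbitrary choice).
ranges = [range(0, 25), range(25, 50), range(50, 75), range(75, 100)]
prime_count = sum(map(count_primes, ranges))

print(word_counts.most_common())
print(prime_count)
```

Each partition is processed independently and only the small partial results are merged. That is what lets a framework spread the map step across machines. In PySpark the word count is typically expressed with `flatMap`, `map`, and `reduceByKey` on an RDD, which follows the same shape.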

Presentation slides: Google Slides