MangoCool

Spark-2.0.1 GaussianMixture Example Code

2016-10-10 13:58:46   Author: MangoCool   Source: MangoCool

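While studying Spark's machine learning library again, I tried writing a GaussianMixture demo and have more or less figured it out. I am sharing it here once more for discussion, and I hope it helps.

Dependencies: jdk1.8, scala-2.11.8, spark-2.0.1

Development environment: ideaIU-2016.2.1

Test environment: win7

Runnable code: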

package com.dtxy.xbdp.test

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.GaussianMixture
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

/**
  * Created by MANGOCOOL on 2016/10/10.
  */
object GaussianMixtureTest {

  // On Windows, point Spark at a local Hadoop install (it looks for winutils.exe there)
  System.setProperty("hadoop.home.dir", "E:\\Program Files\\hadoop-2.7.0")

  def main(args: Array[String]): Unit = {

    Logger.getRootLogger.setLevel(Level.WARN)

    val sparkConf = new SparkConf().setAppName("GaussianMixture")

    val spark = SparkSession
      .builder()
      .config(sparkConf)
      .master("local")
      .getOrCreate()

    // Read the table stored in Parquet format directly
    val df = spark.read.parquet("hdfs://masters/iris")
    df.show

    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")

    val vectorAssembler = new VectorAssembler()
      .setInputCols(Array("f0","f1","f2","f3"))
      .setOutputCol("features")

    // create the trainer and set its parameters
    val trainer = new GaussianMixture()
      .setFeaturesCol("features")
      .setK(3)
      .setSeed(1234L)
      .setMaxIter(500)
      .setTol(0.01)

    // Randomly split the input data 7:3: 70% is for training, the rest is for testing.
    val Array(trainingData, testData) = df.randomSplit(Array(0.7, 0.3))

    /**
      * Create an ML pipeline made up of 3 PipelineStage objects
      * (label indexer, vector assembler, GaussianMixture trainer),
      * then call fit to run them on the training data.
      */
    val pipeline = new Pipeline().setStages(Array(labelIndexer, vectorAssembler, trainer))
    
    // train the model
    val model = pipeline.fit(trainingData)
    
    // assign each test row to a cluster and print the full result (no score is computed here)
    val result = model.transform(testData)
    result.select("*").show(150, false)

    spark.stop()
  }
}
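The listing above only uses the fitted pipeline to transform the test data. As a small sketch of how the learned mixture itself could be inspected, the helper below pulls the fitted GaussianMixtureModel out of the pipeline and formats each component's weight and mean. The names GmmInspection, summaryLines and describe are made up for illustration, and the helper assumes the GaussianMixture trainer is the last pipeline stage, as it is in the listing.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.clustering.GaussianMixtureModel

// Hypothetical helper, not part of the original post.
object GmmInspection {

  // One text line per mixture component: index, weight and mean vector.
  def summaryLines(model: PipelineModel): Seq[String] = {
    // Assumption: the GaussianMixture trainer was the last pipeline stage,
    // so its fitted model is the last transformer of the PipelineModel.
    val gmm = model.stages.last.asInstanceOf[GaussianMixtureModel]
    gmm.weights.zip(gmm.gaussians).zipWithIndex.map { case ((weight, gaussian), i) =>
      s"component $i: weight=$weight mean=${gaussian.mean}"
    }.toSeq
  }

  // Print the summary to stdout.
  def describe(model: PipelineModel): Unit = summaryLines(model).foreach(println)
}
```

With the listing's variables, this would be called as GmmInspection.describe(model) right after pipeline.fit(trainingData).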

A test data set, iris.txt, is provided here; you can read it from the local filesystem or from HDFS.

Download: iris.txt test data

Here are my test results:
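The listing reads a Parquet table, so a text file has to be loaded differently. The sketch below is one possible way to do that, and it rests on an assumption that the post does not confirm: that each line of iris.txt holds four comma-separated numeric features followed by a class label. The object name IrisLoader is made up; the column names f0 to f3 and label are the ones the listing expects.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical loader, not part of the original post.
object IrisLoader {

  // Assumption: every line looks like "5.1,3.5,1.4,0.2,Iris-setosa"
  // (four numeric features, then a string label, no header row).
  def load(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("inferSchema", "true") // parse the feature columns as numbers
      .csv(path)
      .toDF("f0", "f1", "f2", "f3", "label") // names used by the pipeline
}
```

The same function should work for either location, since only the path changes: a file:// URI for a local copy, or an hdfs:// URI for a copy stored on HDFS.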

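The results differ from run to run, because the quality of the learned model varies each time. One contributing factor is that randomSplit is called without a seed, so the training and test sets change on every run.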
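To put a number on how good or bad a given run is, one option is cluster purity: for each cluster, count its most frequent true label, add those counts up, and divide by the number of rows. This is a sketch of one possible measure, not something the original demo computes; the names ClusterPurity and purity and the sample pairs are made up for illustration.

```scala
// Hypothetical helper, not part of the original post.
object ClusterPurity {

  // pairs: (true label index, predicted cluster id) for each row.
  def purity(pairs: Seq[(Double, Int)]): Double = {
    require(pairs.nonEmpty, "need at least one (label, cluster) pair")
    // For every cluster, the size of its largest single-label group.
    val majorityCounts = pairs
      .groupBy { case (_, cluster) => cluster }
      .values
      .map(members => members.groupBy(_._1).values.map(_.size).max)
    majorityCounts.sum.toDouble / pairs.size
  }

  def main(args: Array[String]): Unit = {
    // Made-up pairs: cluster 0 holds labels 0.0, 0.0, 1.0; cluster 1 holds label 1.0.
    val sample = Seq((0.0, 0), (0.0, 0), (1.0, 1), (1.0, 0))
    println(purity(sample)) // prints 0.75
  }
}
```

With the demo's result DataFrame, the pairs could be collected with result.select("indexedLabel", "prediction").collect().map(r => (r.getDouble(0), r.getInt(1))), assuming indexedLabel is a double and prediction an integer, as StringIndexer and GaussianMixtureModel produce them in Spark 2.0.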

Tags: Spark GaussianMixture Demo
