Start Spark
bin/spark-shell
scala> val licLines = sc.textFile("LICENSE")
licLines: org.apache.spark.rdd.RDD[String] = LICENSE MapPartitionsRDD[1] at textFile at <console>:23
scala> val lineCnt = licLines.count
lineCnt: Long = 56
Filter the lines that contain the string "BSD"
scala> val bsdLines = licLines.filter(line => line.contains("BSD"))
bsdLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:23
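The same `filter` + `contains` pattern works on any Scala collection, which is a quick way to sanity-check the predicate before running it on an RDD. A minimal local sketch (the sample lines below are illustrative, not the real LICENSE contents):

```scala
// Filter lines containing "BSD" from a plain local collection,
// mirroring the RDD filter above without needing a SparkContext.
val lines = List(
  "This product includes BSD-licensed software",
  "Apache License 2.0",
  "BSD 3-Clause components"
)
val bsdLines = lines.filter(line => line.contains("BSD"))
println(bsdLines.size) // prints 2
```

Because RDD transformations like `filter` take ordinary Scala functions, a predicate verified locally like this can be passed to `licLines.filter` unchanged.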
from pyspark.ml.regression import RandomForestRegressor

# Every record contains a label and feature vector
df = spark.createDataFrame(data, ["label", "features"])

# Split the data into train/test datasets
train_df, test_df = df.randomSplit([.80, .20], seed=42)

# Set hyperparameters for the algorithm
rf = RandomForestRegressor(numTrees=100)

# Fit the model to the training data
model = rf.fit(train_df)

# Generate predictions on the test dataset
model.transform(test_df).show()