Datasets and SQL

Datasets

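The Dataset API combines the type safety and functional programming benefits of RDDs with the relational model and performance optimizations of the DataFrame API. DataFrame no longer exists as a class in the Java API, so use Dataset<Row> wherever you would previously have referred to a DataFrame.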

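The following application demonstrates how to create a Dataset with an implicit schema, how to create a Dataset with an explicit schema, and how to run SQL queries on the dataset.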

Consider a collection named characters:

{ "_id" : ObjectId("585024d558bef808ed84fc3e"), "name" : "Bilbo Baggins", "age" : 50 }
{ "_id" : ObjectId("585024d558bef808ed84fc3f"), "name" : "Gandalf", "age" : 1000 }
{ "_id" : ObjectId("585024d558bef808ed84fc40"), "name" : "Thorin", "age" : 195 }
{ "_id" : ObjectId("585024d558bef808ed84fc41"), "name" : "Balin", "age" : 178 }
{ "_id" : ObjectId("585024d558bef808ed84fc42"), "name" : "Kíli", "age" : 77 }
{ "_id" : ObjectId("585024d558bef808ed84fc43"), "name" : "Dwalin", "age" : 169 }
{ "_id" : ObjectId("585024d558bef808ed84fc44"), "name" : "Óin", "age" : 167 }
{ "_id" : ObjectId("585024d558bef808ed84fc45"), "name" : "Glóin", "age" : 158 }
{ "_id" : ObjectId("585024d558bef808ed84fc46"), "name" : "Fíli", "age" : 82 }
{ "_id" : ObjectId("585024d558bef808ed84fc47"), "name" : "Bombur" }

package com.mongodb.spark_examples;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import com.mongodb.spark.MongoSpark;


public final class DatasetSQLDemo {

  public static void main(final String[] args) throws InterruptedException {

    SparkSession spark = SparkSession.builder()
      .master("local")
      .appName("MongoSparkConnectorIntro")
      .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.myCollection")
      .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.myCollection")
      .getOrCreate();

    // Create a JavaSparkContext using the SparkSession's SparkContext object
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

    // Load data and infer schema, disregard toDF() name as it returns Dataset
    Dataset<Row> implicitDS = MongoSpark.load(jsc).toDF();
    implicitDS.printSchema();
    implicitDS.show();

    // Load data with explicit schema
    Dataset<Character> explicitDS = MongoSpark.load(jsc).toDS(Character.class);
    explicitDS.printSchema();
    explicitDS.show();

    // Create the temp view and execute the query
    explicitDS.createOrReplaceTempView("characters");
    Dataset<Row> centenarians = spark.sql("SELECT name, age FROM characters WHERE age >= 100");
    centenarians.show();

    // Write the data to the "hundredClub" collection
    MongoSpark.write(centenarians).option("collection", "hundredClub").mode("overwrite").save();

    // Load the data from the "hundredClub" collection
    // (ReadConfig is fully qualified because it is not among the imports above)
    MongoSpark.load(spark,
        com.mongodb.spark.config.ReadConfig.create(spark).withOption("collection", "hundredClub"),
        Character.class).show();

    jsc.close();

  }
}

Implicitly Declare a Schema

To create a Dataset from MongoDB data, load the data via MongoSpark and call the JavaMongoRDD.toDF() method. Although toDF() sounds like a DataFrame method, it is part of the Dataset API and returns a Dataset<Row>.

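The schema of the dataset is inferred whenever data is read from MongoDB and stored in a Dataset<Row> without specifying a schema-defining Java bean. The schema is inferred by sampling documents from the database. To declare a schema explicitly, see Explicitly Declare a Schema.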

The following operation loads data from MongoDB, then uses the Dataset API to create a Dataset and infer the schema:

Dataset<Row> implicitDS = MongoSpark.load(jsc).toDF();
implicitDS.printSchema();
implicitDS.show();

implicitDS.printSchema() outputs the following schema to the console:

root
 |-- _id: struct (nullable = true)
 |    |-- oid: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)

implicitDS.show() outputs the following to the console:

+--------------------+----+-------------+
|                 _id| age|         name|
+--------------------+----+-------------+
|[585024d558bef808...|  50|Bilbo Baggins|
|[585024d558bef808...|1000|      Gandalf|
|[585024d558bef808...| 195|       Thorin|
|[585024d558bef808...| 178|        Balin|
|[585024d558bef808...|  77|         Kíli|
|[585024d558bef808...| 169|       Dwalin|
|[585024d558bef808...| 167|          Óin|
|[585024d558bef808...| 158|        Glóin|
|[585024d558bef808...|  82|         Fíli|
|[585024d558bef808...|null|       Bombur|
+--------------------+----+-------------+
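
Schema inference by sampling, as described above, can be pictured with a small plain-Java model. This is only a sketch under stated assumptions: the class SchemaInferenceSketch, its methods, and the conflict rule (fall back to "string") are hypothetical and are not the connector's actual algorithm. It shows why a field such as age still appears in the schema when some sampled documents (Bombur) lack it. Only a few ASCII-named documents are used, and _id is left out for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of schema inference by sampling. This is NOT the connector's code.
final class SchemaInferenceSketch {

    // Builds one sample document; a null age means the field is absent.
    static Map<String, Object> doc(String name, Integer age) {
        Map<String, Object> d = new LinkedHashMap<>();
        d.put("name", name);
        if (age != null) {
            d.put("age", age);
        }
        return d;
    }

    static List<Map<String, Object>> sampleDocs() {
        List<Map<String, Object>> docs = new ArrayList<>();
        docs.add(doc("Bilbo Baggins", 50));
        docs.add(doc("Gandalf", 1000));
        docs.add(doc("Bombur", null)); // no "age" field, as in the collection
        return docs;
    }

    static String typeName(Object value) {
        if (value instanceof Integer) return "integer";
        if (value instanceof String) return "string";
        return "unknown";
    }

    // Union of all field names in the sample, each mapped to a type name.
    static Map<String, String> inferSchema(List<Map<String, Object>> sample) {
        Map<String, String> schema = new TreeMap<>(); // sorted by field name
        for (Map<String, Object> d : sample) {
            for (Map.Entry<String, Object> field : d.entrySet()) {
                // Keep the type while samples agree; fall back to "string" on a conflict.
                schema.merge(field.getKey(), typeName(field.getValue()),
                        (seen, next) -> seen.equals(next) ? seen : "string");
            }
        }
        return schema;
    }

    public static void main(String[] args) {
        inferSchema(sampleDocs()).forEach((field, type) ->
                System.out.println(" |-- " + field + ": " + type + " (nullable = true)"));
    }
}
```

Because the schema is the union of the fields seen in the sample, a document that lacks a field simply yields null for it, which matches the null age shown for Bombur.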

Explicitly Declare a Schema

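By default, reading from MongoDB in a SparkSession infers the schema by sampling documents from the collection. You can also define the schema explicitly with a Java bean, which removes the extra queries needed for sampling.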

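Note

If you provide a class for the schema (a Java bean here, a case class in Scala), MongoDB returns only the declared fields. This helps minimize the data sent across the wire.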

The following code defines a Character Java bean, which is then used to define the schema for the Dataset:

import java.io.Serializable;

public final class Character implements Serializable {
    private String name;
    private Integer age;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public Integer getAge() {
        return age;
    }

    public void setAge(final Integer age) {
        this.age = age;
    }
}

The bean is passed to the toDS(Class<T> beanClass) method to define the schema for the Dataset:

Dataset<Character> explicitDS = MongoSpark.load(jsc).toDS(Character.class);
explicitDS.printSchema();
explicitDS.show();

explicitDS.printSchema() outputs the following:

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)

explicitDS.show() outputs the following:

+----+-------------+
| age|         name|
+----+-------------+
|  50|Bilbo Baggins|
|1000|      Gandalf|
| 195|       Thorin|
| 178|        Balin|
|  77|         Kíli|
| 169|       Dwalin|
| 167|          Óin|
| 158|        Glóin|
|  82|         Fíli|
|null|       Bombur|
+----+-------------+
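
The explicit schema above contains exactly the bean's two properties. As a rough illustration of how a schema can be read off a Java bean, the following sketch lists the properties exposed by public getters using plain reflection. It is an assumption-laden sketch, not a description of how Spark or the connector inspects beans: BeanSchemaSketch, CharacterBean, and describe are hypothetical names, and CharacterBean is a stand-in for the Character bean so that the example is self-contained.

```java
import java.io.Serializable;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class BeanSchemaSketch {

    // Stand-in for the Character bean above; the name CharacterBean is hypothetical.
    public static final class CharacterBean implements Serializable {
        private String name;
        private Integer age;

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public Integer getAge() { return age; }
        public void setAge(Integer age) { this.age = age; }
    }

    // Lists "property: Type" for each public getter, sorted by property name.
    static List<String> describe(Class<?> beanClass) {
        List<String> fields = new ArrayList<>();
        for (Method m : beanClass.getMethods()) {
            String n = m.getName();
            boolean isGetter = n.startsWith("get") && n.length() > 3
                    && m.getParameterCount() == 0
                    && m.getDeclaringClass() != Object.class; // skips getClass()
            if (isGetter) {
                String property = n.substring(3, 4).toLowerCase() + n.substring(4);
                fields.add(property + ": " + m.getReturnType().getSimpleName());
            }
        }
        Collections.sort(fields);
        return fields;
    }

    public static void main(String[] args) {
        describe(CharacterBean.class).forEach(System.out::println);
    }
}
```

Fields that the bean does not declare, such as _id, have no getter and therefore never enter the schema, which is consistent with the two-column output shown above.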

SQL

Before you can run SQL queries on a dataset, you must register a temporary view for it.

The following operation registers a characters table and then queries it to find all characters aged 100 or older:

explicitDS.createOrReplaceTempView("characters");
Dataset<Row> centenarians = spark.sql("SELECT name, age FROM characters WHERE age >= 100");
centenarians.show();

centenarians.show() outputs the following:

+-------+----+
|   name| age|
+-------+----+
|Gandalf|1000|
| Thorin| 195|
|  Balin| 178|
| Dwalin| 169|
|    Óin| 167|
|  Glóin| 158|
+-------+----+
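
Bombur does not appear in the result even though the query has no explicit null check. The sketch below mirrors the WHERE clause in plain Java to make that behavior visible. It assumes standard SQL semantics, in which a comparison against NULL is not true; CentenariansSketch and Person are hypothetical names, only a few of the documents are included, and only the names are returned.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class CentenariansSketch {

    static final class Person {
        final String name;
        final Integer age; // null when the document has no age field

        Person(String name, Integer age) {
            this.name = name;
            this.age = age;
        }
    }

    static List<Person> characters() {
        return Arrays.asList(
                new Person("Bilbo Baggins", 50),
                new Person("Gandalf", 1000),
                new Person("Thorin", 195),
                new Person("Balin", 178),
                new Person("Dwalin", 169),
                new Person("Bombur", null));
    }

    // Plain-Java counterpart of the filter "WHERE age >= 100", returning names only.
    static List<String> centenarians(List<Person> people) {
        return people.stream()
                // In SQL, NULL >= 100 is not true, so rows without an age are dropped.
                .filter(p -> p.age != null && p.age >= 100)
                .map(p -> p.name)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(centenarians(characters()));
    }
}
```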

Save DataFrames to MongoDB

The MongoDB Spark Connector provides the ability to persist DataFrames to a collection in MongoDB.

The following operation saves centenarians to the hundredClub collection in MongoDB:

/* Note: "overwrite" drops the collection before writing,
 * use "append" to add to the collection */
MongoSpark.write(centenarians).option("collection", "hundredClub")
    .mode("overwrite").save();
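
The difference between the two save modes mentioned in the comment above can be modeled in a few lines of plain Java. This is a toy in-memory model, not connector code: SaveModeSketch, Mode, and save are hypothetical names, a list of strings stands in for a collection, and "Elrond" is made-up sample data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SaveModeSketch {

    enum Mode { OVERWRITE, APPEND }

    // Returns the collection contents after a save; the inputs are not modified.
    static List<String> save(List<String> existing, List<String> incoming, Mode mode) {
        List<String> result = new ArrayList<>();
        if (mode == Mode.APPEND) {
            result.addAll(existing); // append keeps the documents already stored
        }
        // With OVERWRITE the old documents are dropped before the new ones are written.
        result.addAll(incoming);
        return result;
    }

    public static void main(String[] args) {
        List<String> stored = Arrays.asList("Elrond");
        List<String> incoming = Arrays.asList("Gandalf", "Thorin");
        System.out.println(save(stored, incoming, Mode.OVERWRITE));
        System.out.println(save(stored, incoming, Mode.APPEND));
    }
}
```

Running the same overwrite twice leaves the collection in the same state, whereas appending twice would store the documents twice.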
