These snippets show how to create a DataFrame from scratch out of a list of values. This is mainly useful when building small DataFrames for unit tests. Imagine we would like a table with an
id column identifying a user and then two columns for the number of dogs and cats they have.
The version below uses the SQLContext approach. To test this directly in the pyspark shell, omit the line where
sc is created, since the shell already provides a SparkContext named sc.
import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext()
sqlContext = SQLContext(sc)

columns = ['id', 'dogs', 'cats']
vals = [(1, 2, 0), (2, 0, 1)]

df = sqlContext.createDataFrame(vals, columns)
It is now generally recommended to use
SparkSession instead of
SQLContext, so the same example is adapted for
SparkSession below. To run it in the pyspark shell, skip ahead to the
# make some test data comment, since the shell already provides a SparkSession named spark.
from pyspark.sql.session import SparkSession

# instantiate Spark
spark = SparkSession.builder.getOrCreate()

# make some test data
columns = ['id', 'dogs', 'cats']
vals = [(1, 2, 0), (2, 0, 1)]

# create DataFrame
df = spark.createDataFrame(vals, columns)
To check the DataFrame you have created, try df.show() and df.printSchema().