Spark DataFrame
RDDs are immutable distributed collections of data. A collection can hold any datatype, such as String, Integer, or tuples; for most production data, it ends up being an RDD of tuples.
An RDD itself carries no information about the schema of the data it contains.
A SchemaRDD is an RDD whose elements are Row objects; in addition, it carries the schema of those elements.
SchemaRDD = RDD [Row] + Schema
A DataFrame is a distributed collection of data organized into named columns.
http://ift.tt/1Omfied
Introducing DataFrames in Spark for Large Scale Data Science
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
Under the Hood: Intelligent Optimization and Code Generation
Before any computation on a DataFrame starts, the Catalyst optimizer compiles the operations that were used to build the DataFrame into a physical plan for execution.
Catalyst applies logical optimizations such as predicate pushdown.
Catalyst compiles operations into physical plans for execution and generates JVM bytecode for those plans that is often more optimized than hand-written code.
http://ift.tt/1CEAVej
http://ift.tt/1Omfied
http://ift.tt/1EMa0je
from Public RSS-Feed of Jeffery yuan