autoscale:true
@rahuldave
Class is in Science Center B starting THIS Thursday, 17th Sep, 2015!
It took about three years before the BellKor’s Pragmatic Chaos team managed to win the prize ... The winning algorithm was ... so complex that it was never implemented by Netflix. 1
It’s important that our data team wasn’t comprised solely of mathematicians and other “data people.” It’s a fully integrated product group that includes people working in design, web development, engineering, product marketing, and operations. They all understand and work with data, and I consider them all data scientists... Often, an engineer can have the insight that makes it clear how the product’s design should work, or vice-versa — a designer can have the insight that helps the engineers understand how to better use the data. Or it may take someone from marketing to understand what a customer really wants to accomplish.2
- compute: code, python, R, julia, spark, hadoop
- storage/database: git, SQL, NoSQL, HBase, disk, memory
- devops: AWS, docker, mesos, repeatability
- product: database, web, API, viz, UI, story
- memory
- disk: what if we do not fit?
- cluster: what if we still do not fit?
- cluster: what if we need/can use parts?
- What if we MUST bring compute to disk?
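When the data does not fit in memory, one common approach is to stream it from disk in chunks and keep only a running summary. A minimal sketch with pandas: the `StringIO` buffer stands in for a large CSV file on disk, and the column names `user` and `amount` are made up for illustration.

```python
import io

import pandas as pd

# Stand-in for a large CSV on disk (hypothetical data and column names).
csv_source = io.StringIO("user,amount\na,10\nb,5\na,7\nc,1\nb,2\n")

totals = {}
# With chunksize set, read_csv yields DataFrames of at most 2 rows each,
# so only one small chunk is held in memory at a time.
for chunk in pd.read_csv(csv_source, chunksize=2):
    partial = chunk.groupby("user")["amount"].sum()
    # Fold each chunk's partial sums into the running totals.
    for user, amount in partial.items():
        totals[user] = totals.get(user, 0) + int(amount)

print(totals)
```

The same split, summarize, combine pattern is what cluster tools apply across machines instead of across chunks.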
- relational: pandas, SQL: Postgres, SQLite, HBase, VoltDB
- document oriented: MongoDB, CouchDB
- key-value: Riak, Redis, Memcached
- graph oriented: Neo4J
- What is a relational Database?
- What Grammar of Data does it follow?
- How is this grammar implemented in Pandas?
- How is this grammar implemented in SQL?
- A collection of tables related to each other through common data values.
- Each row holds the attributes of one thing
- Every value in a column is a value of the same attribute
- A cell is expected to be atomic
- Tables are related to each other when they share columns, called keys, that hold the same values
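The properties above can be sketched with two small tables in an in-memory sqlite database. The table and column names (`candidates`, `contributions`, `candidate_id`) are hypothetical; the point is that `contributions.candidate_id` is a key holding the same values as `candidates.id`, which is what lets the tables be joined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE candidates (
    id INTEGER PRIMARY KEY,
    name TEXT,
    party TEXT
);
CREATE TABLE contributions (
    id INTEGER PRIMARY KEY,
    candidate_id INTEGER REFERENCES candidates(id),  -- key into candidates
    amount REAL
);
INSERT INTO candidates VALUES (1, 'Alice', 'X'), (2, 'Bob', 'Y');
INSERT INTO contributions VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 25.0);
""")

# Relate the two tables through the shared key, then total per candidate.
rows = conn.execute("""
SELECT c.name, SUM(k.amount)
FROM candidates AS c
JOIN contributions AS k ON k.candidate_id = c.id
GROUP BY c.name
ORDER BY c.name
""").fetchall()
conn.close()

print(rows)
```

Each cell is atomic: a candidate with several contributions gets several rows in `contributions`, not a list stuffed into one cell.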
- Quantitative (Interval and Ratio)
- Ordinal
- Nominal3
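One way to carry these measurement scales into pandas is through column dtypes. This is a sketch, not the only mapping: a float column for a quantitative variable, an ordered `Categorical` for an ordinal one, and an unordered `Categorical` for a nominal one. The columns and values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [150.0, 172.5, 181.0],     # quantitative (ratio)
    "size": ["small", "large", "medium"],   # ordinal
    "color": ["red", "blue", "red"],        # nominal
})

# Ordinal: categories carry an explicit order, so min/max/comparisons work.
df["size"] = pd.Categorical(
    df["size"], categories=["small", "medium", "large"], ordered=True
)
# Nominal: categories have no order; only equality and counting make sense.
df["color"] = pd.Categorical(df["color"])

mean_height = df["height_cm"].mean()       # arithmetic is meaningful
largest = df["size"].max()                 # ordering is meaningful
color_counts = df["color"].value_counts()  # only counts are meaningful

print(mean_height, largest)
print(color_counts)
```

Asking for `df["color"].max()` raises a `TypeError`, because an unordered categorical has no notion of "largest".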
Been around for a while (SQL, Pandas); formalized in dplyr.4
- provide simple verbs for simple things: functions that correspond to common data manipulation tasks
- the second idea is that the backend does not matter. Here we restrict ourselves to Pandas and sqlite
- multiple backends implemented in Pandas, Spark, Impala, Pig, dplyr, ibis, blaze
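To illustrate that the backend does not matter, here is the same chain of verbs (filter, group, aggregate, sort) written once in pandas and once in SQL against sqlite. The data and the names `contributions`, `state`, `amount`, and `total` are assumptions made for this sketch.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({
    "state": ["MA", "MA", "NY", "CA"],
    "amount": [100, 20, 300, 50],
})

# pandas backend: filter -> group -> aggregate -> sort
pandas_result = (
    df[df["amount"] >= 50]
    .groupby("state", as_index=False)["amount"].sum()
    .rename(columns={"amount": "total"})
    .sort_values("total", ascending=False)
    .reset_index(drop=True)
)

# sqlite backend: the same verbs as WHERE, GROUP BY, SUM, ORDER BY
conn = sqlite3.connect(":memory:")
df.to_sql("contributions", conn, index=False)
sql_result = pd.read_sql_query(
    """
    SELECT state, SUM(amount) AS total
    FROM contributions
    WHERE amount >= 50
    GROUP BY state
    ORDER BY total DESC
    """,
    conn,
)
conn.close()

print(pandas_result)
print(sql_result)
```

Both results list the per-state totals in the same order; only the syntax for expressing the verbs differs.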
- learn how to do core data manipulations, no matter what the system
- relational databases are critical when the data does not fit in memory. Big installed base.
- one-off questions: Google, Stack Overflow, http://chrisalbon.com
[fit]GO TO NOTEBOOK5
- data structure regularity is known
- transactions are required
- benefit from years of tuning
- not good for deep hierarchy
- which kind depends on use case: pandas, hbase, columnar, postgres,...
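As an illustration of the "transactions are required" case, here is a small sketch using Python's built-in `sqlite3` module. The `accounts` table and the balances are hypothetical. The two updates form one transfer; when the second violates a constraint, the whole transaction is rolled back and neither update survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()

try:
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance + 150 WHERE name = 'b'"
        )
        # This would leave 'a' at -50, which violates the CHECK constraint.
        conn.execute(
            "UPDATE accounts SET balance = balance - 150 WHERE name = 'a'"
        )
except sqlite3.IntegrityError:
    pass  # the transfer failed as a unit

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
conn.close()

print(balances)
```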
Footnotes
- D. J. Patil, U.S. Chief Data Scientist, Building Data Science Teams. O'Reilly Media, Inc., 2011.
- S. S. Stevens, "On the Theory of Scales of Measurement", Science, New Series, Vol. 103, No. 2684 (Jun. 7, 1946), pp. 677-680.
- Hadley Wickham: https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
- Diagram from 7 databases in 7 weeks, Pragmatic Programmers