Big Data! Great! Now What? #SymfonyCon 2014
Transcript
- 1. BIG DATA! Great! Now what? Ricard Clau SymfonyCon 2014
- 2. HELLO WORLD! • Ricard Clau, born and raised in Barcelona • Server engineer at Another Place Productions • Symfony2 lover and PHP believer (sometimes…) • Open-source contributor, sometimes I give talks • Twitter (@ricardclau) / Gmail ricard.clau@gmail.com
- 3. WE WILL TALK ABOUT… • Where / how to store / query our “BIG” DATA • SQL vs NoSQL: how did we end up here? • Strengths and weaknesses of both approaches • PHP / Symfony status with these technologies • Some war stories and recommendations
- 4. QUICK DISCLAIMERS • Not your average PHP talk, not sure if you will be able to use this next week at work • Continuous learner about all these technologies • 100M records is NOT BIG DATA
- 5. “Big data is like teenage sex; everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it”. Dan Ariely, Duke University
- 6. 2 BIG PROBLEMS
- 7. PROBLEM 1: STORAGE
- 8. PROBLEM 2: QUERYING
- 9. A BIT OF HISTORY Maybe we have not learnt so much…
- 10. A (NOT SO) LONG TIME AGO • Programmers processed files directly • Lots of people were doing the same, so the first databases appeared, each with different APIs, strengths and weaknesses • In the early 70s IBM came up with the SEQUEL (Structured English Query Language) idea, and the rest is history
- 11. WHY DOES NOSQL EXIST? • RDBMSs do not scale horizontally very well • Google, Amazon, Facebook, etc… started building their own solutions to meet their unique needs • When your data does not fit in one box, you need to give up consistency or availability • Some problems need a different approach
- 12. THE CURRENT CHAOS
- 13. RDBMS SYSTEMS Old rockers never die
- 14. SQL • A “common” query language • We can normalise data and query it • Easy to do joins, filters, aggregations • We don’t need to know in advance how we access data • We rely on each database server’s query optimiser (and sometimes we need a DBA)
- 15. ACID PROPERTIES A C I D Atomicity Transactions are all or nothing Consistency A transaction is subject to a set of rules Isolation Transactions do not affect each other Durability Written data will not get lost
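To make "all or nothing" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `accounts` table, the `CHECK` rule and the helper names (`make_db`, `transfer`, `balances`) are assumptions made for illustration; they are not from the talk.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a rule: balances can never go negative."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " name TEXT PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 20)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            # Credit first, so a failing debit has something to roll back.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK rule (consistency) rejected the debit; the credit was undone.
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

Calling `transfer(conn, "alice", "bob", 500)` on the sample data returns `False` and leaves both balances untouched, because the credit to bob is rolled back together with the rejected debit.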
- 16. WE NEED ACID • Banking, logistics, finance, e-commerce,… • Systems we started building 30 years ago… and we still work on them generating millions of $ daily! • There are many applications that still fit the relational model and have structured data
- 17. USUAL PROBLEMS • Sharding is achievable, but painful, and you need to give up some ACID guarantees • Tricky for unstructured data • Not great for low read / write ratios (write-heavy workloads) • Some data structures do not map well to tables
- 18. TRICKY SCENARIOS • Geospatial queries for augmented reality • Leaderboards for social activity, set operations • Columnar aggregations on big tables • Graph traversal to analyse your customers • Search engines over big chunks of text
- 19. NOSQL SYSTEMS Different problems, different solutions
- 20. BASE PROPERTIES • Basically Available: appears to work most of the time • Soft state: state of the system may change even without a query • Eventual consistency
- 21. CAP THEOREM • A shared-data system cannot guarantee simultaneously: • Consistency: All clients have the same view of the data • Availability: Each client can always read and write • Partition tolerance: The system works well even when there are network partitions
- 22. “During a network partition, a distributed system must choose between either Consistency or Availability”
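- 23. CAP triangle (Consistency / Availability / Partition Tolerance) • Consistency + Availability: single node, mostly RDBMS (MySQL, PostgreSQL, DB2, SQLite…) • Availability + Partition Tolerance: all nodes have the same role (Cassandra, Riak, DynamoDB…) • Consistency + Partition Tolerance: special nodes (Zookeeper, HBase, MongoDB, Redis…)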
- 24. CONSISTENT HASHING
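The slide itself only carries the title, so the following is a minimal sketch of one common way to build a consistent hash ring, not the specific scheme of any database named in the talk. The class name `HashRing`, the use of MD5 and the default of 100 virtual points per node are assumptions chosen for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Maps keys to nodes so that adding or removing a node moves few keys."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual points per node, smooths the distribution
        self._positions = []      # sorted hash positions on the ring
        self._owners = {}         # position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            pos = self._hash(f"{node}#{i}")
            self._owners[pos] = node
            bisect.insort(self._positions, pos)

    def remove(self, node):
        for i in range(self.replicas):
            pos = self._hash(f"{node}#{i}")
            del self._owners[pos]
            self._positions.remove(pos)

    def get(self, key):
        """Return the node owning the first ring position clockwise from the key."""
        if not self._positions:
            raise LookupError("ring is empty")
        idx = bisect.bisect(self._positions, self._hash(key)) % len(self._positions)
        return self._owners[self._positions[idx]]
```

With a plain `hash(key) % node_count` scheme, changing the node count remaps most keys. With the ring, removing a node only reassigns the keys that node owned, and adding a node only takes keys over to the new node.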
- 25. I TOTALLY NEED ACID! Are you sure about that?
- 26. EVENTUAL CONSISTENCY If you are using master-slave replication, you already have eventual consistency in your reads
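The point about replication can be illustrated with a toy model. This is a sketch under simplifying assumptions: `Primary` and `Replica` are hypothetical in-memory classes, and replication lag is modelled as an explicit `sync()` call rather than a real network delay.

```python
class Primary:
    """Accepts writes and keeps an ordered log for replicas to replay."""

    def __init__(self):
        self.data = {}
        self.log = []

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    """Serves reads from its own copy, which may lag behind the primary."""

    def __init__(self, primary):
        self.primary = primary
        self.data = {}
        self.applied = 0  # number of log entries already replayed

    def read(self, key):
        return self.data.get(key)

    def sync(self):
        # Replay only the log entries that have not been applied yet.
        for key, value in self.primary.log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.primary.log)
```

A read from the replica between `primary.write("k", "v")` and `replica.sync()` returns the old value; after the sync it returns the new one. That window is the eventual consistency the slide refers to.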
- 27. ANALYTICS / STATS We can possibly afford losing a small % of the data
- 28. TRANSACTIONS Bank transfers happen asynchronously as well!
- 29. WHAT ABOUT PHP & SYMFONY? Is there any hope for us?
- 30. PHP: BEST WEB PLATFORM? • PHP is still heavily used, despite its many quirks • Mature, actively maintained libraries for everything • Composer makes things much easier these days • Symfony bundles for almost everything • Some databases consider PHP a second class citizen
- 31. Key-value Graph Column Document
- 32. KEY-VALUE STORES • Simple APIs, easy to install and use. You are already using them for caching, sessions, etc… • PHP Extensions: memcached, phpredis • Libraries: nrk/predis, basho/riak, aws/aws-sdk-php • Bundles: snc/redis-bundle, leaseweb/memcache-bundle, kbrw/riak-bundle
- 33. GRAPH DATABASES • Very verbose queries, access via REST APIs • Maybe not mature enough for source of truth • Libraries: everyman/neo4jphp • Bundles: klaussilveira/neo4j-ogm-bundle • IMHO, one of the next big things
- 34. CYPHER QUERY EXAMPLES • Top 5 Sushi restaurants in New York for Philip’s friends • 2nd degree co-actors who have never acted with Tom Hanks
- 35. COLUMN-BASED STORES • Possibly the most suitable for Big Data • Redshift supports SQL in a petabyte-scale database • Libraries: thobbs/phpcassa, pop/pop_hbase, PDO for Redshift (with some quirks) • IMHO, Cassandra will become THE database
- 36. DOCUMENT DATABASES • MongoDB and Couchbase look very shiny… but the Internet is FULL of scaling horror stories • PHP Extensions: mongodb, couchbase • Libraries: doctrine/mongodb • Bundles: doctrine/mongodb-odm-bundle
- 37. SEARCH ENGINES • Mostly Lucene-based • PHP Extensions: solr, sphinx • Libraries: solarium/solarium, elasticsearch/elasticsearch • Bundles: nelmio/solarium-bundle, friendsofsymfony/elastica-bundle
- 38. DATA ANALYSIS All businesses need this!
- 39. QUERY VS PROCESSING • SQL is great because we can query by any field • There is no standard query language across NoSQL databases • NoSQL systems are more limited: lookups by key only (some allow secondary indexes) or complex graph syntax • We sometimes need processing for complex queries
- 40. MAP-REDUCE
- 41. HADOOP VS SPARK • Techniques to extract subsets of the data (MAP) and operate on them in parallel before aggregating (REDUCE) • Not real time; Hadoop is the most popular • Apache Spark opens a new paradigm for near real-time • You need other languages for these techniques
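The shape of a MAP / REDUCE job can be sketched in a few lines of plain Python. This is not Hadoop or Spark code: the function names are hypothetical, and a local thread pool stands in for the cluster workers that would process each chunk.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """MAP: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group the values emitted by every mapper under their key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """REDUCE: aggregate the grouped values, here by summing the counts."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks, workers=4):
    # Each chunk is mapped independently, which is what makes the work parallelisable.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))
```

For example, `word_count(["big data big", "data now what", "Big"])` returns `{"big": 3, "data": 2, "now": 1, "what": 1}`. In a real cluster the chunks live on different machines and the shuffle step moves data over the network, which is a large part of why these jobs are not real time.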
- 42. FINAL THOUGHTS Now what?
- 43. ENGINEERING CHALLENGES • The Internet of Things will generate real BIG DATA • SQL / ACID technologies are not going anywhere • Be very careful when using NoSQL in production • Databases… and life… are full of tradeoffs • The next decade will be fascinating for the industry
- 44. READ THE DOCS CAREFULLY
- 45. CHOOSE THE RIGHT TOOL
- 46. QUESTIONS? • Twitter: @ricardclau • E-mail: ricard.clau@gmail.com • Github: https://github.com/ricardclau • Please rate the talk at https://joind.in/talk/view/12958