He is a regular speaker at conferences and meetups around the world, a contributor to Apache Spark and the Netflix Open Source community, and founder of the Global Advanced Spark and TensorFlow Meetup. Chris is also the author of a forthcoming book, Advanced Spark, and creator of an O’Reilly video series, Deploying and Scaling Distributed TensorFlow in Production.
Previously, he was a Distributed Systems Engineer at Netflix, a Data Solutions Engineer at Databricks, and a founding member of the IBM Spark Technology Center in San Francisco.
This week Chris is in Spain to take part in Big Data Spain and, while here, he spoke yesterday at our meetup on Big Data on Amazon Web Services alongside our colleague Iván Fernández Perea. He will also teach a workshop this coming Saturday at BEEVA: End-to-End Streaming ML Recommendation Pipeline Spark 2.0, Kafka, TensorFlow... If you are interested, you can find the details and buy tickets on Eventbrite.
To make the most of his visit, here is our interview with Chris Fregly.
What is the purpose of your visit to Spain?
I’m very excited to be back in Spain! I was invited to keynote Big Data Spain 2016 along with some of my friends and colleagues, including Paco Nathan and Michael Hausenblas.
Previously, I spoke at the Madrid and Barcelona Apache Spark Meetups last year to very full and curious crowds. There’s a lot of good tech talent in Spain. I had some really great conversations. And I’m excited to meet more of the technical community in Madrid, specifically.
What will be the content and focus of your workshop on November 19th about End-to-End Streaming ML Recommendation?
The goal of the workshop is to build an end-to-end recommendation engine including everything from the HTML-based UI to the Cassandra-based storage layer to the REST-based Prediction Services.
The workshop is always evolving to keep up with technology trends. At the moment, I’m focused on Kubernetes, Docker, Kafka Streaming, Spark ML, TensorFlow AI, Redis, Cassandra, Elasticsearch, and high-performance REST-based Prediction Services using Netflix Open Source.
It’s a unique combination of data science and data engineering that is rarely covered in a single day-long event. We cover a lot of material in a short amount of time, but also spend time discussing use cases and real-world scenarios.
You also took part in BEEVA’s Big Data on AWS Meetup. What did you talk about?
As a meetup organizer myself, I get very excited to speak at other meetups across the globe!
For the BEEVA Innovative Technology Meetup, I will likely focus on the Spark ML and TensorFlow AI Prediction Layer. This is the often-overlooked “last mile gap” that extends trained ML and AI models into user-facing production. Today, this gap is primarily filled by proprietary cloud provider services from AWS, Azure, and Google.
At PipelineIO, we’re combining our Databricks and Netflix experience to build a high-performance, open source ML and AI Serving Layer option to compete with these proprietary cloud services. In addition, our goal is to enable data scientists to safely and quickly perform their ML and AI experiments on live, streaming, production data.
What do you know about Spanish technology companies and about the development of big data and machine learning projects in Spain?
In Oct 2015, I was fortunate to attend an IoT conference in Barcelona. This was a great introduction to the types of commercial innovation happening in Spain.
In addition, Jordi Torres, a professor at UPC Barcelona Tech, invited me to speak at his Barcelona Apache Spark Meetup. This was a great opportunity to meet the local big data community. I was very impressed with the questions and feedback from both the academic and commercial audience members.
I didn’t have time to see the Barcelona Supercomputing Center with my own eyes, but that is a goal of mine for 2016!
I am excited to learn more about the Madrid big data community. My last stay in Madrid was very short, so I hope to meet more people at Big Data Spain.
What challenges and difficulties do companies face when adopting these technologies?
This question is best answered with the details of my own personal experience moving (physically and logically) from a medium-data, medium-scale community (Chicago) to a big-data, large-scale community (San Francisco). At that scale you deal with long queries and lots of insight and metrics, which requires fast experimentation, thin barriers, elastic and scalable infrastructure, etc.
Which industries could benefit the most from these technologies?
Obviously, IoT is an industry that generates a lot of data. However, this applies to all industries as the amount of operational metrics and insight data increases across all apps, big and small…
Who is becoming a reference in the machine learning space?
I have a lot of respect for smaller companies like Clarifai and OpenAI who are pushing the boundaries of AI including supervised image/video classification, unsupervised reinforcement learning, transfer learning, natural language generation, adversarial networks, convolutional networks, recurrent networks, etc.
Obviously, the larger companies like Google and Facebook have been very generous with their open source offerings throughout the years. Most recently, Google donated TensorFlow to the open source community. With nearly 40,000 stars and 20,000 forks on GitHub, there’s no question that Google is a leader in AI research and development.
What is your prediction for the future? What do you think about the evolution and the future of Spark and TensorFlow?
I see Spark and TensorFlow as distinct technologies with distinct roles in ML and AI model training. Spark is great at parallel, batch workloads such as ETL and MapReduce. It’s also very good for highly parallel ML tasks such as model hyper-parameter tuning, including grid search.
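To illustrate why grid search counts as a highly parallel task, here is a minimal sketch using only the Python standard library. It is not Spark code: the thread pool stands in for a cluster, and `validation_loss` is a hypothetical placeholder for "train a model and score it on held-out data". The point is only that every grid point can be evaluated independently of the others.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def validation_loss(params):
    """Hypothetical stand-in for training a model and scoring it on held-out data."""
    learning_rate, regularization = params
    # Toy loss surface whose minimum sits at (0.1, 0.01).
    return (learning_rate - 0.1) ** 2 + (regularization - 0.01) ** 2


def grid_search(learning_rates, regularizations, max_workers=4):
    """Evaluate every grid point independently and return (best_params, best_loss)."""
    grid = list(product(learning_rates, regularizations))
    # No grid point depends on another, so the evaluations can run in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        losses = list(pool.map(validation_loss, grid))
    best_index = min(range(len(grid)), key=losses.__getitem__)
    return grid[best_index], losses[best_index]


best_params, best_loss = grid_search([0.01, 0.1, 1.0], [0.001, 0.01, 0.1])
print(best_params, best_loss)
```

On a real cluster, the same shape applies: the grid is distributed across workers and only the per-point scores are collected back.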
While there have been some clunky grass-roots efforts to integrate Spark and TensorFlow (e.g., TensorFrames), my prediction is that these technologies will continue to improve along their dominant axes: batch ETL and MapReduce for Spark, and distributed ML/AI training and serving for TensorFlow.
I want to point out that TensorFlow Serving was a first-class citizen in Google’s initial TensorFlow offering in November 2015. TensorFlow Serving is actually my favorite part of TensorFlow. It’s really the first time we see a high-profile, open source machine learning project acknowledge the often-overlooked serving (aka prediction) portion of the extended ML/AI pipeline.
TensorFlow Serving boasts some really advanced features including request batching, model versioning, rollbacks, monitoring, metrics, etc. At PipelineIO, we’ve incorporated a lot of these concepts into our NetflixOSS-based serving layer for Spark ML and PMML-based models.
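For readers unfamiliar with these serving concepts, the following is a small, purely illustrative Python sketch of model versioning, rollback, and request batching. It does not use TensorFlow Serving or PipelineIO APIs; `ModelRegistry` and all of its methods are hypothetical names invented for this example, and a "model" here is just a function that maps a list of inputs to a list of predictions.

```python
class ModelRegistry:
    """Hypothetical in-memory registry illustrating versioning, rollback, and batching."""

    def __init__(self):
        self._models = {}    # version number -> model (callable taking a list of inputs)
        self._history = []   # versions in the order they were made live

    def deploy(self, version, model):
        """Register a new model version and make it the live one."""
        self._models[version] = model
        self._history.append(version)

    def rollback(self):
        """Return to the previously live version."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()

    @property
    def live_version(self):
        return self._history[-1]

    def predict(self, requests, batch_size=2):
        """Group requests into batches and call the live model once per batch."""
        model = self._models[self.live_version]
        predictions = []
        for start in range(0, len(requests), batch_size):
            predictions.extend(model(requests[start:start + batch_size]))
        return predictions


registry = ModelRegistry()
registry.deploy(1, lambda batch: [x * 2 for x in batch])
registry.deploy(2, lambda batch: [x * 3 for x in batch])
after_deploy = registry.predict([1, 2, 3])

registry.rollback()  # version 2 misbehaves, so go back to version 1
after_rollback = registry.predict([1, 2, 3])
```

Batching matters because many models are far cheaper per request when scored in groups, and keeping old versions around is what makes a rollback a cheap pointer change rather than a redeploy.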
Lastly, it’s worth mentioning TensorBoard for insight into AI model training. Again, this is really the first time we see a high-profile, open source machine learning project emphasize metrics and insight into the training process. I always emphasize TensorBoard during TensorFlow workshops and demos. This type of insight lets us visualize the training process and also makes it possible to terminate model training early.
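Early termination can be made concrete with a short sketch. The Python below is an assumption-laden illustration, not TensorBoard or TensorFlow code: it takes a list of per-epoch validation losses (the kind of curve one would watch on a training dashboard) and stops once the loss has failed to improve for `patience` consecutive epochs. The function name and the `patience` parameter are hypothetical.

```python
def stop_epoch(validation_losses, patience=2):
    """Return (epoch_at_which_training_stops, best_loss_seen).

    Training stops once the validation loss has not improved
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_loss
    # The loss kept improving often enough that training ran to the end.
    return len(validation_losses) - 1, best_loss


stopped_at, best = stop_epoch([0.9, 0.7, 0.6, 0.65, 0.66, 0.5], patience=2)
print(stopped_at, best)
```

Note the trade-off this exposes: with a small `patience`, the run above stops before reaching the later, lower loss, which is exactly the kind of decision that is easier to make when the training curve is visible.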
What would you say to readers to encourage them to join us at next Saturday’s workshop?
There’s no substitute for in-person collaboration and knowledge sharing. I love meeting the local communities face-to-face to address their specific use cases, discuss trade-offs of the various technologies like Spark and TensorFlow, and grab a few cervezas together! 🙂
Image: DBTB INT (YouTube)