A few days ago we welcomed Richard T. Freeman to BEEVA, whom we had invited to give a meetup talk on Serverless at our offices.
Richard is Lead Data Engineer and Architect at JustGiving, a social-purpose technology company and one of the leading online donation and crowdfunding platforms (in total, JustGiving has helped 28.5 million users from 196 countries raise $4.2 billion for around 27,000 good causes).
Richard is also a certified AWS Solutions Architect with experience analysing Big Data in the cloud, working with unstructured or high-volume data, and delivering scalable solutions. During his time at the technology consultancy Capgemini he worked on complex projects for top-tier multinationals, gaining experience across different business domains.
Richard's meetup talk walked through how to get the most out of AWS Lambda, Amazon S3, Amazon Kinesis and Amazon SQS / SNS, among other services, with the goal of building serverless event-driven processes in companies to handle a continuous stream of data. The talk covered usage patterns for serverless architectures, best practices when adopting them, and lessons learned from real-world use.
We took the opportunity to chat with him for a while about technology. Here is the interview:
What have been the best decisions you have made to develop your professional career in recent years (programming languages, platforms, methodologies, tooling, best practices)?
I think reading lots of blog posts and books, and watching talks and tutorials related to data science and engineering, is essential. On the programming side, I’m always improving my Python development skills. I like Python as it has the advantage of being one of the most popular data science languages: it is lightweight, interactive, supports C++ and CUDA optimisations, and scales out with the likes of PySpark. Another language I recommend is SQL, which allows you to manipulate and analyse data efficiently and complements Python really well. It also scales for big data thanks to massively parallel processing (MPP) distributed database systems like Amazon Redshift and Azure SQL Data Warehouse, and Hadoop frameworks like Hive and Spark SQL.
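As a small illustration of how SQL and Python complement each other, here is a minimal sketch using Python's built-in sqlite3 module; the table and figures are made up for demonstration only:

```python
import sqlite3

# Illustrative only: a tiny in-memory table with invented data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE donations (cause TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO donations VALUES (?, ?)",
    [("health", 50.0), ("health", 25.0), ("education", 100.0)],
)

# SQL handles the aggregation efficiently...
rows = conn.execute(
    "SELECT cause, SUM(amount) FROM donations GROUP BY cause ORDER BY cause"
).fetchall()

# ...and Python takes over for further analysis or modelling.
totals = {cause: total for cause, total in rows}
print(totals)  # {'education': 100.0, 'health': 75.0}
```

The same pattern scales up: swap the sqlite3 connection for an MPP warehouse like Redshift or a Spark SQL session and the division of labour stays the same.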
What is the current adoption level of cloud technologies in the UK? What technical challenges do you think those companies are facing?
Many companies have a lot of complexity in their legacy and monolith architectures. Startups, without the constraints of legacy services and using open source and cloud-only deployments, are able to build and respond to market demands more rapidly. For example, financial technology startups can compete with retail banks on some of their verticals with very small teams and low startup costs. Traditionally, public cloud has been seen as insecure because it is publicly hosted, but this is incorrect: for example, AWS and Azure both offer Payment Card Industry Data Security Standard (PCI DSS) compliance, support encryption at rest and in transit, provide comprehensive network security, and offer managed distributed denial of service (DDoS) protection.
What is the trend (practices, technologies) to solve data transformation and data exploration?
One of the trends is the growth in use of application programming interfaces (APIs) for many tasks that would traditionally have been either built in-house or licensed from a vendor. Now there is wider adoption of pay-as-you-go APIs for natural language processing, image / video tagging, geo-location, payments, and analytics services. This allows developers to create complex and scalable services with a few API calls and less expert or specialised knowledge. There is also now an expectation that data can be exported from any hosted service so that it can be joined with other data sources to provide a more complete picture of the market, business and user base.
What is also starting to emerge is more companies moving from backend offline data science models to the deployment of online machine learning models into production, allowing, for example, consumer or business facing products to be more personalised and driven by user behaviour. For some data sources, the trend is also to move away from overnight batch jobs to near real-time streaming processing pipelines. This enables near real-time feedback which, when fed into online machine learning models, allows a business to make more refined predictions.
More organisations will migrate their existing services to containers and serverless computing, with their databases being moved to managed services as well. There will be further improvements in continuous integration, build and deployment, e.g. serverless functions will be deployed as soon as code changes are checked in.
Which sectors are taking advantage of their data?
All sectors are capturing structured or unstructured data, but many have failed to get the most value out of it. A lot of innovation seems to be done by system integrators, e-commerce apps/websites and vendors working across sectors, especially in consumer-facing companies. There is a huge opportunity in all sectors of industry to become more data-driven organisations, where users, performance, sales and marketing are all measured and insight is provided.
Regarding the cloud, what are the benefits that Serverless provides to companies? What is the adoption level of serverless in companies?
The benefits of serverless computing to companies lie in the speed of deploying a highly available, auto-scaling system at a low cost to run and support. Costs are low because you only pay for the execution time of your code, down to the nearest 100ms, rather than for an always-on cluster, machine or container.
In addition, Lambda function event triggers, such as newly added streaming records or object uploads, are built in, which means that a lot less code is required. Adoption levels are still low, but with new serverless architectures, growing tooling and added functionality this is increasing.
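The built-in triggers mentioned above mean a handler only has to deal with the event payload. Below is a minimal sketch of a Lambda handler wired to an S3 object-upload trigger; the processing logic is hypothetical, while the event shape follows the standard S3 event notification format:

```python
# Minimal sketch of an AWS Lambda handler for S3 upload events.
# The "processing" here is a hypothetical placeholder.
def handler(event, context):
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Real processing (parse, transform, load) would go here.
        processed.append(f"s3://{bucket}/{key}")
    return {"processed": processed}

# Local test with a fake S3 event -- no AWS infrastructure needed.
fake_event = {
    "Records": [
        {"s3": {"bucket": {"name": "uploads"}, "object": {"key": "data.csv"}}}
    ]
}
print(handler(fake_event, None))  # {'processed': ['s3://uploads/data.csv']}
```

Because the trigger plumbing is managed by the platform, this handful of lines is essentially the whole service: no polling loop, no queue consumer, no server to run it on.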
What are the next steps you think public cloud providers will take to mature Serverless?
They are working on code and infrastructure deployment pipelines, and on wider integration with other data APIs, services and vendor products. Adding more configuration flexibility and finer-grained limits per group would also be useful and help them gain wider adoption.
What has been the greatest technical challenge you have faced in the last year?
For serverless and data streams this has been data quality, and ensuring it gets fixed as close as possible to the source. To do this we have data specifications, validation schemas, different levels of quality assurance (QA) and monitoring processes that catch different types of errors in a data stream, e.g. invalid records, missing records, abnormal spikes or trends.
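A validation step of this kind can be sketched in a few lines. The field names and rules below are hypothetical, standing in for a real data specification; the point is the pattern of checking each stream record against a schema and reporting what is missing or malformed:

```python
# Hypothetical record spec: field name -> expected type.
SPEC = {"event_id": str, "amount": float}

def validate(record):
    """Return a list of error strings for one record (empty = valid)."""
    errors = []
    for field, ftype in SPEC.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    return errors

stream = [
    {"event_id": "e1", "amount": 10.0},   # valid
    {"event_id": "e2"},                   # missing amount
    {"event_id": 3, "amount": "10"},      # wrong types
]
report = {i: errs for i, r in enumerate(stream) if (errs := validate(r))}
print(report)
```

In a real pipeline such checks would run inside the stream processor (e.g. a Lambda function on a Kinesis stream), routing failures to a dead-letter queue and feeding counts of invalid or missing records into the monitoring that spots abnormal spikes.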