From galaxies to machine learning: one data scientist’s journey

After working around the globe as an astrophysicist and university professor, Dr. Sébastien Foucaud made his way into the world of data analytics, machine learning and big data technology. Today, he’s head of data science at Scout 24, a real estate and automotive marketplace, where he leads data science projects that shape product development. He sat down with us to share his rich experience working with diverse types of data, from telescope imaging data to digital analytics data from AT Internet. Read on to learn how machine learning can help businesses, his advice for data scientists who are just starting out, and how he sees the field evolving in the near future.

 

 

You’ve paved a rather unique career path. How did your work in astrophysics lead you to your profession today as a data scientist?

At age 7, I got my first telescope and decided I wanted to be an astronomer. Astrophysics is a highly analytic field – I didn’t see it that way at the time, but it has become very important to my work today.

After getting my PhD in astrophysics from the University of Provence in Marseille, I worked for several years gathering data from telescopes and developing software to extract this information automatically. In 2005, I led a project with the University of Nottingham involving a telescope in Hawaii that generated 100 terabytes of data each night. We developed technology to extract information from the flow of images generated by that telescope, transfer these data to a database in Scotland, and update the data automatically. Our goal was to process data from the night in real time. We had gathered huge amounts of data in the database, but much of it went unexploited as the technology at that time was not sufficient to support exploiting these volumes of data.

This became one of the problems I wanted to tackle – exploiting large masses of data. As astrophysics projects are often government-funded, the data is publicly available, and there’s lots of it. Ideally, you don’t even need to go back to the telescope because the data has already been collected and stored somewhere. So I began focusing on the question, how can we use technology to go back and exploit those existing databases? That’s how I got into computer science and was first exposed to Hadoop. My goal was to gather all this information in one place, run analytics on those data, exploit tens of billions of entries at once, and do data science with it.

I spent 4 years in Taiwan as a university professor, where I got an astronomical data centre up and running. The data was viable and we could calculate simple statistics (like mean or median) for tens of billions of data points, which had not been possible before. Once we had access to such a statistically viable dataset, we could begin thinking about how to use machine learning to classify objects and derive insights, such as the distances of galaxies.
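The interview does not describe how the data centre computed these statistics. As a rough illustration only, the sketch below shows the general map-and-merge idea behind frameworks such as Hadoop: each chunk of data is reduced to a small partial result, and the partial results are combined. The function names and the sample chunks are hypothetical.

```python
from functools import reduce

def partial_stats(chunk):
    """Reduce one chunk of values to a (sum, count) pair."""
    return (sum(chunk), len(chunk))

def merge(a, b):
    """Combine two (sum, count) pairs into one."""
    return (a[0] + b[0], a[1] + b[1])

def distributed_mean(chunks):
    """Mean over all chunks, computed from per-chunk partial results."""
    total, count = reduce(merge, map(partial_stats, chunks), (0.0, 0))
    if count == 0:
        raise ValueError("no data points")
    return total / count

# Hypothetical data: in practice each chunk would live on a different node.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
print(distributed_mean(chunks))
```

The mean works this way because sums and counts can be added across chunks. An exact median cannot be merged from per-chunk medians in the same manner, so it needs a different strategy, such as sorting or an approximate quantile method.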

I then spent 2 years as a professor in Shanghai, where I helped build a similar data centre at an even larger scale. But it was around this time that I realised what I wanted to do in life had shifted from astrophysics to data science for product development: using data to find solutions to help people and companies better do whatever they’re looking to do.

 

What types of applied data science and machine learning projects have you worked on?

After I left China and academia, I moved to Berlin to join a P2P lending start-up as head of data science, where my role was to ramp everything up using data science and machine learning, starting with risk management and assessment. For example, when someone wants to borrow money, how could we assess the risk associated with lending to that individual? Solving this was very similar to how we measure the distance of galaxies – it’s a classification problem. You create a model based on historical data, classifying into different “boxes” (whether those “boxes” are different levels of risk, or different-coloured galaxies…). Both are questions of classification based on existing information.
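To make the "boxes" idea concrete, here is a minimal sketch of one simple classification technique, a nearest-centroid classifier. It is not the model used at the lending start-up, which the interview does not describe. The two features, their values, and the labels are invented for illustration.

```python
from math import dist
from statistics import mean

# Hypothetical historical records: (feature vector, known label).
# The two features might be, say, a repayment score and a debt ratio.
history = [
    ((0.9, 0.1), "low_risk"),
    ((0.8, 0.2), "low_risk"),
    ((0.2, 0.9), "high_risk"),
    ((0.3, 0.8), "high_risk"),
]

def fit_centroids(records):
    """Average the feature vectors of each label to get one centroid per 'box'."""
    grouped = {}
    for vec, label in records:
        grouped.setdefault(label, []).append(vec)
    return {
        label: tuple(mean(column) for column in zip(*vecs))
        for label, vecs in grouped.items()
    }

def classify(vec, centroids):
    """Assign a new record to the label whose centroid is closest."""
    return min(centroids, key=lambda label: dist(vec, centroids[label]))

centroids = fit_centroids(history)
print(classify((0.85, 0.15), centroids))
```

Swapping the labels for galaxy colour classes and the features for telescope measurements leaves the code unchanged, which is the point made in the answer above: both problems share the same structure.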

I then moved to OLX Group, a global network of classifieds platforms, where I built a team of data scientists, data analysts and data engineers from the ground up. Our job was to use the available data to solve company problems, such as automating ad moderation to the tune of 2 million daily listings. Once again, we solved this using classification: We had a wealth of data on ads that had been manually approved by customer service reps in the past, which we used to create models for automatic ad moderation. In my time at OLX, I also worked on projects involving better exploiting text and images in ad listings, classifying images based on their quality to enhance the user experience, and automatically recognising and extracting features from ad images. It was then that I began working with AT Internet digital analytics data, which was the foundation for all our core KPIs at OLX.

 

What types of projects are you leading in your current role? What does your team look like?

At Scout 24, I lead about 10 data scientists and 20 data engineers in cross-functional teams, working in a very agile, lean environment. Our data engineers focus on building up our data stack, whereas our data scientists are very business-driven and focus on using machine learning to develop products that will enhance the company’s value for users. It’s a tech-driven company that has understood that data is its wealth, which is why we’re ramping up the data science team slowly but steadily.

One of our main projects is a recommendation engine, launched at the end of 2016. This engine works in two ways: based on user similarity (where we recommend listings based on what others have consulted, e.g. “Users who viewed this apartment listing also viewed these other listings”), and feature similarity (where we recommend listings based on similarities in the listings’ features, e.g. “You’re looking at blue automatic 4-door cars, here are other listings featuring blue automatic 4-door cars”). We’re also working on automatic detection of images and features in those images, such as being able to detect whether it’s an Audi or a BMW in the photo, or if it’s a kitchen or a bathroom.
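The two approaches described above can be sketched in a few lines of Python. This is an illustrative toy, not Scout 24's implementation: it assumes co-view counting for the user-based side and Jaccard similarity over feature sets for the feature-based side, and all listing IDs, sessions, and features are made up.

```python
from collections import Counter
from itertools import combinations

# Hypothetical browsing sessions: the listing IDs each user viewed.
sessions = [
    {"apt_1", "apt_2", "apt_3"},
    {"apt_1", "apt_2"},
    {"apt_2", "apt_4"},
]

# Hypothetical listing features.
features = {
    "car_1": {"blue", "automatic", "4-door"},
    "car_2": {"blue", "automatic", "4-door"},
    "car_3": {"red", "manual", "2-door"},
    "car_4": {"blue", "manual", "4-door"},
}

def recommend_by_users(listing, sessions, k=2):
    """'Users who viewed this also viewed': rank listings by co-view count."""
    co_views = Counter()
    for viewed in sessions:
        for a, b in combinations(sorted(viewed), 2):
            if a == listing:
                co_views[b] += 1
            elif b == listing:
                co_views[a] += 1
    # Highest count first; ties broken by ID so the result is deterministic.
    ranked = sorted(co_views.items(), key=lambda item: (-item[1], item[0]))
    return [other for other, _ in ranked[:k]]

def jaccard(a, b):
    """Share of features two listings have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_by_features(listing, features, k=2):
    """Rank other listings by feature overlap with the given listing."""
    target = features[listing]
    scored = [
        (jaccard(target, feats), other)
        for other, feats in features.items()
        if other != listing
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]

print(recommend_by_users("apt_1", sessions))
print(recommend_by_features("car_1", features))
```

A production engine would weight recency, handle listings with no views yet, and run at a very different scale, but the split between behaviour-based and feature-based signals is the same.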

How is this project impacting the business?

We’ve seen great results with the recommendation engine – a clickthrough rate 6 times higher than on normal search. In other words, with regular search, our consumers need to go through 50 listings to find what they are looking for, whereas they only need to look at 8 of our recommended listings instead! In addition, there is only 30% overlap with listings they’ve already consulted, meaning 70% of these recommendations are new listings we’re helping them discover.
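The figures quoted here are consistent with one another, as a quick check shows. The sketch assumes that clickthrough rate is inversely proportional to the number of listings a user views before finding a suitable one. That is an interpretation of the answer above, not a definition given in the interview. The variable names are ours.

```python
# Figures quoted in the interview.
listings_per_find_search = 50
listings_per_find_recommended = 8
overlap_with_seen = 0.30

# Assumption: clickthrough rate scales as 1 / (listings viewed per find).
ctr_ratio = listings_per_find_search / listings_per_find_recommended
new_share = 1 - overlap_with_seen

print(f"Clickthrough ratio: {ctr_ratio:.2f}x")   # 6.25x, i.e. roughly 6 times
print(f"New listings: {new_share:.0%}")          # 70%
```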

What mistakes do companies commonly make when implementing data science initiatives?

Many companies put the cart before the horse: They ramp up in terms of infrastructure, they invest massively in clusters and servers, on premises or in the cloud, to collect data, they hire data scientists… all before knowing specifically what they want to achieve. Then they’re disappointed when they realise their data scientists can’t just solve their problems.

First, companies must know what they want to do and which business cases they want to solve. Then, they must determine whether they already have the data to solve those problems, and if not, how they will acquire the necessary data. After that, they must stream data to a place where it can be used (usually cloud-based), then determine what types of data science skillsets are needed to solve these specific problems, and then find a data scientist who fits the bill. This is where it’s essential to partner with a good data strategist and a quality provider of data.

I’m currently advising a start-up (certace) which matches highly qualified data scientists to projects at Fortune 500 or 1000 companies. The freelance model is popular with data scientists, as we’re not necessarily interested in a specific company, but rather in specific types of projects. This approach is also interesting for companies – they don’t have to build their own data science team, but can still work with expert data scientists who are properly matched to the right project.

“Data scientist” can be a broad description. What type of data scientist do you consider yourself to be?

I consider myself a “data strategist” – someone who understands what the business needs, who can transform that into a framework, and then use data science to create a solution for the company.

Indeed, there are all different breeds of “data scientists” and they run the gamut from very business-driven analysts who can use Python or Scala to extract information and transform it into business insights using a machine learning toolbox, to developer-type data scientists who can use machine learning and advanced technology to develop products that can be scaled on the platform. These are fundamentally two different types of profiles: one is very much on the analytics side of things and understands the business, whereas the other is an engineer with a strong dev background.

What they must have in common is some sort of business understanding, to different degrees. Of course, not everyone needs an MBA, but if you want to work efficiently as a data scientist in a business environment, you must be able to understand your impact on company performance.

In your experience, what background or skills are necessary to be a good data scientist?

There is not one single path that must be taken; many people with diverse backgrounds and training become data scientists. In my case, I have a very strong background in statistics, which I think is essential. Why? As a data scientist, you must be able to understand what can be exploited from your data sets, and whether your data set is good enough – big enough, unbiased, etc. – to provide all the information necessary to solve the problem. Statistics is definitely at the core of this.

I would say it’s also important to have a broad understanding of how to stream data. Though I’m not coding things directly, I have solid technical knowledge of algorithms and how they work, and I understand the architecture and infrastructure behind it, especially with Spark.

 

What advice would you give to someone interested in becoming a data scientist?

First, determine whether you want to apply your knowledge to helping a business solve its problems, or whether you want to work purely on the research side.

Try an internship – there are more and more data science internship offers out there. Take the opportunity to see what you like. You might specialise in one direction later in your career, but for your first jobs and while you’re still learning, it’s about finding what you like.

Don’t get stuck focusing too much on any one particular language. If you’re comfortable with Python, then the company should adapt to Python. If you’re working with R, the company should adapt to R. Don’t be too closed-minded about one technology over another… it won’t matter anyway in 5 years as things evolve and you have to learn the newest language!

Data scientists are always focused on technology and algorithms, but my advice is to seek out the business value in what you do, and understand the impact on the company. This will make the difference between being a good data scientist, and being a good data scientist with a job!

What do you like most about your profession? What motivates you?

What I like most is developing products and the strategic side of solving problems for companies. I’m very interested in how we can expand data science usage to other types of fields and businesses, outside of the digital, ecommerce and classifieds sectors. What motivates me is helping companies understand what they want to achieve, and then accomplishing it with the right data, developments, and data scientist talent.

How do you see data science and machine learning evolving over the next decade?

Globally, we’re moving more to the cloud, and we’re increasingly able to solve problems now that we are scaling up: Clusters are cheaper and more powerful, and we can stream data at scale. This is a good starting point and it is only going to improve over time.

I don’t see the technology ramping up disruptively, but rather machine learning will scale up and across a wider field of application. The maturity of machine learning is such that we can touch any kind of domain today, so I see data science being used and embraced in new sectors that were previously closed. Blockchain technology, augmented reality, virtual reality, speech recognition and neural networks will all have an impact in the next decade.

And the million-dollar question: Will machines overtake us all? :-) 

In the future, we’ll become even more interdependent with machines as they become more responsive. If we look at how we’re solving problems today using artificial intelligence and machine learning, it’s a question of optimisation – being more efficient, making more money. If we generalise that and try to optimise all around us, how long will it take the machines to realise that the humans are actually the bottleneck to these optimisations? [laughs]

We shouldn’t be afraid of this, though, because we can use this technology to our advantage in the interest of “augmented humanity”… bettering ourselves thanks to technology so that we won’t need to be replaced.

Many thanks to Dr. Foucaud for sharing his experience with us! You can reach him at SBSF Consultancy. Stay tuned to the blog in coming weeks for more data science discussion with Dr. Foucaud!
