I’m 26, and I studied at the Mines School in Nancy. At the time, I didn’t know where I was heading, so I decided to take a gap year. I did two internships that year, in two different sectors: the first was in portfolio management, and the second was in supply chain sales forecasting for Lu. At the end of that year, I chose to specialize in data mining and joined ENSAE for a Master’s degree in Data Science. My last internship was in biostatistics, where I analyzed data to detect cancers. By the end of that internship, I knew I wanted to continue the adventure in a startup as a data scientist. I was recruited by Dataiku and have been working there for a year and a half.
The job of data scientist actually covers several different profiles. You can find people specialized in machine learning, in architecture, or in algorithms. The data scientist starts from raw data that cannot be interpreted as is; he must work out which data to extract, turn it into a usable form, and bring out concrete indicators. To do so, the data scientist works with different algorithms, so he needs technical skills in both software development and mathematics. He analyzes the data gathered by the company (on clients, prospects, employees...) so that it can be used for purposes such as marketing, fraud detection, and image recognition.
The data scientist is also in charge of identifying which levers the business can actually act on. He makes recommendations to improve the product or service, or even the company’s overall performance. Obviously, the data scientist is only one link in the chain and works closely with other departments (marketing or sales). After working on the data, he gives his recommendations, usually to the marketing department, which then passes them on to the sales department. Machine learning is quite important for the data scientist. It combines massive amounts of data with algorithms, which makes it possible to discover significant links in the data. Sometimes a client asks us to work on one specific problem, and the data shows us that other levers, not initially identified by the company, can improve its performance.
The tool most used by data scientists is Hadoop, which lets you work with massive databases. I use Python for machine learning and for exploring data sets (mostly small ones). At Dataiku, we provide our clients with a tool called Data Science Studio, and I use it in my work every day. As for Business Intelligence, I use Vertica and Greenplum, which are very useful when you need to work on “vertical” (column-oriented) databases.
In my opinion, the major difference lies in the type of data analyzed. A data analyst handles data that is already more or less formatted and immediately usable. The data scientist, however, starts with a blank page and an impressive amount of raw data. In the chain of work, he sits upstream of the data analyst: he must extract the data and give it meaning. His work is in effect the groundwork for the data analyst, who is much more business oriented and stands at the end of the data science process.
Currently there are not many data scientists. It’s a job that will probably develop in the next few years. Companies’ needs are growing, and I think there will be an increasing number of data scientists, in large groups as much as in startups. Data is a major stake for companies; we see this every day at Dataiku. In the future, we will witness a segmentation of the job: some data scientists will specialize in machine learning, others in architecture. I also believe that the consulting dimension will develop a lot, to show companies which opportunities they can seize thanks to data.