A foreign customs agent recently asked me to list my occupation. When I replied “Data Scientist”, it was clear from the look on her face that her confusion resulted from more than just the language barrier. After a few minutes trying to describe what I do, the agent simply wrote “Scientist” on the customs paperwork. I was now officially a scientist, at least in the eyes of the Colombian government.
Unfortunately, exchanges like this aren’t entirely uncommon. While those in the data science and analytics communities are keenly aware of the data scientist’s job description, the profession isn’t quite as ubiquitous in the general population. Everyone knows (at a high level, at least) what lawyers and doctors do for a living. The term “data scientist”, however, often elicits blank stares and awkward pauses. To be fair, this is partly our own fault. Experts in the field often debate the skills and job description of a data scientist, and the profession itself is still rapidly evolving. Combine that with the canonical visual representation of the field (see below), and the modern data scientist seems more like a unicorn than an actual individual.
Rather than rehash previous discussions, I want to provide some boundaries around the modern data scientist’s job description. I’ll do this through the lens of my own day-to-day activities. Part I of this post will list things definitely within the scope of my job description. Part II, then, will list activities that fall well outside of my purview. I hope to provide a very general overview, though it’s certainly possible that a) my job requires activities that aren’t required of other data scientists, or b) other data scientists engage in activities that are well beyond the scope of what I’m asked to do. This, at heart, is one of the largest debates amongst modern data scientists. My goal here is simply to provide one practitioner’s point of view.
To provide some perspective, you should know that I consider myself your run-of-the-mill Data Scientist. I don’t work for Apple or Google, and many of my data sets are small enough to fit into local memory. I use Amazon Web Services and various forms of distributed processing more out of curiosity than necessity, but I occasionally run into a problem that can’t be solved on your standard laptop. My academic background is in statistical modeling, research methods, experimental design, and program evaluation. However, I cut my professional teeth in fraud and credit modeling, and I utilize machine learning methods more frequently than I use classical statistical techniques.
With that in mind, here are 8 things I definitely do as a data scientist…
1) Provide in-house Statistical Consulting
Rarely does a day go by during which I’m not asked to interpret the results of some research or white paper, consult on an in-house analysis, recommend an appropriate experimental methodology, or conduct a simple statistical analysis for another department (like this, for example). In some sense, this was a motivating factor for my original hire: the desire to have statistical expertise in-house. This is certainly one of the more interesting aspects of my position, as it gives me direct access to cross-department initiatives and a more holistic view of the organization.
2) Automate Data-driven Processes
Prior to my arrival, incoming data was manually screened and cleaned by members of the analytics team. Similarly, cleaned data was manually uploaded into our data warehouse via a standard user interface. Both processes were incredibly inefficient: they consumed resources and bogged down analytics productivity. One of my first tasks was to automate them, using statistical tools to screen incoming data and scripting languages (of my choice) to script the cleaning and upload steps. Automating data-intensive processes falls well within my job description, and several additional requests along these lines are currently in my pipeline.
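To make the screening idea concrete, here is a minimal sketch of one way incoming records could be screened statistically before upload. It is not the process described above: the function name `screen_records`, the record layout, and the 3.5 cutoff on a median-based (modified) z-score are all assumptions chosen for illustration.

```python
from statistics import median


def screen_records(records, field, threshold=3.5):
    """Split records into (clean, flagged) using a modified z-score on `field`.

    Hypothetical helper: records are plain dicts, and any record whose value
    is missing, non-numeric, or far from the median is flagged for review.
    """
    values = [r[field] for r in records if isinstance(r.get(field), (int, float))]
    if not values:
        return [], list(records)

    center = median(values)
    # Median absolute deviation: a spread estimate that outliers barely move.
    mad = median(abs(v - center) for v in values)

    clean, flagged = [], []
    for r in records:
        v = r.get(field)
        if not isinstance(v, (int, float)):
            flagged.append(r)  # missing or non-numeric value
            continue
        if v == center:
            score = 0.0
        elif mad == 0:
            score = float("inf")  # no spread at all, so any deviation stands out
        else:
            score = 0.6745 * abs(v - center) / mad
        (flagged if score > threshold else clean).append(r)
    return clean, flagged


incoming = [
    {"id": 1, "amount": 100.0},
    {"id": 2, "amount": 102.0},
    {"id": 3, "amount": 98.0},
    {"id": 4, "amount": 101.0},
    {"id": 5, "amount": 5000.0},  # suspiciously large
    {"id": 6, "amount": None},    # missing
]
clean, flagged = screen_records(incoming, "amount")
print([r["id"] for r in flagged])
```

A median-based rule is used here because a single bad record can inflate an ordinary mean and standard deviation enough to hide itself. Flagged records would still go to a person for review rather than being dropped silently.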
3) Develop Predictive Models
This is often considered the prototypical use-case for a data scientist. An organization has a business problem and data to describe it; the data scientist, then, uses said data to develop a predictive model that ultimately augments the subjective judgements of the executive team. Such models may be used to select targets of a marketing campaign, predict future customer performance and revenue (time series modeling, particularly as implemented in the R ‘forecast’ package, is useful to this end), predict customer retention (e.g., survival analysis), or evaluate new data sources. This process often requires an implementation phase, whereby the data scientist either a) uses existing model delivery systems to deploy predictions, or b) develops a model delivery system from scratch. While the former makes things much easier, the latter is often much more interesting, giving the data scientist complete control of the entire modeling process.
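For readers who have never seen a forecasting model, the sketch below shows simple exponential smoothing, one of the most basic time series methods. It is a hand-rolled Python illustration, not the R ‘forecast’ package or its API, and the function name, the smoothing weight `alpha`, and the revenue figures are made up for the example.

```python
def simple_exponential_smoothing(series, alpha=0.5):
    """Return the one-step-ahead forecast from simple exponential smoothing.

    Hypothetical helper for illustration. The smoothed level starts at the
    first observation and is updated as:
        level = alpha * observation + (1 - alpha) * level
    The forecast for the next period is the final level.
    """
    if not series:
        raise ValueError("series must contain at least one observation")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")

    level = series[0]
    for observation in series[1:]:
        level = alpha * observation + (1 - alpha) * level
    return level


monthly_revenue = [10.0, 12.0, 14.0]  # made-up numbers
print(simple_exponential_smoothing(monthly_revenue, alpha=0.5))  # 12.5
```

A larger `alpha` makes the forecast react faster to recent observations, while a smaller one smooths out noise. Real projects would also account for trend and seasonality, and would hold out data to check forecast accuracy.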
4) Provide Useful Visuals and Summaries for Executive Management
This, in my estimation, is the key to success as a data scientist. Data scientists possess a very broad array of highly technical skills. Ultimately, these skills are put to use solving a variety of business problems. Solutions to these problems must (typically) be approved by various members of the senior management team, most of whom do not possess the same skill set as the data scientist presenting the results. Effective data visualization and communication are the key to organization-wide adoption of any data science initiative. Vision is our most powerful sense, providing the data scientist with an incredibly effective mechanism by which very complicated results can be communicated.
A little vision science knowledge goes a long way in that regard (I recommend the works of Edward Tufte and Stephen Few, both world-renowned experts in the field of data visualization). Equally important are verbal communication skills, particularly when it comes to knowing your audience. The typical C-suite executive doesn’t care how you cleaned the data, what methods you investigated, what the ROC for the final model is, or how you intend to deploy. The typical executive is more interested in a model’s impact on the business in terms of revenue and cost. Knowing your audience is key. The business case is typically more important than the technical details.
5) Manage Data and Analytics Vendors
Even with a data scientist on staff, an organization will often utilize external vendors to provide both additional data and analytics resources. As an example, an organization may use their customer information file to append marketing data to identify potential cross-sell opportunities. Similarly, an organization with a small data science team may decide to outsource less important (but necessary) analytics tasks, or to partner with an analytics vendor that already possesses the requisite technology to implement a solution. The data scientist, then, may be charged with keeping these vendors honest by managing the relationships and continuously questioning the services they provide. In fact, the data scientist may be the only person qualified to effectively hold these vendors accountable. Data and analytics vendors typically charge a premium for their services, and the data scientist should be responsible for ensuring these vendors provide the appropriate ROI.
6) Use Data to Improve Products
This is often touted as one of the most powerful contributions a data scientist can make within an organization. Frequently, I’m asked to evaluate a certain product or initiative owned by a particular department for both efficacy and potential modifications. This requires an in-depth understanding of available in-house data, in addition to knowledge of relevant, reliable, publicly available data sources that may be required to produce viable results. These projects are often the most interesting simply because they require a variety of skills.
A typical product improvement project starts with a conversation with the product owners to understand what they expect the data to show. The in-house data is then acquired and cleaned. Relevant publicly available data is pulled either through an API or with a custom web scraper, then cleaned and merged with the in-house data, and a high-level summary is presented to ensure the data aligns with product owner expectations. Unexpected results are often fodder for additional analyses, and the process iterates between conversation and analysis until both parties are satisfied with the results. (I’m skipping many of the details in the name of brevity; it’s clearly a much more involved process than I’ve laid out.) These projects don’t always result in recommendations for product improvement, and the data scientist is responsible for justifying null results when they arise.
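The merge-and-summarize step can be sketched in a few lines. This is only an illustration under assumed inputs: the helper names `left_join` and `summarize`, the `region` key, and the sample rows are hypothetical, and real projects would more likely use a data frame library or SQL.

```python
from statistics import mean


def left_join(in_house, public, key):
    """Attach public-data fields to each in-house row that shares `key`.

    Hypothetical helper: unmatched in-house rows are kept and marked with
    matched=False so that coverage gaps stay visible in the summary.
    """
    lookup = {row[key]: row for row in public}
    merged = []
    for row in in_house:
        extra = lookup.get(row[key])
        merged.append({**row, **(extra or {}), "matched": extra is not None})
    return merged


def summarize(merged, field):
    """High-level summary to review with product owners before deeper analysis."""
    values = [r[field] for r in merged if r.get(field) is not None]
    return {
        "rows": len(merged),
        "matched": sum(r["matched"] for r in merged),
        "mean": mean(values) if values else None,
    }


in_house = [
    {"region": "A", "sales": 10},
    {"region": "B", "sales": 20},
    {"region": "C", "sales": 30},
]
public = [  # e.g. rows pulled from a public API; values are made up
    {"region": "A", "population": 1000},
    {"region": "B", "population": 3000},
]
merged = left_join(in_house, public, "region")
print(summarize(merged, "population"))
```

Reporting the match count alongside the summary statistic is deliberate: a low match rate is often the first unexpected result worth taking back to the product owners.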
7) Present Interesting Results to External Audiences
Data scientists are typically expected to be thought leaders with regard to their company’s data and technological infrastructure. As a result, data scientists are often expected to present onsite during pertinent customer initiatives and at various analytics conferences and meet-ups. These presentations are seen both as opportunities to display a company’s analytics sophistication and as opportunities to gain exposure to other practitioners and emerging technologies.
8) Keep Up with New Technologies
Of all my responsibilities, I lose the most sleep over this one. Big Data and analytics technologies emerge and evolve at an alarming pace. This Cambrian explosion of technologies is both exciting and daunting, and keeping up with everything can be utterly overwhelming.
Personally, I think it’s an incredibly important part of my job to at least familiarize myself with the available options, provide pros and cons when asked to evaluate new technologies, and provide sound reasons for preferring certain solutions over others. Conference attendance and connections with other practitioners are incredibly important to this end.
Is your experience as a Data Scientist the same or different? Have questions? Please post them in the comments.