WTF does a Data Scientist do all day?

A foreign customs agent recently asked me to list my occupation.  When I replied “Data Scientist”, it was clear from the look on her face that her confusion resulted from more than just the language barrier.  After a few minutes trying to describe what I do, the agent simply wrote “Scientist” on the customs paperwork.  I was now officially a scientist, at least in the eyes of the Colombian government.

Unfortunately, exchanges like this aren’t entirely uncommon.  While those in the data science and analytics communities are keenly aware of the data scientist’s job description, the profession isn’t nearly as well understood by the general population.  Everyone knows (at a high level, at least) what lawyers and doctors do for a living.  The term “data scientist”, however, often elicits blank stares and awkward pauses.  To be fair, this is partly our own fault.  Experts in the field often debate the skills and job description of a data scientist, and the profession itself is still rapidly evolving.  Combine that with the canonical visual representation of the field, and the modern data scientist seems more like a unicorn than an actual individual.


Rather than rehash previous discussions, my goal here is to provide some boundaries surrounding the modern data scientist’s job description.  I’ll do this through the lens of my own day-to-day activities.  Part I of this post will list things definitely within the scope of my job description.  Part II, then, will list activities that fall well outside of my purview.  I hope to provide a very general overview, though it’s certainly possible that a) my job requires certain activities that aren’t required of other data scientists, or b) other data scientists engage in activities that are well beyond the scope of what I’m asked to do.  This, at heart, is one of the largest debates amongst modern data scientists.  My goal here is simply to provide one practitioner’s point of view.

To provide some perspective, you should know that I consider myself your run-of-the-mill Data Scientist.  I don’t work for Apple or Google, and many of my data sets are small enough to fit into local memory.  I use Amazon Web Services and various forms of distributed processing more out of curiosity than necessity, but I occasionally run into a problem that can’t be solved on your standard laptop.  My academic background is in statistical modeling, research methods, experimental design, and program evaluation. However, I cut my professional teeth in fraud and credit modeling, and I utilize machine learning methods more frequently than I use classical statistical techniques.

With that in mind, here are 8 things I definitely do as a data scientist…

1)  Provide in-house Statistical Consulting

Rarely does a day go by during which I’m not asked to interpret the results of some research or white paper, consult on an in-house analysis, recommend an appropriate experimental methodology, or conduct a simple statistical analysis for another department (like this, for example).  In some sense, this was a motivating factor for my original hire: the desire to have statistical expertise in-house.  This is certainly one of the more interesting aspects of my position, as it gives me direct access to cross-department initiatives and a more holistic view of the organization.

2)  Automate Data-driven Processes

Prior to my arrival, incoming data was manually screened and cleaned by members of the analytics team.  Similarly, cleaned data was manually uploaded into our data warehouse via a standard user interface.  Both were incredibly inefficient processes that consumed resources and bogged down analytics productivity.  One of my first tasks was to automate them, using statistical tools to screen incoming data and (my choice of) scripting languages to drive the process.  Automating data-intensive processes falls well within my job description, and several additional requests along these lines are currently in my pipeline.
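To make that concrete, here is a minimal sketch of what such automated screening might look like. The field names and validation rules below are invented for illustration, not taken from any real system:

```python
# Rule-based screening of incoming records: flag anything with a
# missing field or a value outside its plausible range.

def screen_record(record, rules):
    """Return a list of problems found in a single record."""
    problems = []
    for field, (lo, hi) in rules.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif not (lo <= value <= hi):
            problems.append(f"{field}: {value} outside [{lo}, {hi}]")
    return problems

# Hypothetical plausibility rules for two fields
RULES = {"age": (18, 110), "balance": (0, 1_000_000)}

incoming = [
    {"age": 34, "balance": 1200.0},
    {"age": 240, "balance": 50.0},   # implausible age
    {"age": 29},                     # missing balance
]

clean = [r for r in incoming if not screen_record(r, RULES)]
flagged = [r for r in incoming if screen_record(r, RULES)]
```

In practice the screening rules would be derived statistically (e.g., from historical distributions) rather than hard-coded, and flagged records would be routed for review instead of silently dropped.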

3)  Develop Predictive Models

This is often considered the prototypical use-case for a data scientist.  An organization has a business problem and data to describe it; the data scientist, then, uses said data to develop a predictive model that ultimately augments the subjective judgements of the executive team.  Such models may be used to select targets of a marketing campaign, predict future customer performance and revenue (time series modeling, particularly as implemented in the R ‘forecast’ package, is useful to this end), predict customer retention (e.g., survival analysis), or evaluate new data sources.  This process often requires an implementation phase, whereby the data scientist either a) uses existing model delivery systems to deploy predictions, or b) develops a model delivery system from scratch.  While the former makes things much easier, the latter is often much more interesting, giving the data scientist complete control of the entire modeling process.
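As a toy illustration of the forecasting piece (both the data and the method are stand-ins; real work would use proper time-series models such as ARIMA or exponential smoothing, as in the R ‘forecast’ package), a naive moving-average forecast might look like:

```python
# A simple moving-average forecast: each future point is the mean of
# the last `window` observed (or previously forecast) values.

def moving_average_forecast(series, window, horizon):
    history = list(series)
    forecasts = []
    for _ in range(horizon):
        avg = sum(history[-window:]) / window
        forecasts.append(avg)
        history.append(avg)  # feed the forecast back in for multi-step horizons
    return forecasts

monthly_revenue = [100, 104, 98, 102, 101, 99]   # invented figures
print(moving_average_forecast(monthly_revenue, window=3, horizon=2))
```

Even a sketch this crude highlights a real design question: multi-step forecasts must be built on earlier forecasts, which is why forecast uncertainty grows with the horizon.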

4)  Provide Useful Visuals and Summaries for Executive Management

This, in my estimation, is the key to success as a data scientist.  Data scientists possess a very broad array of highly technical skills.  Ultimately, these skills are put to use solving a variety of business problems.  Solutions to these problems must (typically) be approved by various members of the senior management team, most of whom do not possess the same skill set as the data scientist presenting the results.  Effective data visualization and communication is the key to organization-wide adoption of any data science initiative.  Our visual perception is our most powerful sense, providing the data scientist with an incredibly powerful mechanism by which very complicated results can be communicated.

A little vision science knowledge goes a long way in that regard (I recommend the works of Edward Tufte and Stephen Few, both world-renowned experts in the field of data visualization).  Equally important are verbal communication skills, particularly when it comes to knowing your audience.  The typical C-suite executive doesn’t care how you cleaned the data, what methods you investigated, what the ROC curve for the final model looks like, or how you intend to deploy.  The typical executive is more interested in a model’s impact on the business in terms of revenue and cost.  Knowing your audience is key.  The business case is typically more important than the technical details.

5)  Manage Data and Analytics Vendors

Even with a data scientist on staff, an organization will often utilize external vendors to provide additional data and analytics resources.  As an example, an organization may use their customer information file to append marketing data to identify potential cross-sell opportunities.  Similarly, an organization with a small data science team may decide to outsource less important (but necessary) analytics tasks, or to partner with an analytics vendor that already possesses the requisite technology to implement a solution.  The data scientist, then, may be charged with keeping these vendors honest by managing the relationships and continuously questioning the services they provide.  In fact, the data scientist may be the only person qualified to effectively hold these vendors accountable.  Data and analytics vendors typically charge a premium for their services, and the data scientist should be responsible for ensuring these vendors provide the appropriate ROI.

6)  Use Data to Improve Products

This is often touted as one of the most powerful contributions a data scientist can make within an organization.  Frequently, I’m asked to evaluate a certain product or initiative owned by a particular department for both efficacy and potential modifications.  This requires an in-depth understanding of available in-house data, in addition to knowledge of relevant, reliable, publicly available data sources that may be required to produce viable results.  These projects are often the most interesting simply because they require a variety of skills.

A typical data improvement project starts with a conversation with the product owners to understand what they expect the data to show.  The in-house data is then acquired and cleaned, and augmented with relevant publicly available data obtained either through an API or through a custom web scraper.  The external data is cleaned and merged with the in-house data, and a high-level summary is presented to ensure the data aligns with product owner expectations.  Unexpected results are often fodder for additional analyses, and the process iterates between conversation and analysis until both parties are satisfied with the results.  (I’m skipping many of the details in the name of brevity; it’s clearly a much more involved process than I’ve laid out.)  These projects don’t always result in recommendations for product improvement, and the data scientist is responsible for justifying null results when they arise.

7)  Present Interesting Results to External Audiences

Data scientists are typically expected to be thought leaders with regard to their company’s data and technological infrastructure.  As a result, data scientists are often expected to present onsite during pertinent customer initiatives and at various analytics conferences and meet-ups.  These presentations are opportunities both to display a company’s analytics sophistication and to gain exposure to other practitioners and emerging technologies.

8)  Keep Up with New Technologies

Of all my responsibilities, I lose the most sleep over this one.  Big Data and analytics technologies emerge and evolve at an alarming pace.  This Cambrian explosion of technologies is both exciting and daunting, and keeping up with everything can be utterly overwhelming.

Personally, I think it’s an incredibly important part of my job to at least familiarize myself with the available options, provide pros and cons when asked to evaluate new technologies, and provide sound reasons for preferring certain solutions over others.  Conference attendance and connections with other practitioners are incredibly important to this end.

 Is your experience as a Data Scientist the same or different? Have questions? Please post them in the comments.

5 tips for guys to save serious money on wedding planning

So you’re taking the plunge and starting to realize just how expensive weddings are. You’re probably wondering, “How can I save the most money possible?” Well, the easy answer is: get married at the courthouse. Sadly, if you are reading this right now…  it’s probably too late.

Whether you have to foot the bill yourself or are working within a budget someone is gifting you, everyone wants to save as much as possible. So here are some things to keep in mind when getting ready for your big day that can save you some serious money.


1) Choose your venue wisely

If you can, narrow down the venues online or over the phone before you start looking in person. The worst thing that can happen is that you both fall in love with a venue that puts you right at your budget before you’ve fed a single person.

Here are some things to consider which can make a huge difference in your bottom line.

1.  Is a Sunday, or any day other than Saturday, available?

Sunday is almost always cheaper. The truth is, if you have people traveling they would likely have to take Friday off anyway. With a Sunday wedding they simply flip it to Monday, and you can save thousands.

2. Can you use your own caterer or are you locked into their list? Is there an additional fee to use someone else?

A lot of places force you to use them for catering or have a preferred catering list. If this is the case, price out the catering before you let her set foot there. This will easily be one of your biggest expenses; don’t lose control of it early on.

3. What are their rules on alcohol? Can you buy it from a wholesaler or are you stuck paying their prices?

Right up there with food, alcohol costs multiply quickly. Make sure you know the venue’s policy. See if you can buy it from a wholesaler and hire your own bartenders. If you’re lucky, many wholesalers will buy unused bottles back. This sure beats paying another $20-$30 a head.

Try to get these questions answered before you visit, and before she decides she can’t see herself getting married anywhere else. The venue will prove to be the foundation of your expenses.

2) Invitations


Invitations can literally cost you thousands of dollars if you start getting fancy and have a company do them for you. Well, it is 2013 and snail mail is almost dead; help send it to its final resting place. Send custom emails to everyone, or if you want to get a little fancier, try a site like Greenvelope.

If you have some people on the list who need a formal invite, buy the invitations and print them yourself. It really isn’t that hard and can save you big. Put your printer on photo quality and no one can tell.

Remind yourselves: a lot of people are probably going to throw these out the same day they get them.

3) Traditional registries are a trap

Why does everyone think you need a set of china? Yes, this may be the only time in your life you have the chance for someone to buy expensive plates for you. But if someone gave you money today, would you really buy those? Wouldn’t you rather have cash, so you at least have the option to get what you actually want?

Try out this site. It allows you to list items on a registry for those who insist on buying you a gift, but the great part is you don’t get the gift. At least not until you take the money from your account and buy it yourself.

Another good option is an Amazon Wedding Registry. Its huge inventory and easy returns make a gift from Amazon almost as good as cash.

If you really have to, make a small registry on a site for her bridal shower. But we all know cash is king.

 4) Rings for you and her

Maybe you already took care of her wedding band when you bought the engagement ring. Oh wait, you didn’t know? She gets TWO rings! That’s right, you probably still need to buy her a wedding band to complement that engagement ring. If she is looking for something with stones in it, ask your families if they have any old diamond rings or earrings. You would be surprised. If someone in one of your families does and is willing to part with it, take it to a jeweler. You can save hundreds to thousands of dollars by having the stones reused in her ring, and now it’s sentimental.

Now for your ring. You need to ask yourself: as a guy, how much do you care about your ring? If you are willing to go with something simple, then the internet is your best friend. Get sized at a jeweler and then start looking on eBay. You can literally find men’s wedding bands for 5-10% of the cost you’ll pay at the jeweler. Since you are saving so much on it, maybe you can ask her for something that you really want. New watch?

 5) Take care of the big things first

You are inevitably going to get worn down with wedding planning, and you will not agree on everything. The key to saving the most money is to take care of all the big things first, which will give you the most ROI on your research and time. At a minimum, be involved with the venue, alcohol, entertainment, and invites. After that, try to save money where you can, but wedding planning is already stressful and you probably don’t want to be involved in every decision.

Remember, time is a finite resource in wedding planning… and in life. Spending an hour researching the cheapest drink straws might save you $10, but calling 5 additional vendors for a better price might save you thousands. Be wise with your time, and if you do find yourself with a DIY project, ask yourself, “Is there something better I could be doing?” For example, we decided to spend $20 on a ring bearer’s pillow rather than 2 hours crafting one ourselves. Knowing when to just spend some money is almost as important as knowing where to save it. Your sanity has a price.

Overall, try to keep your head up, and in the end do whatever you can to not go into debt.

What are some tips you have for saving money? Feel free to enter them in the comments.

If you are looking for specific ways to save thousands of dollars, sign up to be notified when my eBook, “Wedding hacks: Tips and tricks to save thousands”, is released.

The first 10 to sign up will get a free early release copy.

  • Proven scripts to negotiate with vendors
  • Things caterers don’t tell you
  • How to find an awesome photographer at half the price
  • Modern templates to make do it yourself invitations
  • How to maximize ROI on your guest list
  • Much more!

We hate spam as much as you do. You'll only receive a single email when the book releases.

6 Ways to Address Collinearity in Regression Models

Linear regression methods serve as a frequent launching point for those venturing into predictive analytics. Their roots lie in the advent of classical statistics (Francis Galton, cousin of the famed Charles Darwin, mentions “regression toward mediocrity” as far back as 1886), making them amongst the most heavily investigated and well-understood statistical methods. Their ease of interpretation makes them incredibly powerful in the context of controlled experiments, and their inclusion in nearly all modern statistical computation packages renders them approachable to even novice statistical modelers.


Unfortunately, linear regression methods come with a list of caveats, many of which are no different than the caveats associated with most methods hailing from classical statistics. More patient people than us have outlined these assumptions at length (no intro stats curriculum would be complete without a discourse on the assumptions required for the proper application of linear regression methods). Consequently, this post will focus on a problem that, in the author’s opinion, is far too frequently overlooked in academic settings: the problem of predictor collinearity.

In statistics, collinearity refers to an exact or approximate linear relationship between two explanatory variables.

Standard linear regression methods are known to fail (or, at least, perform sub-optimally) in the presence of highly correlated predictors. If the ultimate goal of the analysis is prediction (as opposed to interpretation of specific predictor-outcome relationships), some additional processing may be needed in order to produce a viable predictive model. In no particular order, we present six ways to deal with highly correlated data when developing a linear regression model. It should be noted that the recommendations below apply specifically to continuous outcome models, i.e., models in which the dependent variable is a real-valued number.

*Note: This is, by no means, meant to provide a thorough, technical overview of each topic. Instead, our goal is to identify some of the potential solutions to the collinearity problem in linear regression, spark conversation amongst practitioners and enthusiasts, serve as a starting point for those venturing into the realm of predictive analytics, and provide links to some of the relevant additional reading on each topic.

1) Manual Variable Selection

Highly correlated predictors contain redundant information. Consequently, removing individual features that are highly correlated with other predictors may produce viable predictive models with little loss in predictive power. The Variance Inflation Factor (VIF) of a predictor, defined as 1/(1 − R²) where R² comes from regressing that predictor on all the others, can be used to identify and eliminate potentially redundant features.  This method allows for a high degree of user input, but can prove tedious in cases with datasets containing excessive numbers of potential predictors. Alternatively, univariate correlations can be used to identify candidate predictors. This approach, however, necessarily ignores multivariate relationships in the data. Multicollinearity (i.e., predictors that are related to linear combinations of other predictors) may still prove an issue after univariate correlations are considered. This is especially true in large, high-dimensional data sets.
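In the two-predictor case, the VIF reduces to 1/(1 − r²), where r is the correlation between the two predictors. A minimal sketch (with made-up data; real datasets would use a proper regression of each predictor on all the others):

```python
# VIF for two predictors via the Pearson correlation: VIF = 1 / (1 - r^2).
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def vif_two_predictors(x1, x2):
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 2.0, 3.2, 3.9, 5.1]   # nearly a copy of x1
print(vif_two_predictors(x1, x2))  # large VIF -> near-redundant predictor
```

A common (though debated) rule of thumb flags predictors with VIF above 5 or 10 as candidates for removal.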

2) Tree-based Automatic Variable Selection


The canonical decision tree example: predicting outside play based on current weather conditions.

Decision trees provide a sort of automatic variable selection, as tree-based methods only include features that provide a legitimate contribution to the model’s overall performance. These methods are typically easy to implement and interpret, with feature selection resulting as a necessary part of the tree building process. More advanced tree-based methods (such as bagged or boosted trees), however, require additional tuning parameters that must be manually specified by the user. Although a variety of techniques exist to augment this selection process, this adds a level of complexity to overall model development. Additionally, tree-based regression methods provide viable predictive models in and of themselves. Using tree-based methods to select inputs for a linear regression model may be unnecessary, as tree-based methods produce useful, stand-alone predictive models.
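A toy regression stump (a one-split tree) illustrates the implicit feature selection: the split search simply never chooses an uninformative feature. The data and feature roles below are invented for illustration:

```python
# A regression stump picks the single feature/threshold pair that most
# reduces squared error; uninformative features are never selected.

def sse(values):
    """Sum of squared errors around the mean."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_stump(X, y):
    """X: list of feature-value lists (one list per feature).
    Returns the (feature index, threshold) minimizing post-split SSE."""
    best = (None, None, float("inf"))
    for j, feat in enumerate(X):
        for t in sorted(set(feat)):
            left = [yi for xi, yi in zip(feat, y) if xi <= t]
            right = [yi for xi, yi in zip(feat, y) if xi > t]
            total = sse(left) + sse(right)
            if total < best[2]:
                best = (j, t, total)
    return best[0], best[1]

signal = [1, 2, 3, 4, 5, 6]   # strongly related to y
noise = [5, 3, 6, 1, 4, 2]    # unrelated to y
y = [10, 11, 10, 20, 21, 20]  # jumps when signal exceeds 3
feature, threshold = best_stump([signal, noise], y)
```

The stump selects the `signal` feature with a threshold of 3 and ignores `noise` entirely; full tree construction simply repeats this search recursively on each resulting partition.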

3) Regression-based Automatic Variable Selection

Stepwise regression methods, including forward and backward elimination methods, use various statistical criteria to iteratively add and remove potential features. The result is a statistically produced final model with (typically) decent statistical properties. The downside, however, is that the user has very little control over which variables are ultimately selected. When all available predictors are considered as candidates for model inclusion, these methods may result in models that fail to generalize beyond the training set.  Additionally, these methods do not explicitly address predictor collinearity, and additional processing may be required after the final model is produced.
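As a greatly simplified sketch of the idea (closer in spirit to forward stagewise than to textbook stepwise regression, which refits the full model at each step and applies formal statistical entry/exit criteria), forward selection can be mimicked by repeatedly adding the predictor most correlated with the current residuals:

```python
# Simplified forward selection: greedily add the predictor most
# correlated with the residuals, then remove its univariate fit.

def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def forward_select(X, y, n_steps):
    resid = center(y)
    Xc = [center(f) for f in X]
    chosen = []
    for _ in range(n_steps):
        # pick the unchosen predictor with the largest scaled |covariance|
        j = max((j for j in range(len(Xc)) if j not in chosen),
                key=lambda j: abs(dot(Xc[j], resid)) /
                              (dot(Xc[j], Xc[j]) ** 0.5 or 1.0))
        chosen.append(j)
        beta = dot(Xc[j], resid) / dot(Xc[j], Xc[j])  # univariate slope
        resid = [r - beta * x for r, x in zip(resid, Xc[j])]
    return chosen

x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 2, 1, 2, 1]
y = [3, 3, 5, 5, 7, 7]          # mostly driven by x1
print(forward_select([x1, x2], y, n_steps=1))  # -> [0]
```

The greedy nature of this search is exactly why stepwise methods can overfit: each addition is judged only against the current residuals, never against held-out data.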

4) Variable Reduction via Principal Components Analysis (PCA)

PCA consumes entire chapters in academic texts (and, in fact, entire texts themselves), precluding a comprehensive overview in the current post. (Very, very, VERY) Briefly, PCA attempts to find linear combinations of predictors that are a) uncorrelated with each other, and b) explain as much of the variance in the feature-set as possible. In the context of correlated predictors, PCA can be used to create a set of predictors that are completely uncorrelated with each other. These predictors can then be used as inputs to any subsequent regression model (a procedure commonly referred to as “principal components regression”). This is an incredibly simplified explanation of the process. We’ve found this explanation to be a decent starting point for those interested in additional reading.

In this example, temperature and pressure are projected onto their two principal components, v1 and v2. This results in a set of orthogonal (i.e., uncorrelated) predictors. In practice, it’s likely only v1 would be kept, as this principal component accounts for the majority of the available variance.
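For just two predictors, the principal components can be computed in closed form from the 2×2 covariance matrix. The sketch below (with invented data) finds the leading eigenvector and projects onto it; real applications would use a linear algebra library on standardized data:

```python
# PCA for two predictors: the leading eigenvector of the 2x2 covariance
# matrix [[sxx, sxy], [sxy, syy]] lies at angle 0.5 * atan2(2*sxy, sxx - syy).
from math import atan2, cos, sin

def pca_2d(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    theta = 0.5 * atan2(2 * sxy, sxx - syy)
    v1 = (cos(theta), sin(theta))          # leading principal direction
    # scores: project each centered observation onto v1
    scores = [(a - mx) * v1[0] + (b - my) * v1[1] for a, b in zip(x, y)]
    return v1, scores

x = [1.0, 2.0, 3.0, 4.0]   # e.g., temperature
y = [1.1, 1.9, 3.2, 3.8]   # e.g., pressure, strongly correlated with x
v1, scores = pca_2d(x, y)
```

For strongly positively correlated predictors, v1 points along the diagonal (both components positive), and the single score series captures most of the joint variance.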

5) Variable Reduction via Partial Least Squares (PLS)

PCA (as described above) creates uncorrelated predictors without accounting for these predictors’ relationships with the outcome of interest. This can prove problematic in cases where the principal components (the uncorrelated predictors, so to speak) are unrelated to the outcome of interest. PLS, by comparison, takes a slightly different approach, accounting for the predictor-outcome relationship while reducing the number of candidate predictors. The result is a reduced feature-set that has been selected based on its relationship with the outcome of interest.
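The first PLS component can be sketched in a few lines: its weight vector is proportional to the predictor-outcome covariances, so the outcome guides the projection. The data below are invented, and real implementations (e.g., NIPALS or SIMPLS) extract further components by deflating X and y:

```python
# First PLS component: weights proportional to cov(feature, y),
# so the derived predictor is built *with* the outcome in mind.

def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

def pls_first_component(X, y):
    yc = center(y)
    Xc = [center(f) for f in X]
    # weight for each feature = its covariance (up to scale) with y
    w = [sum(a * b for a, b in zip(f, yc)) for f in Xc]
    norm = sum(wi ** 2 for wi in w) ** 0.5
    w = [wi / norm for wi in w]
    # component scores: project each observation onto w
    n = len(y)
    scores = [sum(w[j] * Xc[j][i] for j in range(len(Xc)))
              for i in range(n)]
    return w, scores

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]   # related to the outcome
x2 = [2.0, 2.0, 2.0, 2.0, 2.0]   # carries no outcome information
y = [2.0, 4.0, 6.0, 8.0, 10.0]
w, scores = pls_first_component([x1, x2], y)
```

Here the uninformative predictor receives zero weight, which is precisely the contrast with PCA: a high-variance but outcome-irrelevant feature could dominate a principal component, yet contributes nothing to a PLS component.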

6) Parameter Estimation via “Shrinkage” Methods

The cost function in conventional linear regression minimizes the sum of squared differences between the observed data points and the data points predicted by the linear regression model. In the presence of correlated predictors, minimizing this cost function can result in inordinately large regression coefficients as the method has difficulty quantifying the relationship between an outcome and any number of highly correlated predictors. To account for this, “shrinkage” methods add an additional penalty term to the cost function. This penalty term keeps the estimated value of the regression coefficients small, thereby reducing the inflation often seen in the presence of correlated data.

Three types of penalties are commonly used to rein in inflated coefficients resulting from collinearity. Ridge, or L2, regression adds a penalty based on the sum of squared regression coefficients, resulting in estimates that are artificially shrunk towards zero. Lasso, or L1, regression penalizes the sum of the absolute values of the regression coefficients, usually resulting in several zero-valued coefficients and effectively serving as a variable reduction technique. Finally, elastic net uses both penalty terms, allowing the user to specify which penalty plays a larger role in reining in the coefficients.
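In the single-predictor case the shrinkage effect is easy to see: the ridge estimate simply adds the penalty weight λ to the denominator of the least-squares slope. A sketch with invented data:

```python
# One-predictor ridge: OLS slope is sum(x*y)/sum(x^2) on centered data;
# ridge adds lambda to the denominator, shrinking the slope toward zero.

def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def slope(x, y, lam=0.0):
    xc, yc = center(x), center(y)
    num = sum(a * b for a, b in zip(xc, yc))
    den = sum(a * a for a in xc) + lam
    return num / den

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.1, 4.2, 4.8]

ols = slope(x, y)              # lambda = 0: ordinary least squares
ridge = slope(x, y, lam=5.0)   # shrunk toward zero
```

With correlated predictors the same idea applies in matrix form (adding λ to the diagonal of X′X), which is what stabilizes the otherwise near-singular system and keeps coefficient estimates from blowing up.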

Recommended Readings:

Max Kuhn and Kjell Johnson provide an excellent overview of these methods in their book “Applied Predictive Modeling”. (This post was, in fact, largely inspired by my first pass through the book.) It’s one of the better references I’ve found when it comes to applied predictive modeling, complete with R code to augment the in-text examples.