oding-question?id=10300&python=)
There are usually several concepts interviewers are testing for on data science interviews but since they might only have time to ask 1-2 questions, they'll try to pack the concepts into one question. So it's important to know what these concepts are so you can look out for them in an interview.
So what are they really testing for? Really what an interviewer is looking for are interviewees with an in-depth understanding of metric design and implementation of a real-world scenarios that would be present in the data. The key phrase here is "real-world scenario", which means that there are probably going to be multiple edge cases and scenarios you'll need to think through to solve the problem. There are 3 common concepts that they test for that test your understanding of how to implement code that solves real-world scenarios.
Since they only have time to ask 1-2 questions in an interview before their time is up, you'll often see all 3 concepts wrapped in one question. I see this question, or a version of this question, ( /coding-question?id=10300&python= ) on almost every interview I've been on or given. Follow along with me and see if you would be able to answer this question.
The 3 concepts you need to know are CASE statements, JOINs, and subqueries/CTEs. Let's go through a real interview question that cover these 3 concepts and talk about them in-depth. The link to the question is here ((coding-question?id=10300&python=) if you want to follow along.
Aggregates from CASE STATEMENTs
You'll likely get some sort of categorization question where you need to categorize data based on values you see in the table. This is super common in practice and you'll likely always be categorizing and cleaning up data. So a CASE statement is the simplest technique to test for.
Add the addition of aggregates like sum() and count() and they'll be testing to see if you actually know what is being returned in a case when, not just the implementation of it. Based on the case statements, you can always add an aggregate functions like a count or a sum.
Here is an example of a CASE statement with a simple aggregation in the SELECT clause for the question.
You see in the CASe statement below, we're categorizing users based on if they are paying customers or mot. We then apply a sum() as it's a quick way to count the number of paying customers vs non-paying customers in one simple query. If we did not have the CASE statement, it would take us two queries to find both numbers.
SELECT date, sum(CASE
WHEN paying_customer = 'yes' THEN downloads
END) AS paying,
sum(CASE
WHEN paying_customer = 'no' THEN downloads
END) AS non_paying
FROM ms_user_dimension a
JOINs
The 2nd concept is JOINing tables. Can you join tables? This is the lowest bar you need to jump over to be an analyst, much less a data scientist. This bar is basically on the ground so you can really just step over it.
So on interviews -- do they usually do a LEFT JOIN, CROSS JOIN, INNER JOIN? Most of your work will be using a LEFT JOIN so they're testing you based on practicality. You'll almost never use a cross join. You'll use an inner join quite a bit but left join is slightly more complicated so they'll use that just as an additional filter.
Self joins are common because it's not always obvious you'd be using that. But they're common in practice.
In the below example, we're joining tables to the CASE statement. We're joining two tables to our main table using a LEFT JOIN.
SELECT date, sum(CASE
WHEN paying_customer = 'yes' THEN downloads
END) AS paying,
sum(CASE
WHEN paying_customer = 'no' THEN downloads
END) AS non_paying
FROM ms_user_dimension a
LEFT JOIN ms_acc_dimension b ON a.acc_id = b.acc_id
LEFT JOIN ms_download_facts c ON a.user_id=c.user_id
GROUP BY date
ORDER BY date
Subquery/CTE
The last common concept is a subquery/CTE, basically some concept where you're doing some work and then need to do more work on it. This is testing to see if you can break up your problem into logical steps. Some solutions take more than one step to solve so they're testing to see if you can write code that follows a logical flow. Not necessarily complicated or complex, but multi-step and pragmatic. This is especially useful in practice because you'll 100% be writing code that's over hundreds of lines long and you need to be able to create solutions that follow a good flow.
In the below example, I'm taking the query we wrote above and putting it in a subquery so that we can query its data. This way we can apply an additional filter in the HAVING clause and keep the entire solution to one query.
SELECT date, non_paying,
paying
FROM
(SELECT date, sum(CASE
WHEN paying_customer = 'yes' THEN downloads
END) AS paying,
sum(CASE
WHEN paying_customer = 'no' THEN downloads
END) AS non_paying
FROM ms_user_dimension a
LEFT JOIN ms_acc_dimension b ON a.acc_id = b.acc_id
LEFT JOIN ms_download_facts c ON a.user_id=c.user_id
GROUP BY date
ORDER BY date) t
GROUP BY t.date,
t.paying,
t.non_paying
HAVING (non_paying - paying) >0
ORDER BY t.date ASC
Those are the three most common concepts that are tested during data science coding interviews. They're common concepts because they appear on the job almost everyday, and to be a successful data scientist, you need to learn how to implement and code these solutions. If you want more information on these three concepts, feel free to watch the video
Comments
Post a Comment