• [kaggle] Instacart Market Basket Analysis Summary

    The second kaggle competition I’ve participated in just ended yesterday. Unfortunately, due to a wrong submission selection and a poor local cross-validation strategy, I ended up in the top 14% without any medal. But I have learned so much from this competition and from other kagglers. And since this competition’s business problem can be replicated somehow somewhere in the real world, I decided to write a short report on the features and takeaways of this competition.

    1. Problem formulation


    This Instacart market basket analysis competition aimed to predict which products a user will buy again over time. The business problem here could be translated into ‘Can I know my customers well enough to prepare a grocery list in their account, so that they don’t have to add the items manually next time?’. Related to that, but a bit out of scope for the competition, we could also find out a client’s favorite products or the potential products a customer will like.

    Instacart is a grocery ordering and delivery app based in the US.


    In general, the dataset contains the transactional history of when a user bought which product.

    The full dataset schema is as follows: source

    orders (3.4m rows, 206k users):

    • order_id: order identifier
    • user_id: customer identifier
    • eval_set: which evaluation set this order belongs in (see SET described below)
    • order_number: the order sequence number for this user (1 = first, n = nth)
    • order_dow: the day of the week the order was placed on
    • order_hour_of_day: the hour of the day the order was placed on
    • days_since_prior: days since the last order, capped at 30 (with NAs for order_number = 1)

    order_products__SET (30m+ rows):

    • order_id: foreign key
    • product_id: foreign key
    • add_to_cart_order: order in which each product was added to cart
    • reordered: 1 if this product has been ordered by this user in the past, 0 otherwise

    where SET is one of the three following evaluation sets (eval_set in orders):

    • "prior": orders prior to that user’s most recent order (~3.2m orders)
    • "train": training data supplied to participants (~131k orders)
    • "test": test data reserved for machine learning competitions (~75k orders)

    2. The approach

    Since the evaluation metric of the competition is F1-score and the goal is to predict which items users will reorder, my approach was feature engineering + a logloss-trained xgboost model + F1 optimization.

    • Feature engineering (frequency/distribution and interaction-based features)

      • Item (Product/aisle/department) features:

        1. number of users who have ordered this product
        2. number of users who have reordered this product
        3. user reorder ratio of this product: #2 / #1
        4. number of times this product has been ordered
        5. number of times this product has been reordered
        6. product reorder ratio: #5 / #4
        7. product reorder probability: number of users who purchased this product a second time / number of users who purchased it a first time
        8. mean add-to-cart order of the product


        1. number of times this aisle has been ordered
        2. number of times this aisle has been reordered
        3. aisle reorder ratio: #2 / #1
        4. mean add-to-cart order of the aisle


        1. number of times this department has been ordered
        2. number of times this department has been reordered
        3. department reorder ratio: #2 / #1
        4. mean add-to-cart order of the department

      • User features:
        1. number of orders placed by the user
        2. how long the user has been ordering from the website (user tenure)
        3. average days since prior order
        4. total number of products the user has bought
        5. number of unique products/aisles/departments the user has bought (user’s taste variance)
        6. average basket size of the user
        7. proportion of reordered products for the user

      • User x Item features:
        1. number of times the user has bought this product
        2. #1 / the user’s total number of orders
        3. first/last order number in which this user bought this product
        4. average add-to-cart order of the product for this user
        5. total order number - last order number containing this product (captures recency)
        6. whether this product was purchased in the user’s last three orders (captures recency)
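      Most of the counting features above reduce to pandas groupby aggregations. A minimal sketch on toy data (the column names follow the competition schema, but the feature names and toy values are my own):

```python
import pandas as pd

# Toy stand-in for the prior-order history: one row per
# (user, order_number, product) purchase.
hist = pd.DataFrame({
    "user_id":           [7, 7, 7, 7, 7, 9, 9],
    "order_number":      [1, 2, 3, 4, 4, 1, 2],
    "product_id":        [10, 10, 20, 10, 20, 10, 10],
    "reordered":         [0, 1, 0, 1, 1, 0, 1],
    "add_to_cart_order": [1, 1, 2, 1, 2, 1, 1],
})

# Item features: order/reorder counts, reorder ratio, mean cart position.
prod = hist.groupby("product_id").agg(
    prod_orders=("reordered", "size"),
    prod_reorders=("reordered", "sum"),
    prod_mean_cart_pos=("add_to_cart_order", "mean"),
).reset_index()
prod["prod_reorder_ratio"] = prod["prod_reorders"] / prod["prod_orders"]

# User x Item features: counts, first/last order, and recency flags.
up = hist.groupby(["user_id", "product_id"]).agg(
    up_orders=("order_number", "size"),
    up_first_order=("order_number", "min"),
    up_last_order=("order_number", "max"),
).reset_index()
n_orders = hist.groupby("user_id")["order_number"].max().rename("n_orders")
up = up.merge(n_orders, on="user_id")
up["up_orders_since_last"] = up["n_orders"] - up["up_last_order"]
up["up_in_last_three"] = (up["up_orders_since_last"] < 3).astype(int)

print(prod)
print(up)
```

      The aisle and department features are the same aggregations with a different groupby key.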

      These are pretty much all the features I generated. Among them, the most important features selected by the model were the average add-to-cart order of the product and whether the product was purchased in the user’s last three orders, which capture the recency of the users’ behavioral patterns.

      The important findings in this feature engineering part are:

      1) Remove highly correlated features (corr > 95%) before feeding the variables into the xgboost model. Even though xgboost handles correlated features reasonably well (it assigns them similar importance due to the way it splits on features), removing them lets the model reach similar or better performance by picking just one of each pair, not to mention the decrease in computing time.
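      A sketch of that correlation filter in pandas (the `drop_correlated` helper is hypothetical, not my actual competition code):

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.95):
    """Drop one feature out of every pair with |corr| above threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)

X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with "a"
    "c": [4.0, 1.0, 3.0, 2.0],
})
print(drop_correlated(X).columns.tolist())  # "b" is removed
```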

      2) Kaggler @raddar shared one of his ways of examining whether a variable is important in a post which I found really useful: “Take new feature, bin it into let’s say 50 attributes. Now take your best model out-of-fold predictions. Then for each bin, calculate mean target rate (if you are using binary model) and mean predicted probability of your classifier. if you see high difference between mean and predicted values in any of attributes (which has significant amount of observations), the feature is likely to reduce the binomial variance if included in the new model.” In short: for continuous features, split them into 20-50 buckets and, within each bucket, compare the mean of the target against the mean of the model’s predictions.
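      A rough sketch of that diagnostic (the `bin_diagnostic` helper and its toy data are my own; it assumes you already have out-of-fold predictions):

```python
import numpy as np
import pandas as pd

def bin_diagnostic(feature, y_true, oof_pred, n_bins=20):
    """Per-bin gap between actual target rate and OOF predicted rate."""
    df = pd.DataFrame({"f": feature, "y": y_true, "p": oof_pred})
    df["bin"] = pd.qcut(df["f"], q=n_bins, duplicates="drop")
    g = df.groupby("bin", observed=True).agg(
        n=("y", "size"), target_rate=("y", "mean"), pred_rate=("p", "mean")
    )
    g["gap"] = (g["target_rate"] - g["pred_rate"]).abs()
    return g.sort_values("gap", ascending=False)

rng = np.random.default_rng(0)
f = rng.uniform(size=2000)
y = (rng.uniform(size=2000) < f).astype(int)   # target depends on f
p = np.full(2000, y.mean())                    # a model that ignores f
report = bin_diagnostic(f, y, p)
# Large per-bin gaps => the feature carries signal the model misses.
print(report.head())
```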

      3) Add binary attribute features of the product (gluten-free/Asian food/organic/pet). For example, if a user has a cat/dog, it’s possible that he/she will reorder pet products, and if he/she likes Asian food or organic food, it’s likely that he/she will order it more frequently than others, etc.

      4) Use the max/mean/std of the numeric datetime features when describing the time gaps between a user and the moments he placed his orders.

      5) I didn’t implement product text features in my model, but people who got a good final score used them. Here is a summary of their approach (from @SVJ24, @plantsgo, @Arcady27):

      • Product name length
      • Product name rarity: mean of idf of each word in a product (capture people who like exotic products)
      • Product vector: Treat each order as a sentence and the product_ids in it as words (group by order_id; each product of the order becomes a word in that sentence).


        For example, in order #1 the user bought products 49302, 11109, 10246, 49683, 43633, 13176, etc., so each product_id is converted to a string and fed as a word into the w2v model to learn its vector representation, and PCA is used to reduce the dimensionality.

      • User vector: Treat each user as a document and each order as a sentence in that document (group by user_id, with each order of the user as a sentence and product_id/aisle_id/department_id as words).

        Treat every order as a sentence and product_id/aisle_id/department_id as words, with user_id as the document label, so one document assembles a user’s order history and the document id is the user id. At the end, you get an embedding for each user.

      • Product vector × User vector: take the dot product of the (normalized) vectors to get the cosine similarity between the user and the product.

      • Product vector × aisle vector: the similarity between product and aisle indicates how unusual this product is for its aisle.

    • Model + CV framework

      • Model:

        1. I used an xgboost model with logloss as the evaluation metric. A small learning rate seemed to be required; 0.01 did best among [0.1, 0.05, 0.01].
        2. a ‘None’-predicting model: a binary model using the same features, but whose target is whether the whole order contains any reordered product (1) or none at all (0)
      • CV:

        I completely failed this step, which cost me my ranking: I simply split the dataset 70-30 and then ran 3-fold CV on the 70%. The final result of the competition (dropping from 187th to 326th) proved my model to be overfitting (kinda expected)… In fact, the dataset should be split by user_id, since reordering is defined per user: the model has to be accurate at the user level, and so does the CV framework. In simple words, all rows belonging to the same user have to fall into the same fold, never spread across both the train and validation folds.
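        scikit-learn’s GroupKFold implements exactly this user-level split; a small sketch on toy (user, product) rows:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy (user, product) rows: groups = user_id, so that all rows of one
# user land in the same fold and user-level leakage is avoided.
user_id = np.array([1, 1, 1, 2, 2, 3, 3, 4, 4, 4])
X = np.arange(len(user_id)).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 1, 1, 0, 0, 1])

for tr, va in GroupKFold(n_splits=2).split(X, y, groups=user_id):
    tr_users = set(user_id[tr])
    va_users = set(user_id[va])
    assert tr_users.isdisjoint(va_users)   # no user in both folds
    print("train users:", sorted(tr_users), "valid users:", sorted(va_users))
```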

    • Winner solutions: It’s always good to learn from the best, so the links to the winner solutions are attached:

      1. 2nd place solution
      2. 3rd place solution
      3. 8th place solution
      4. 9th place solution
      5. 43rd place solution

      This competition is also really close to the scope of my current job. Being part of it allowed me to tackle a similar problem from a new angle, especially in creating features that capture client behavior. Even though I didn’t win any medal, it certainly gave me useful insights and experience with this business problem. Have fun Kaggling!

  • [Kaggle] Quora question pairs summary

    The Quora question pairs competition on kaggle ended two months ago. It was my first serious kaggle competition, and as the final result I got a bronze medal for finishing in the top 8% of the leaderboard.

    Frankly speaking, this competition was fierce: I remember that one Saturday before the competition ended I was in the top 3%, and when I checked the following Monday I had already been pushed down to the top 7%. Some really strong kaggle masters were competing there, which showed me the huge gap between my model and theirs. That’s why I decided to write this blog post to share some takeaways from the competition and some useful features drawn from the winner solutions shared on the discussion board after the competition ended.

    Problem formulation


    The objective of the competition is to identify duplicate questions in the question dataset by predicting the probability that two questions are duplicates.


    The training dataset provided by Quora contains 404290 pairs of questions (question1, question2), along with a column ‘is_duplicate’ whose value is 1 if the pair of questions is a duplicate and 0 otherwise. The same goes for the test dataset, with 2345796 pairs of questions.


    Ok, it seems pretty clear that this is a classification problem, so we need to use a model built from the training dataset to predict whether each pair of questions in the test dataset is a duplicate or not.

    My approaches: When we talk about building models…

    The first step in tackling a text mining problem is to turn the words into a numeric representation so that the algorithm can ‘understand’ them. There are usually two methods for this: 1) TFIDF and 2) word2vec. For the TFIDF part I used bigrams to get the word weightings, and a pre-trained word2vec model for the word vectors. My final model is a simple xgb model with 47 features under a Python 3.5 environment.
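    As a sketch of the TFIDF half, a bigram TfidfVectorizer plus cosine similarity as a single pair feature (toy questions, not the competition data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

q1 = ["How do I learn Python?", "What is the capital of France?"]
q2 = ["What is the best way to learn Python?", "Name the capital city of France."]

# Unigram + bigram TFIDF fitted on all questions.
tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit(q1 + q2)

# One similarity value per question pair, usable as a model feature.
sims = [cosine_similarity(tfidf.transform([a]), tfidf.transform([b]))[0, 0]
        for a, b in zip(q1, q2)]
print(sims)
```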

    The ‘To try’ parts below are summarized from the winning solutions posted on the discussion board after the competition ended.

    1. Preprocessing (no surprise here..)

    1. Lowercase
    2. Remove question mark
    3. Lemmatization
    4. Snowball Stemmer

    To try:

    1. Spelling correction
    2. Name Entity Recognizer (spaCy)
    3. Use [Textacy](https://github.com/chartbeat-labs/textacy) for preprocessing

    2. Feature generation

    Feature generation plays an important role in text mining, and the features I’ve generated can be categorized as follows:

    1. NLP text features: descriptive measures of the two sentences.

    1.	Length of question1, question2
    2.	Difference of the lengths of the two questions (-/+ unique words)
    3.	Ratio of the lengths of the two questions (-/+ unique words)
    4.	Character-wise length of question1, question2 (with/without stopwords)
    5.	Character-wise ratio of the lengths of question1, question2
    6.	Number of words in question1, question2
    7.	Number of common words in question1, question2
    8.	Ratio of the number of common words divided by the total length of the two questions (stopwords removed)
    9.	Ratio of the common words’ tfidf scores divided by the total tfidf scores of the two questions
            (-/+ stopwords)
    10.	Number of total unique words in the two questions (with/without stopwords)
    11.	same_start_word/same_last_word: binary features

    To try:

    1.  Longest common subsequence (LCS) of both questions
    2.  POS-based features
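    Many of the features in the list above are a few lines of Python each; a sketch (the `pair_features` helper and its tiny stopword list are my own, purely illustrative):

```python
def pair_features(q1, q2, stopwords=frozenset({"the", "a", "an", "is", "of"})):
    """A handful of the descriptive pair features listed above."""
    w1, w2 = q1.lower().split(), q2.lower().split()
    s1, s2 = set(w1) - stopwords, set(w2) - stopwords
    common = s1 & s2
    return {
        "len_q1": len(w1),                      # word counts
        "len_q2": len(w2),
        "len_diff": abs(len(w1) - len(w2)),
        "char_len_q1": len(q1),                 # character-wise lengths
        "char_len_q2": len(q2),
        "n_common": len(common),                # common words (stopwords removed)
        "common_ratio": len(common) / max(len(s1 | s2), 1),
        "same_start": int(w1[0] == w2[0]),      # binary features
        "same_last": int(w1[-1] == w2[-1]),
    }

f = pair_features("how do i learn python", "what is the best way to learn python")
print(f)
```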

    2. Distance features (mostly inspired by @abhishek, whose work really guided me into the competition): the distance between two vectors.

    The word vectors come from the pre-trained Google News w2v model. I tried to build my own w2v model using all the questions in both datasets, but it didn't outperform the pre-trained one; I guess the reason is quite obvious given the size of the corpus.
     1. Jaccard similarity
     2. Cosine similarity
     3. Euclidean distance
     4. Minkowski distance
     5. Levenshtein distance
        (package *fuzzywuzzy*: string similarity/partial string similarity/token sort)
     6. Word mover's distance

    To try:

     1. Use GloVe for the vector embeddings (and recompute the similarities)
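    A sketch of a few of these distances using scipy (the toy vectors stand in for averaged word2vec question vectors):

```python
import numpy as np
from scipy.spatial.distance import cosine, euclidean, minkowski

# Toy sentence vectors; in the competition these came from averaging
# pre-trained word2vec vectors over each question's words.
v1 = np.array([0.2, 0.1, 0.7])
v2 = np.array([0.25, 0.05, 0.65])

feats = {
    "cosine": cosine(v1, v2),          # 1 - cosine similarity
    "euclidean": euclidean(v1, v2),
    "minkowski_3": minkowski(v1, v2, p=3),
}

# Jaccard similarity on token sets, as in the list above.
t1, t2 = {"learn", "python", "fast"}, {"learn", "python", "quickly"}
feats["jaccard"] = len(t1 & t2) / len(t1 | t2)
print(feats)
```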

    3. Graphical features:

    At the time, I had absolutely no idea that this kind of feature could be applied to an NLP competition, yet it turned out to be extremely important (thanks to Shankara Srivastava’s idea of a PageRank feature). You can also see a lot of this kind of graphical analysis in social network analysis.


    Basically, every question is a node in the graph; the degree of a node can be seen as the frequency of that question, and the number of common neighbors of two nodes can be used as a pairwise feature.

    To try:

      1. max-clique features of edges
      2. max-clique features of nodes
      3. unique neighbors
      4. paths of length n between q1 & q2
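    A minimal sketch of the basic graph features using plain dictionaries (networkx would work equally well; the toy pairs are made up):

```python
from collections import defaultdict

# Each question is a node; each training pair is an edge.
pairs = [("q1", "q2"), ("q1", "q3"), ("q2", "q3"), ("q3", "q4")]
adj = defaultdict(set)
for a, b in pairs:
    adj[a].add(b)
    adj[b].add(a)

def graph_features(qa, qb):
    return {
        "deg_a": len(adj[qa]),                     # question frequency
        "deg_b": len(adj[qb]),
        "common_neighbors": len(adj[qa] & adj[qb]),
    }

print(graph_features("q1", "q2"))
```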

    3. Imbalanced dataset

    In the training dataset, the percentage of duplicate questions is 36.9%, so it’s important to rebalance the dataset. There are several methods to deal with this problem, e.g. oversampling (oversample the minority class) and undersampling (undersample the majority class). But instead of using these two sampling methods, I used the class weights in the xgb DMatrix arguments to make the model treat the two classes differently. It turned out to be quite effective, and gave me at least a 0.02 improvement.
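    A sketch of the weighting idea: with ~36.9% positives, the usual xgboost knob is `scale_pos_weight` (per-row weights passed to the DMatrix are an alternative); the params dict below is illustrative, not my exact configuration:

```python
import numpy as np

# Toy labels matching the ~36.9% duplicate rate of the training set.
y = np.array([1] * 369 + [0] * 631)

# Standard heuristic: weight positives by the negative/positive ratio.
scale_pos_weight = (y == 0).sum() / (y == 1).sum()
print(round(scale_pos_weight, 3))  # prints 1.71

# Illustrative xgboost parameters using that weight.
params = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "eta": 0.01,
    "scale_pos_weight": scale_pos_weight,
}
```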

    4. Conclusion

    I should definitely explore the graphical features more, since they proved to efficiently reveal the connection between two questions, and also create more features based on different preprocessing steps (raw, stemmed, cleaned, stopwords only, stopwords removed, etc.).

    I spent almost all the weekends of these two months on this competition and learned a lot from the great people there. Let’s just say that besides learning the theory of data science, being able to apply it is also critical, and I think Kaggle is a good platform to bridge the gap between these two aspects. Have fun kaggling!