Programming Collective Intelligence: Reading Notes

  • Making Recommendations (Collaborative Filtering)
    • User-based
      • Finding similar users
        • User as vector based on item score
          • Euclidean distance
          • Pearson correlation
        • Swapping users and items lets us find the items most similar to a given item
      • Sort and recommend items based on
        • sum(user similarity * that user’s item score) over all other users, divided by sum(user similarity)
    • Item-based
      • Find item similarities
        • These results can be cached and periodically updated
      • Sort and recommend items based on
        • sum(item similarity * user’s item score) / sum(item similarity), summed over the items the user has rated
      • Significantly faster on large datasets and usually better on sparse ones
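The user-based steps above can be sketched roughly as follows. The `ratings` dictionary and the function names are hypothetical, and dividing the weighted sum by the total similarity (so that items rated by many users are not favored) is an assumption of this sketch.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: score}
ratings = {
    "ann": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob": {"A": 4.0, "B": 2.0, "C": 3.0, "D": 5.0},
    "cat": {"A": 1.0, "B": 5.0, "D": 2.0},
}


def pearson(prefs, u, v):
    """Pearson correlation between two users over the items both have rated."""
    shared = [item for item in prefs[u] if item in prefs[v]]
    n = len(shared)
    if n == 0:
        return 0.0
    xs = [prefs[u][item] for item in shared]
    ys = [prefs[v][item] for item in shared]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def recommend(prefs, user, similarity=pearson):
    """Rank items the user has not rated by similarity-weighted average score."""
    totals, sim_sums = {}, {}
    for other in prefs:
        if other == user:
            continue
        sim = similarity(prefs, user, other)
        if sim <= 0:  # ignore dissimilar users
            continue
        for item, score in prefs[other].items():
            if item in prefs[user]:
                continue
            totals[item] = totals.get(item, 0.0) + sim * score
            sim_sums[item] = sim_sums.get(item, 0.0) + sim
    return sorted(((totals[i] / sim_sums[i], i) for i in totals), reverse=True)


recommendations = recommend(ratings, "ann")
```

Transposing `ratings` into item -> {user: score} and reusing `pearson` gives item-to-item similarities, which is the starting point for the item-based variant.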
  • Discovering Groups (Clustering)
    • Supervised learning (for contrast; clustering itself is unsupervised)
      • uses example inputs and outputs
      • e.g. neural networks, decision trees, support-vector machines, and Bayesian filtering
    • Word Vectors of texts
    • Hierarchical Clustering
      • choose two nearest vectors to combine
      • results in binary tree
    • Can cluster articles or words
      • transpose the matrix
    • Dendrogram drawing
    • K-Means clustering
      • randomly place k centroids
      • assign every item to the nearest centroid, then move each centroid to the average location of the items assigned to it; repeat until the assignments stop changing
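A minimal sketch of the K-Means loop described above. One assumption differs from the notes: instead of placing centroids anywhere at random, this version starts from `k` randomly chosen data points, which avoids centroids that never attract any item. The function name, the `seed` parameter, and the sample points are all hypothetical.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster tuples of numbers; returns (assignment per point, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: initialise from data points
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(
                range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])),
            )
            for pt in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clustering has converged
        assignment = new_assignment
        # Move each centroid to the average location of its members.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids


points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
assignment, centroids = kmeans(points, 2)
```

On this toy input the first three points end up in one cluster and the last three in the other; which cluster gets index 0 depends on the random start.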
  • Searching and Ranking
    • word index stored in relational database
    • ranking
      • content-based
        • various metrics: word frequency, document location, word distance
      • use inbound links
        • simple count
        • PageRank algorithm
          • random walk
          • sparse matrix multiplication iterations
        • use link text
      • learning from clicks
        • click-tracking neural network (multilayer perceptron, i.e. MLP network)
          • one hidden layer
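The iterative PageRank computation mentioned above can be sketched as below. The `links` graph is hypothetical, and the damping factor of 0.85 and the fixed iteration count are assumed defaults rather than values taken from these notes. For clarity this version loops over a dictionary instead of using sparse matrix multiplication.

```python
def pagerank(links, damping=0.85, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {target for targets in links.values() for target in targets}
    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Each inbound linker passes on its rank divided by its outbound link count.
            inbound = sum(
                rank[source] / len(links[source])
                for source in links
                if page in links[source]
            )
            new_rank[page] = (1 - damping) + damping * inbound
        rank = new_rank
    return rank


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

Here page `c` is linked from both `a` and `b`, so it ends up with a higher score than `b`, which is only linked from `a`.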
  • Optimization
    • stochastic optimization
      • numerical solution
      • cost function
    • random searching
    • hill climbing
      • move to the neighboring solution with the lowest cost, i.e. step along the most promising dimension of the vector
    • simulated annealing
      • variable: temperature, starts very high and gradually gets lower
      • a worse solution may still be accepted, with a probability that shrinks as the temperature drops
    • genetic algorithms
      • mutate, crossover, …
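As an illustration of the simulated annealing bullets above, here is a small sketch. The acceptance probability `exp(-(new_cost - old_cost) / temperature)`, the cooling rate, and the toy cost function are assumptions, and remembering the best solution seen so far is an extra convenience added here.

```python
import math
import random


def anneal(cost, start, domain, temp=10000.0, cooling=0.95, step=1, seed=0):
    """Minimise cost over integer vectors; domain is a list of (low, high) bounds."""
    rng = random.Random(seed)
    current = list(start)
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    while temp > 0.1:
        # Nudge one randomly chosen dimension, staying inside its bounds.
        i = rng.randrange(len(current))
        low, high = domain[i]
        candidate = list(current)
        candidate[i] = min(high, max(low, candidate[i] + rng.choice([-step, step])))
        candidate_cost = cost(candidate)
        # Always accept improvements; accept worse moves with a
        # probability that falls as the temperature falls.
        if candidate_cost < current_cost or rng.random() < math.exp(
            -(candidate_cost - current_cost) / temp
        ):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
        temp *= cooling
    return best, best_cost


def toy_cost(vec):
    # Hypothetical cost function with its minimum at (3, 3).
    return sum((x - 3) ** 2 for x in vec)


start = [9, 0]
solution, solution_cost = anneal(toy_cost, start, [(0, 10), (0, 10)])
```

Setting the starting temperature close to the stopping threshold makes the loop behave much like random hill climbing, since worse moves are then rarely accepted.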
  • Document Filtering (to be expanded…)
    • use words as features
    • naive Bayesian classifier
    • the Fisher method
  • Modeling with Decision Trees
    • Algorithm: CART (Classification and Regression Trees)
      • choose the best split from all possible splits
        • Gini impurity
        • information entropy
          • -sum of p(x) * log2(p(x)) over all outcomes x
      • recursively build the whole tree
      • then can be used to classify new observations
      • pruning the tree
        • when it becomes overfitted
        • checking pairs of nodes that have a common parent to see if merging them would increase the entropy by less than a specified threshold
    • Dealing with
      • missing data
        • use both branches
      • numerical outcomes
        • use variance instead of entropy
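The split-quality measures named above can be written down compactly. This is a sketch with hypothetical function names; `information_gain` scores a single candidate split and is not a full CART implementation.

```python
from collections import Counter
from math import log2


def gini_impurity(labels):
    """Chance that two labels drawn at random (with replacement) differ."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """-sum of p(x) * log2(p(x)) over the distinct labels."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def variance(values):
    """Score for numerical outcomes, used in place of entropy."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def information_gain(parent, left, right, score=entropy):
    """Drop in impurity when parent is split into left and right."""
    p = len(left) / len(parent)
    return score(parent) - p * score(left) - (1 - p) * score(right)


labels = ["a", "a", "b", "b"]
gain = information_gain(labels, ["a", "a"], ["b", "b"])
```

A tree builder would evaluate `information_gain` for every candidate split, keep the best one, and recurse on the two resulting subsets.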
  • Building Price Models
    • k-nearest neighbors (kNN)
      • weighted
      • may need scaling or normalizing
      • to estimate the probability density
    • cross-validation
      • divide data into training sets and test sets
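A rough sketch of weighted kNN with a simple hold-out check. The Gaussian weighting, the `sigma` value, the single train/test split, and the toy data are assumptions made for illustration; a fuller cross-validation would average the error over several random splits.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def gaussian_weight(dist, sigma=1.0):
    """Weight that decays smoothly with distance and never goes negative."""
    return math.exp(-(dist ** 2) / (2 * sigma ** 2))


def weighted_knn(train, query, k=3, weight=gaussian_weight):
    """train is a list of (features, target); returns a weighted average target."""
    nearest = sorted((euclidean(features, query), target) for features, target in train)[:k]
    weights = [weight(dist) for dist, _ in nearest]
    total = sum(weights)
    if total == 0:  # all weights underflowed; fall back to a plain average
        return sum(target for _, target in nearest) / len(nearest)
    return sum(w * target for w, (_, target) in zip(weights, nearest)) / total


def holdout_error(data, k=3, test_fraction=0.25, seed=0):
    """Mean squared error on a random test split (one round of cross-validation)."""
    rng = random.Random(seed)
    rows = list(data)
    rng.shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    test, train = rows[:n_test], rows[n_test:]
    errors = [(weighted_knn(train, features, k) - target) ** 2 for features, target in test]
    return sum(errors) / len(errors)


# Hypothetical data: one feature, price is ten times the feature value.
data = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
estimate = weighted_knn(data, (2.0,), k=3)
error = holdout_error(data)
```

If the features were on very different scales, each would need rescaling before `euclidean` is applied; otherwise the largest-valued feature dominates the distance.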
  • Advanced Classification: Kernel Methods and SVMs
    • basic linear classification
      • using dot-products to determine distance
    • kernel methods
      • replacing the dot-product with a kernel function is equivalent to mapping the points into a different space
    • support-vector machines
      • find the line that is as far away as possible from each of the classes (maximum margin)
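To make the dot-product and kernel bullets concrete, here is a sketch of a classifier that assigns a point to whichever class average is closer, written purely in terms of a kernel function. With `kernel=dot` it is a plain linear classifier; with `kernel=rbf` the same code compares distances in the transformed space. The function names, the `gamma` value, and the sample points are hypothetical. This is not a support-vector machine, which would additionally search for the maximum-margin boundary.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rbf(a, b, gamma=0.1):
    """Radial-basis kernel: acts as a dot-product in a transformed space."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_kernel(xs, ys, kernel):
    """Average kernel value over all pairs, i.e. the 'dot-product' of two class means."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def classify(point, class_a, class_b, kernel=dot):
    """Return "a" if point is closer to the mean of class_a, else "b"."""
    offset = (mean_kernel(class_b, class_b, kernel) - mean_kernel(class_a, class_a, kernel)) / 2
    score = (
        mean_kernel([point], class_a, kernel)
        - mean_kernel([point], class_b, kernel)
        + offset
    )
    return "a" if score > 0 else "b"


# Two well-separated groups: both kernels agree.
group_a = [(0.0, 0.0), (1.0, 0.0)]
group_b = [(5.0, 5.0), (6.0, 5.0)]
linear_label = classify((0.5, 0.2), group_a, group_b)
rbf_label = classify((0.5, 0.2), group_a, group_b, kernel=rbf)

# Concentric groups share the same average, so only the kernel version can split them.
inner = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]
outer = [(3.0, 0.0), (-3.0, 0.0), (0.0, 3.0), (0.0, -3.0)]
inner_label = classify((0.2, 0.1), inner, outer, kernel=rbf)
```

The concentric case is the reason for swapping the dot-product: both rings average to the origin, so the linear version has nothing to separate them with.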
  • Finding Independent Features
    • non-negative matrix factorization
      • factor the article-word matrix into two matrices
        • the features matrix: row for features, column for words
        • the weight matrix: row for articles, column for features
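In matrix terms, the factorization above can be written as follows. The symbols are assumptions introduced here: V is the article-word matrix with a articles and w words, W is the weight matrix, H is the features matrix, and f is the chosen number of features. The multiplicative update rules shown are one common way to compute the factors; the notes do not say which method is used.

```latex
V \approx W H, \qquad
V \in \mathbb{R}_{\ge 0}^{a \times w},\quad
W \in \mathbb{R}_{\ge 0}^{a \times f},\quad
H \in \mathbb{R}_{\ge 0}^{f \times w}

% Repeated until the reconstruction error stops improving;
% \odot and the fractions are element-wise.
H \leftarrow H \odot \frac{W^{\top} V}{W^{\top} W H}, \qquad
W \leftarrow W \odot \frac{V H^{\top}}{W H H^{\top}}
```

Because every factor in the updates is non-negative, W and H stay non-negative as long as they start that way.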
  • Evolving Intelligence
    • creating an algorithm that creates algorithms (genetic programming)
    • mutation, crossover/breeding
    • use trees to represent algorithm to enable evolving
      • can be used to fit numerical functions or to build game AI
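A small sketch of the tree representation and the two evolution operators. Programs are nested tuples here for brevity; the node names, the function set, and the probabilities are all hypothetical choices.

```python
import operator
import random

# Function nodes: ("add" | "sub" | "mul", left, right)
# Leaf nodes:     ("param", index) or ("const", value)
FUNCTIONS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def evaluate(tree, inputs):
    """Run the program represented by the tree on a list of input values."""
    kind = tree[0]
    if kind == "param":
        return inputs[tree[1]]
    if kind == "const":
        return tree[1]
    return FUNCTIONS[kind](evaluate(tree[1], inputs), evaluate(tree[2], inputs))


def random_tree(rng, n_params, depth=3):
    """Grow a random program, stopping at leaves once depth runs out."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.5:
            return ("param", rng.randrange(n_params))
        return ("const", rng.randint(0, 10))
    name = rng.choice(sorted(FUNCTIONS))
    return (name, random_tree(rng, n_params, depth - 1), random_tree(rng, n_params, depth - 1))


def mutate(tree, rng, n_params, prob=0.1):
    """Replace random subtrees with freshly generated ones."""
    if rng.random() < prob:
        return random_tree(rng, n_params)
    if tree[0] in FUNCTIONS:
        return (tree[0], mutate(tree[1], rng, n_params, prob), mutate(tree[2], rng, n_params, prob))
    return tree


def crossover(a, b, rng, prob=0.3, top=True):
    """Walk down both trees and graft a branch of b into a copy of a."""
    if not top and rng.random() < prob:
        return b
    if a[0] in FUNCTIONS and b[0] in FUNCTIONS:
        return (
            a[0],
            crossover(a[1], rng.choice(b[1:]), rng, prob, False),
            crossover(a[2], rng.choice(b[1:]), rng, prob, False),
        )
    return a


rng = random.Random(0)
target = ("add", ("mul", ("param", 0), ("param", 0)), ("const", 3))  # x * x + 3
value = evaluate(target, [2])
child = crossover(target, random_tree(rng, 1), rng)
mutant = mutate(target, rng, 1, prob=0.5)
```

An evolution loop would score each tree by how well `evaluate` matches known outputs, keep the best ones, and fill the next generation with mutated and crossed-over copies.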
  • Algorithm Summary
    • Supervised Learning
      • Bayesian Classifier
      • Decision Tree Classifier
      • Neural Networks
      • Support-Vector Machines
      • k-Nearest Neighbors
    • Unsupervised Learning
      • Clustering
      • Multidimensional Scaling
      • Non-Negative Matrix Factorization
    • Optimization

2 Comments

  1. kyle wrote:

    Of the models above, which one does the author think gives the best recommendations?

    Monday, April 7, 2014 at 12:30 | Permalink
  2. Hello there! Do you know if they make any plugins to safeguard against hackers?
    I’m kinda paranoid about losing everything I’ve worked hard on. Any
    recommendations?

    Friday, May 23, 2014 at 18:29 | Permalink
