- 输入层input layer:数据的属性
- hidden layer
- output layer
- 每一层有很多单元,单元之间有连接
- The network is feed-forward: None of the weights cycles back to an input unit
- network perform nonlinear regression: given enough hidden units and enough training samples, they can closely approximate any function.
- Define a Network Topology
- #of each layer
- normalize the input values to [0,1]
- 输入层的单元数量由输入数据的维度来确定
- 输出层的单元数目由分类的种类来确定,每个单元的输出结果就是0 or 1
- Once the network has been trained and its accuracy is unacceptable,
- a different set of initial weights [一开始设成随机较小的数字, associated with biases]
- a different network topology
- 层之间的连接一般采用全连接
- 根据误差向后反馈去调整权重,朝正确的结果前进。数据向前传播,而反馈向后传播
- modified —>minimize the mean squared error
- back propagate the error
- propagate the data
- 直到权重的调整幅度较小时,便可以停止了
- wait 权重如何调整
- weakness and strength
- efficiency and interpretability
- 每次迭代都遍历整个数据集,学习时间长
- 决策树的学习结果可以转化为人可以理解的规则
- 但是神经网络的学习结果只是一些权重,从权重到规则是很困难的一件事。如果一个模型人们能够理解,人们会更愿意去学习它。
- 但是神经网络的精确度还是不错的
- 对噪音的容忍度比较高,其学习结果比较稳定
- well-suited for continuous-valued nuts and outputs数据可以是连续的,both输入&输出,所以神经网络不仅可以用作分类,也可以用作预测
- 有并行的基础
支持向量机 Support Vector Machines
1990+提出来的,分类精度比较高,有很好的数学基础,不足是训练时间比较长
- classification: a mathematical mapping
- linear classification: 线 面 超平面;
- a relative new classification machines for both linear and non-linear data
- hyperplane decision boundary
- 把非线性的数据映射到高维的线性 model complex nonlinear decision boundary
- When data is linearly separable
- maximum marginal hyperplane: we want to find to best line/hyperplane to minimize classification error on unseen data
- 最终的结果只和support vector 有关:support vector really matter,所以学习的过程虽然很慢,但是学习的成果用作分类时很快
Above are eager learning
Lazy learning:
K-Nearest Neighbor Algorithm
假设:空间上接近的数据,其分类的结果也相似
如果要快速找到k个邻居,需要做一个索引来帮助快速时间。
在空间数据库中,找R树,降低到log n数量级
Vonoroi diagram: the decision surface deduced by 1NN
For real-valued prediction
distance-weighted nearest
以上是分类
Cluster Analysis
- feature:
- similar to one another within the same group
- find similarities between data objects then group similar data objects into cluster
- unsupervised learning
- applications
- stand-alone tool to get insight into data distribution
- preprocessing step for other algorithms [将非常大的数据量进行细化]
- quality
- cohesive
- distinctive
- depends on:
- similarity measure
- implementation
- ability to find the hidden patterns
- measure the quality of clustering
- the distance functions are usually rather different for interval -scaled
- considerations for cluster analysis
- single level—>multi-level hierarchical partitioning is desirable
- exclusive —>non-exclusive one document may belong to more than one class
- distance-based (road network)—>connectivity-based
- clustering space: full space—>subspaces (high-dimensional clustering )
- scalability:
- ability to deal with different types of attributes
- constraint-based clustering
- insensitivity to input order
- discover of clusters with arbitrary shape [ because of the definition of distance]
- challenges: high dimensionality
- major approaches:
- 划分、层次、基于密度(可以发现任意形状的聚类,based on the connectivity and density)、基于网格的、基于模型的、基于频繁模式的、 user-guided or constraint-based、link-based clustering
10.22
k-means algorithm
- arbitrarily partition objects into k nonempty subsets
- centroids of the clusters of the current partitioning
- assign each object to the cluster with the nearest seed
- go back to step 2—>update centroids
strength:
- 在迭代次数 类的个数都不大的情况下,基本上可以看作是一个线性的算法
weakness:
- 对属性值有要求:applicable only to objects in a continuous n-dimensional space
- in comparison, k-medoids can be applied to a wide range of data
- sensitive to noisy data and outliers [distort distribution]
- need to specify k
- bit suitable to discover clusters with non-convex shapes [由于采用的是距离,所以分类的结果基本上都是圆形,长条形的分类结果不容易被察觉]
k-Medoids Algorithm
- randomly select a non-medoid object
- compute total cost of swapping [选一个新的medoid,看看目标函数——距离的平方和有没有下降,如果有的话,那就改变 medoid,每一步都是随机在尝试]
similarity:
每次迭代时都重新归类
Hierarchical Clustering
This method doesn’t require the number of clusters k, but needs a termination condition or there is one cluster in the end.
类和类之间的距离如何计算:
- 计算两类点之间最短的距离
- 两类点之间的距离的平方和
- 既有点和点之间的距离,又有点和类之间的距离,还有类和类之间的距离
- termination conditions such as when the minimum distance between two clusters go over one threshold.
- 此种思想既可以正着来,也可以倒着来:divisive or agglomerative
- dendrogram: shows how clusters are merged
- cutting the dendrogram at the desired level, then each connected component forms a cluster
- distance between two clusters:
- single link
- complete link
- average
- centroid
- medoid: distance between the medoids of two clusters
Density-Based Clustering Methods
major features:
- discover clusters of arbitrary shape
- handle noise
- one scan
- need density parameters as termination conditions
Two parameters:
- Eps: Maximum radius of the neighborhood
- MinPts: Minium number of points in an Eps-neighbouthood of that point
- many applications:
text documents, DNA microarray
- challenges:
- irrelevant dimensions
- distance measure becomes meaningless:
- traditional distance measure could be dominated by nosies in many dimensions [eg:没有共同点的两个数据之间的距离和有共同点也有不同点的两个数据之间的距离在欧几里德空间是一样的,但是直觉上肯定是后者更为类似。]
- clusters may exist only in some subspaces
- Methods:
- subspace-clustering :clusters mat exist only in some subspaces
- dimensionality reduction approaches
Bi-clustering:
Cluster both objects and attributes
10.27日 推荐系统
高维的数据
聚类
#什么是推荐系统#
给一个未知的用户推荐他所感兴趣的事
Web enables near-zero-cost dissemination of information about products.——from scarcity to abundance
More choice necessitates better filters
eg: How Into Thin Air made Touching the Void
#Sidenote: The Long Tail#
#Formal Model#
Utility function u: (X,S)—>R
Utility Matrix
Get to know the
Extrapolate unknown ratings from the known ones
#key problem#
- Utility matrix U is sparse
most customers have not rated most items
- cool start: new items have no ratings& new customers have no history
#three approches to recommender system#
- content-based
movie recommendations
- collaborative
- latent factor based(analyze the utility—特征值和特征向量)
#content-based approach #
For each item, create an item profile. Profile is a set of features(e.g., movies include author title actor and director)
用向量来描述每个item
从item profile 到user profile
- weighted average of rated item profiles
- variation: weight by difference from average rating for item
prediction heuristic—向量夹角
Features
- no need for data on other users
- able to recommend to users with unique tastes
- able to recommend new and unpopular items(协同推荐偏向于推荐大家都买的畅销物品)
- able to provide explanations
- -finding the appropriate features is hard
- -recommendations for new users how to build a user profile
- -overspecialization(若用户本身数据较为单一的话,其推荐内容也停留在非常单一的程度上)
#Collaborative Filtering#
- 如何发现相似的用户(聚类可以,看看其他的处理方法)
- vector of user A’s ratings
- Jaccard similarity measure(只判断存在性,不管权重等,其实就是集合的相似性)
- cosine similarity measure
- solution--subtract the row mean (低分与高分之间的相乘结果求和竟然还大于高分与高分之间相乘的结果,只是因为后者共同出现的次数比较少)
- Pearson correlation coefficient
- Term frequency Inverse Doc Frequency
Evaluating the extrapolation
10.29日 图的算法
We often think of networks being organized into modules, cluster, communities.
Find micro-markets by partitioning the query to advertiser
- Edge Betweenness:number of shortest paths passing over the edge, 评价的是桥梁的繁忙度。社区时间的桥梁的中介度往往是最高的。中介度高的边往往链接的两个社区,那么久可以把这条边给去掉。去掉后就可以很明显地看到这个图就被分成了两个社区。[理想状态下,哈哈]去掉边后还联通的结点就可以被看作是属于同一个社区的。
计算图的最短路径,有多少条的最短路径是经过这条边的,考察的是任意的a到b这对结点的最短路近,如果这对节点有3条最短路径,只有一条最短路径是经过这个edge的话,那么就是1/3.
- Girvan-Newman Algorithm
- undirected unweighted graph
- 有点像聚类的层次算法
- Hierarchical decomposition
- How to compute Betweenness
- want to compute betweenness of paths starting at node A
- Breath first search starting from A, 以A为根进行排列,分为与A差1,2,3步的结点,只有纵向的边才会起作用
- Count the number of shortest paths from A
- Compute betweenness by working up the tree: If there are将结点的贡献转化到到达结点的边上。自己肯定是要过桥的,再看看有没有过了这座桥再往下走的点
- 由于数据量非常大,所以一般会用随机抽样的方法来进行计算,通常抽千分之一的比例去进行遍历,再根据这个抽样得出的中介度去划分社区
- Communities:sets of tightly connected nodes
- Define Modularity Q 需要有一个目标函数(实际边的数目和)
- Null Model: Configuration Model
- Given real G on n nodes and m edges, construct rewired network G’
- same degree distribution but random connections, consider G as a multigraph.
- The expected degree
- 现实含义:实际上边的个数比随机边的个数的期望要多,这样它才是在一个社区中
- modularity values take range[-1,1]
- modularity is useful for selecting the number of clusters
以上是和中介度有关的算法
- Graph Partitioning
- how can we define a “good” partition of G?
- How can we efficiently define such a partition?
- Graph Cut Criterion
- optimal cut sometimes isn’t minimum cut
- normalized-cut
- vol(A): total weight of the edges with at least one endpoint in A
- 不过这是一个NP问题
- Spectral Graph Partitioning
- A: adjacency matrix of undirected G
- Trawling
- 二部图
- Intuition: Many people all talking about the same things
-
- 比较大的社交网站往往有大于10亿个结点,