csatblogspotdotcom

Tuesday, January 10, 2017

(Repost) An introduction to neural networks - IBM

English original:
https://www.ibm.com/developerworks/library/l-neural/
Chinese translation:
https://www.ibm.com/developerworks/cn/linux/other/l-neural/
Code examples from the article:
http://gnosis.cx/download/neural_net_1.zip
The examples build on the following code:
http://www.enme.ucalgary.ca/~nascheme/python/bpnn.py
(bpnn: BackProp Neural Net)

An introduction to neural networks

Pattern learning with the back-propagation algorithm
Neural nets may be the future of computing. A good way to understand them is with a puzzle that neural nets can be used to solve. Suppose that you are given 500 characters of code that you know to be C, C++, Java, or Python. Now, construct a program that identifies the code's language. One solution is to construct a neural net that learns to identify these languages. This article discusses the basic features of neural nets and approaches to constructing them so you can apply them in your own coding.
Andrew Blais (onlymice@gnosis.cx), Author, Gnosis Software, Inc.
David Mertz, Developer, Gnosis Software, Inc.
01 July 2001
Also available in Japanese
According to a simplified account, the human brain consists of about ten billion neurons -- and a neuron is, on average, connected to several thousand other neurons. By way of these connections, neurons both send and receive varying quantities of energy. One very important feature of neurons is that they don't react immediately to the reception of energy. Instead, they sum their received energies, and they send their own quantities of energy to other neurons only when this sum has reached a certain critical threshold. The brain learns by adjusting the number and strength of these connections. Even though this picture is a simplification of the biological facts, it is sufficiently powerful to serve as a model for the neural net.

Threshold logic units (TLUs)

The first step toward understanding neural nets is to abstract from the biological neuron, and to focus on its character as a threshold logic unit (TLU). A TLU is an object that inputs an array of weighted quantities, sums them, and if this sum meets or surpasses some threshold, outputs a quantity. Let's label these features. First, there are the inputs and their weights: X1, X2, ..., Xn and W1, W2, ..., Wn. Then, there are the Xi*Wi that are summed, which yields the activation level a; in other words:
 a = (X1 * W1)+(X2 * W2)+...+(Xi * Wi)+...+ (Xn * Wn)
The threshold is called theta. Lastly, there is the output: y. When a >= theta, y=1, else y=0. Notice that the output doesn't need to be discontinuous, since it could also be determined by a squashing function, s (or sigma), whose argument is a, and whose value is between 0 and 1. Then, y=s(a).
Figure 1. Threshold logic unit, with sigma function (top) and cutoff function (bottom)
A TLU can classify. Imagine a TLU that has two inputs, whose weights equal 1, and whose theta equals 1.5. When this TLU inputs <0,0>, <0,1>, <1,0>, and <1,1>, it outputs 0, 0, 0, and 1 respectively. This TLU classifies these inputs into two groups: the 1 group and the 0 group. Insofar as a human brain that knows about logical conjunction (Boolean AND) would similarly classify logically conjoined sentences, this TLU knows something like logical conjunction.
This TLU has a geometric interpretation that clarifies what is happening. Its four possible inputs correspond to four points on a Cartesian graph. From X1*W1+ X2*W2 = theta, in other words, the point at which the TLU switches its classificatory behavior, it follows that X2 = -X1 + 1.5. The graph of this equation cuts the four possible inputs into two spaces that correspond to the TLU's classifications. This is an instance of a more general principle about TLUs. In the case of a TLU with an arbitrary number of inputs, N, the set of possible inputs corresponds to a set of points in N-dimensional space. If these points can be cut by a hyperplane -- in other words, an N-dimensional geometric figure corresponding to the line in the above example -- then there is a set of weights and a threshold that define a TLU whose classifications match this cut.
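As a quick illustration, here is a minimal Python sketch (ours, not part of the article's package) of the two-input TLU just described, with both weights equal to 1 and theta equal to 1.5:

def tlu(inputs, weights, theta):
    # Sum the weighted inputs; fire (output 1) at or above the threshold.
    a = sum(x * w for x, w in zip(inputs, weights))
    return 1 if a >= theta else 0

for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pair, '->', tlu(pair, (1, 1), 1.5))
# Prints 0, 0, 0, and 1 respectively: the 0 group and the 1 group.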

How a TLU learns

Since TLUs can classify, they know stuff. Neural nets are also supposed to learn. Their learning mechanism is modeled on the brain's adjustments of its neural connections. A TLU learns by changing its weights and threshold. Actually, the weight-threshold distinction is somewhat arbitrary from a mathematical point of view. Recall that the critical point at which a TLU outputs 1 instead of 0 is when the SUM(Xi * Wi) >= theta. This is equivalent to saying that the critical point is when the SUM(Xi * Wi)+ (-1 * theta) >= 0. So, it is possible to treat -1 as a constant input whose weight, theta, is adjusted in learning, or, to use the technical term, training. In this case, y=1 when SUM(Xi * Wi)+ (-1 * theta) >= 0, else y=0.
During training, a neural net inputs:
  1. A series of examples of the items to be classified
  2. Their proper classifications or targets
Such input can be viewed as a vector: <X1, X2, ..., Xn, theta, t>, where t is the target or true classification. The neural net uses these to modify its weights, and it aims to match its classifications with the targets in the training set. More precisely, this is supervised training, as opposed to unsupervised training. The former is based on examples accompanied by targets, whereas the latter is based on statistical analysis (see Resources later in this article). Weight modification follows a learning rule. One idealized learning algorithm goes something like this:
Listing 1. Idealized learning algorithm
fully_trained = FALSE
DO UNTIL (fully_trained):
    fully_trained = TRUE
    FOR EACH training_vector = <X1, X2, ..., Xn, target>:
                               # Weights compared to theta
        a = (X1 * W1)+(X2 * W2)+...+(Xn * Wn) - theta
        y = sigma(a)
        IF y != target:
            fully_trained = FALSE
        FOR EACH Wi:
            MODIFY_WEIGHT(Wi)   # According to the training rule
    IF (fully_trained):
        BREAK
You're probably wondering, "What training rule?" There are many, but one plausible rule is based on the idea that weight and threshold modification should be determined by a fraction of (t - y). This is accomplished by introducing alpha (0 < alpha < 1), which is called the learning rate. The change in Wi equals (alpha * (t - y) * Xi). When alpha is close to 0, the neural net will engage in more conservative weight modifications, and when it is close to 1, it will make more radical weight modifications. A neural net that uses this rule is known as a perceptron, and this rule is called the perceptron learning rule. One result about perceptrons, due to Rosenblatt, 1962 (see Resources), is that if a set of points in N-space is cut by a hyperplane, then the application of the perceptron training algorithm will eventually result in a weight distribution that defines a TLU whose hyperplane makes the wanted cut. Of course, to recall Keynes, eventually, we're all dead, but more than computational time is at stake, since we may want our neural net to make more than one cut in the space of possible inputs.
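To make the rule concrete, here is a minimal sketch (our illustration, not the article's code) of perceptron training, with theta folded in as the weight on a constant -1 input as described earlier:

import random

def train_perceptron(examples, alpha=0.1, max_epochs=1000):
    # examples: list of (inputs, target) pairs.
    # theta is learned as the weight on an extra constant input of -1.
    n = len(examples[0][0])
    weights = [random.uniform(-0.5, 0.5) for _ in range(n + 1)]
    for _ in range(max_epochs):
        fully_trained = True
        for inputs, target in examples:
            xs = list(inputs) + [-1]        # the constant input
            a = sum(x * w for x, w in zip(xs, weights))
            y = 1 if a >= 0 else 0
            if y != target:
                fully_trained = False
                for i, x in enumerate(xs):  # Wi += alpha * (t - y) * Xi
                    weights[i] += alpha * (target - y) * x
        if fully_trained:
            break
    return weights

# Boolean AND is cut by a hyperplane, so this converges:
print(train_perceptron([((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]))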
Our initial puzzle illustrates this. Suppose that you were given N characters of code that you knew to be either C, C++, Java, or Python. The puzzle is to construct a program that identifies the code's language. A TLU that could do this would need to make more than one cut in the space of possible inputs. It would need to cut this space into four regions, one for each language. This would be accomplished by training a neural net to make two cuts. One cut would divide the C/C++ from the Java/Python, and the other cut would divide the C/Java from the C++/Python. A net that could make these cuts could also identify the language of a source code sample. However, this requires our net to have a different structure. Before we describe this difference, let's briefly turn to practical considerations.
Figure 2. Preliminary (and insufficient) Perceptron Learning Model
Considerations of computational time rule out taking N-character chunks of code, counting the frequency of the visible ASCII characters, 32 through 127, and training our neural net on the basis of these counts and target information about the code's language. Our way around this was to limit our character counts to the twenty most frequently occurring non-alphanumeric characters in a pool of C, C++, Java, and Python code. For reasons that concern the implementation of floating point arithmetic, we decided to train our net with these twenty counts divided by a normalizing factor. Clearly, one structural difference is that our net has twenty input nodes, but this should not be surprising, since our description has already suggested this possibility. A more interesting difference is a pair of intermediary nodes, N1 and N2, and an increase in the number of output nodes from two to four, O1-O4.
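The counting itself is simple. As a sketch (ours; the real code2data.py may differ in its details), using the twenty-character input set reported in Listing 5 below and normalizing by the length of the sample:

TRACKED = [')', '(', '_', '.', '=', ';', '"', ',', "'", '*',
           '/', '{', '}', ':', '-', '0', '+', '1', '[', ']']

def code_vector(source_text):
    # One normalized count per tracked character: twenty input values.
    n = max(len(source_text), 1)
    return [source_text.count(ch) / n for ch in TRACKED]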
N1 would be trained so that upon seeing C or C++, it would set y1=1, and upon seeing Java or Python, it would set y1=0. N2 would be similarly trained. Upon seeing C or Java, N2 would set y2=1, and upon seeing C++ or Python, it would set y2=0. Furthermore, N1 and N2 would output 1 or 0 to the Oi. Now, if N1 sees C or C++, and N2 sees C or Java, then the code in question is C. Moreover, if N1 sees C or C++, and N2 sees C++ or Python, the code is C++. The pattern is obvious. So, suppose that the Oi were trained to output 1 or 0 on the basis of the following table:
Intermediate nodes mapped to output (as Boolean functions)
N1  N2  |  O1 (C)  O2 (C++)  O3 (Java)  O4 (Python)
0   0   |  0       0         0          1
0   1   |  0       0         1          0
1   0   |  0       1         0          0
1   1   |  1       0         0          0
If this worked, our net could identify the language of a code sample. It is a pretty idea, and its practical ramifications seem to be incredibly far reaching. However, this solution presupposes that the C/C++ and the Java/Python inputs are cut by one hyperplane, and that the C/Java and the C++/Python inputs are cut by the other. There is an approach to net training that bypasses this kind of assumption about the input space.

The delta rule

Another training rule is the delta rule. The perceptron training rule is based on the idea that weight modification is best determined by some fraction of the difference between target and output. The delta rule is based on the idea of gradient descent. This difficult mathematical concept has a prosaic illustration. From some given point, a Southward path may be steeper than an Eastward path. Walking East may take you off a cliff, while walking South may only take you along its gently sloping edge. West would take you up a steep hill, and North leads to level ground. All you want is a leisurely walk, so you seek ground where the overall steepness of your options is minimized. Similarly, in weight modification, a neural net can seek a weight distribution that minimizes error.
Limiting ourselves to nets with no hidden nodes, but possibly having more than one output node, let p be an element in a training set, and t(p,n) be the corresponding target of output node n. Further, let y(p,n) be determined by the squashing function s mentioned above, where a(p,n) is n's activation relative to p; that is, y(p,n) = s( a(p,n) ), the squashed activation of node n relative to p. Setting the weights (each Wi) for a net also sets the difference between t(p,n) and y(p,n) for every p and n, and this means setting the net's overall error for every p. Therefore, for any set of weights, there is an average error. However, the delta rule rests on refinements in the notions of average and error. Instead of discussing the minutiae, we shall just say that the error relative to some p and n is: ½ * square( t(p,n) - y(p,n) ) (see Resources). Now, for each Wi, there is an average error defined as:
Listing 2: Finding the average error
sum = 0
FOR p = 1 TO M:         # M is number of training vectors
    FOR n = 1 TO N:     # N is number of output nodes
        sum = sum + (1/2 * (t(p,n)-y(p,n))^2)
average = 1/M * sum
The delta rule is defined in terms of this definition of error. Because error is explained in terms of all the training vectors, the delta rule is an algorithm for taking a particular set of weights and a particular vector, and yielding weight changes that would take the neural net on the path to minimal error. We won't discuss the calculus underpinning this algorithm. Suffice it to say that the change to any Wi is:
 alpha * s'(a(p,n)) * (t(p,n) - y(p,n)) * X(p,i,n).
X(p,i,n) is the ith element in p that is input into node n, and alpha is the already noted learning rate. Finally, s'( a(p,n) ) is the rate of change (derivative) of the squashing function at the activation of the nth node relative to p. This is the delta rule, and Widrow and Stearns (see Resources) showed that when alpha is sufficiently small, the weight vector approaches a vector that minimizes error. A delta rule-based algorithm for weight modification looks like this.
Downward slope (follow until error is suitably small)
step 1: for each training vector, p, find a(p)
step 2: for each i, change Wi by:
            alpha * s'(a(p,n)) * (t(p,n)-y(p,n)) * X(p,i,n)
There are important differences from the perceptron algorithm. Clearly, there are quite different analyses underlying the weight change formulae. The delta rule algorithm always makes a change in weights, and it is based on activation as opposed to output. Lastly, it isn't clear how this applies to nets with hidden nodes.
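A minimal sketch of one such pass (ours, not the article's code, using the common sigmoid squashing function, for which s'(a) = y * (1 - y)):

import math

def sigma(a):
    return 1.0 / (1.0 + math.exp(-a))   # the squashing function s

def delta_rule_epoch(examples, weights, alpha=0.5):
    # One pass over the training set for a single output node, no hidden
    # nodes: delta Wi = alpha * s'(a) * (t - y) * Xi.
    for inputs, target in examples:
        a = sum(x * w for x, w in zip(inputs, weights))
        y = sigma(a)
        for i, x in enumerate(inputs):
            weights[i] += alpha * y * (1.0 - y) * (target - y) * x
    return weights

Note that, unlike the perceptron rule, the weights change on every pattern, by an amount scaled by the slope of the squashing function at the current activation.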

Back-propagation

Back-propagation is an algorithm that extends the analysis that underpins the delta rule to neural nets with hidden nodes. To see the problem, imagine that Bob tells Alice a story, and then Alice tells Ted. Ted checks the facts, and finds that the story is erroneous. Now, Ted needs to find out how much of the error is due to Bob and how much to Alice. When output nodes take their inputs from hidden nodes, and the net finds that it is in error, its weight adjustments require an algorithm that will pick out how much the various nodes contributed to its overall error. The net needs to ask, "Who led me astray? By how much? And, how do I fix this?" What's a net to do?
Figure 3. "Code Recognizer" back-propagation neural network

The back-propagation algorithm also rests on the idea of gradient descent, and so the only change in the analysis of weight modification concerns the difference between t(p,n) and y(p,n). Generally, the change to Wi is:
 alpha * s'(a(p,n)) * d(n) * X(p,i,n)
where d(n), for a hidden node n, turns on (1) how much n influences any given output node; and (2) how much that output node itself influences the overall error of the net. On the one hand, the more n influences an output node, the more n affects the net's overall error. On the other hand, if the output node influences the overall error less, then n's influence correspondingly diminishes. Where d(j) is output node j's contribution to the net's overall error, and W(n,j) is the influence that n has on j, d(j) * W(n,j) is the combination of these two influences. However, n almost always influences more than one output node, and it may influence every output node. So, d(n) is:
 SUM(d(j)*W(n,j)), for all j
where j is an output node that takes input from n. Putting this together gives us a training rule. First part: the weight change between hidden and output nodes, n and j, is:
 alpha * s'(a(p,j)) * (t(p,j) - y(p,j)) * X(p,n,j)
Second part: the weight change between input and hidden nodes, i and n, is:
 alpha * s'(a(p,n)) * sum(d(j) * W(n,j)) * X(p,i,n)
where j varies over all the output nodes that receive input from n. Moreover, the basic outline of a back-propagation algorithm runs like this.
Initialize Wi to small random values.

Steps to follow until error is suitably small

Step 1: Input training vector.
Step 2: Hidden nodes calculate their outputs.
Step 3: Output nodes calculate their outputs on the basis of Step 2.
Step 4: Calculate the differences between the results of Step 3 and targets.
Step 5: Apply the first part of the training rule using the results of Step 4.
Step 6: For each hidden node, n, calculate d(n).
Step 7: Apply the second part of the training rule using the results of Step 6.
Steps 1 through 3 are often called the forward pass, and steps 4 through 7 are often called the backward pass. Hence, the name: back-propagation.
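In code, one full pass for a net with a single hidden layer might look like this minimal sketch (ours, not bpnn.py; the sigmoid's derivative is written as y * (1 - y), and the steps follow the order listed above):

import math

def sigma(a):
    return 1.0 / (1.0 + math.exp(-a))

def backprop_pass(x, t, W_in, W_out, alpha=0.5):
    # W_in[i][n]: weight from input i to hidden node n.
    # W_out[n][j]: weight from hidden node n to output node j.
    n_hidden, n_out = len(W_out), len(t)
    # Steps 1-3: the forward pass.
    hidden = [sigma(sum(x[i] * W_in[i][n] for i in range(len(x))))
              for n in range(n_hidden)]
    output = [sigma(sum(hidden[n] * W_out[n][j] for n in range(n_hidden)))
              for j in range(n_out)]
    # Step 4: output errors; d(j) folds s'(a(p,j)) into the difference.
    d_out = [output[j] * (1 - output[j]) * (t[j] - output[j])
             for j in range(n_out)]
    # Step 5: first part of the rule -- hidden-to-output weights.
    for n in range(n_hidden):
        for j in range(n_out):
            W_out[n][j] += alpha * d_out[j] * hidden[n]
    # Step 6: d(n) = s'(a(p,n)) * SUM(d(j) * W(n,j)) for each hidden node.
    d_hid = [hidden[n] * (1 - hidden[n]) *
             sum(d_out[j] * W_out[n][j] for j in range(n_out))
             for n in range(n_hidden)]
    # Step 7: second part of the rule -- input-to-hidden weights.
    for i in range(len(x)):
        for n in range(n_hidden):
            W_in[i][n] += alpha * d_hid[n] * x[i]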

Recognizing success

With the back-propagation algorithm in hand, we can turn to our puzzle of identifying the language of source code samples. To do this, our solution extends Neil Schemenauer's Python module bpnn (see Resources). Using his module is amazingly simple. We customized the class NN in our class NN2, but our changes only modify the presentation and output of the process, nothing algorithmic. The basic code looks like:
Listing 3: Setting up a neural network with bpnn.py
# Create the network (number of input, hidden, and training nodes)
net = NN2(INPUTS, HIDDEN, OUTPUTS) 

# create the training and testing data
trainpat = []  
testpat = []  
for n in xrange(TRAINSIZE+TESTSIZE):  
   #... add vectors to each set
   
# train it with some patterns
net.train(trainpat, iterations=ITERATIONS, N=LEARNRATE, M=MOMENTUM)  

# test it 
net.test(testpat)  

# report trained weights 
net.weights()
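For reference, each pattern that bpnn.py trains on is a pair of lists, [inputs, targets], so the elided loop amounts to something like this sketch (the variable vectors, holding parsed (inputs, targets) pairs, is our assumption):

for n, (inputs, targets) in enumerate(vectors):
    if n < TRAINSIZE:
        trainpat.append([inputs, targets])
    else:
        testpat.append([inputs, targets])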
Of course, we need input data. Our utility code2data.py provides this. Its interface is straightforward: just put a bunch of source files with various extensions into a subdirectory called ./code, and then run the utility listing these extensions as options, for example:
    python code2data.py py c java
What you get is a bunch of vectors on STDOUT, which you can pipe to another process or redirect to a file. This output looks something like:
Listing 4: Code2Data output vectors
0.15 0.01 0.01 0.04 0.07 0.00 0.00 0.03 0.01 0.00 0.00 0.00 0.05 0.00 > 1 0 0
0.14 0.00 0.00 0.05 0.13 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.13 0.00 > 1 0 0
[...]
Recall that the input values are normalized numbers of occurrences of various special characters. The target values (after the greater-than sign) are YES/NO flags representing the type of source code file containing these characters. But there is nothing very obvious about what is what. That's the great thing: the data could be anything that you can generate inputs and targets for.
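A line in this format is easy to split into an (inputs, targets) pair; here is one possible parse (our sketch, not code from the article's package):

def parse_vector(line):
    # "0.15 0.01 ... > 1 0 0"  ->  ([0.15, 0.01, ...], [1.0, 0.0, 0.0])
    left, right = line.split('>')
    return ([float(tok) for tok in left.split()],
            [float(tok) for tok in right.split()])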
The next step is to run the actual code_recognizer.py program. This takes (on STDIN) a collection of vectors like those above. The program has a wrapper that deduces how many input nodes (count and target) are needed, based on the actual input file. Choosing the number of hidden nodes is trickier. For source code recognition, 6 to 8 hidden nodes seem to work very well. You can override all the parameters on the command line, if you want to experiment with the net to discover how it does with various options -- each run might take a while, though. It is worth noting that code_recognizer.py sends its (large) test result file to STDOUT, but puts some friendly messages on STDERR. So most of the time, you will want to direct STDOUT to a file for safe keeping, and watch STDERR for progress and result summaries.
Listing 5: Running code_recognizer.py
> code2data.py py c java | code_recognizer.py > test_results.txt 
Total bytes of py-source: 457729 
Total bytes of c-source: 245197 
Total bytes of java-source: 709858 
Input set: ) ( _ . = ; " , ' * / { } : - 0 + 1 [ ] 
HIDDEN = 8 
LEARNRATE = 0.5 
ITERATIONS = 1000 
TRAINSIZE = 500 
OUTPUTS = 3 
MOMENTUM = 0.1 
ERROR_CUTOFF = 0.01 
TESTSIZE = 500 
INPUTS = 20 
error -> 95.519... 23.696... 19.727... 14.012... 11.058... 9.652...  
8.858... 8.236... 7.637... 7.065... 6.398... 5.413... 4.508...  
3.860... 3.523... 3.258... 3.026... 2.818... 2.631... 2.463...  
2.313... 2.180... 2.065... 1.965... 1.877... 1.798... 1.725...  
[...] 
0.113... 0.110... 0.108... 0.106... 0.104... 0.102... 0.100...  
0.098... 0.096... 0.094... 0.093... 0.091... 0.089... 0.088...  
0.086... 0.085... 0.084... 
Success rate against test data: 92.60%
The decreasing error is a nice reminder, and acts as a sort of progress meter during long runs. But the final result is what is impressive. The net does a quite respectable job, in our opinion, of recognizing code -- we would love to hear how it does on your data vectors.

Summary

We have explained the basics of the neural net in a way that should allow you to begin to apply them in your own coding. We encourage you to take what you have learned here and attempt to write your own solution to our puzzle. (See Resources for our solution.)

Resources

  • Our Code Recognizer program is based on Neil Schemenauer's back propagation module.
  • For the distinction between supervised and unsupervised training, and neural nets in general, see Machine Learning, Neural and Statistical Classification, edited by D. Michie, D.J. Spiegelhalter, and C.C. Taylor. Specifically, see Chapter 6.
  • For Rosenblatt's result about perceptrons, see his Principles of Neurodynamics, 1962, New York: Spartan Books.
  • For some of the minutiae of the delta rule, see Kevin Gurney's excellent An Introduction to Neural Networks, 1997, London: Routledge. Also see Neural Nets for an early on-line version.
  • For the proof about the delta rule, see: B. Widrow and S.D. Stearns, Adaptive Signal Processing, 1985, New Jersey: Prentice-Hall.
  • For an implementation of the perceptron with a GUI, see The Perceptron by Omri Weisman and Ziv Pollack.
  • What is a subject without its FAQ? See the ever useful Neural Net FAQ.
  • For a wide-ranging collection of links, see The Backpropagator's Review.
  • University courses are a rich source of information for the beginner in any subject. Consult this list of courses.
  • The Neural Network Package is a LGPL package written in Java. It permits experimentation with all the algorithms discussed here.
  • See Neural Networks at your Fingertips for a set of C packages that illustrate Adaline networks, back-propagation, the Hopfield model, and others. A particularly interesting item is The Back-propagation Network, a C package that illustrates a net that analyzes sunspot data.
  • At Neural Networking Software, you will find neural net code with graphical interfaces, and it's both DOS and Linux friendly.
  • NEURObjects provides C++ libraries for neural network development, and this has the advantage of being object oriented.
  • Stuttgart Neural Network Simulator (SNNS) is what the name says. It is written in C with a GUI. It has an extensive manual, and is also Linux friendly.
  • Browse more Linux resources on developerWorks.
  • Browse more Open source resources on developerWorks.
