ScriptId

Purpose

ScriptId is a program and a library that can be used to determine whether a given text file contains code in a specified programming language.

The current release can tell whether or not a file contains VBScript. It should be possible to extend this to any number of other languages.

Target Audience

People who want to write plug-ins for anti-virus or content filtering programs to detect whether or not a file is VBScript (or any other type of script). It can be used either as a C/C++ library or as an executable called from a script. It has not been specifically targeted at identifying viruses, although that is theoretically possible.

Concepts

This started from the idea that every programming language has its own unique words, and its own way of using them, compared with any other programming language and with a normal document like this one. I started by building a list of reserved words, which is stored in vbscript.words. I also collected all the symbols that can separate reserved words, such as ";" and "+", and placed them in vbscript.seperators. I use these two lists to parse the target file and determine the ratio of specialized separators to normal separators such as white space. I also calculate the ratio of reserved words to other words and generate a histogram of reserved words, normalized by the total number of reserved words. These statistics are fed to a neural network for training, and the result is a neural net that can identify VBScript files.
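The statistics described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the ScriptId source: the function name `extract_features`, the abbreviated word list, and the separator set are all assumptions made for the example, and the real lists live in vbscript.words and vbscript.seperators.

```python
from collections import Counter

# Hypothetical, heavily abbreviated stand-ins for the contents of
# vbscript.words and vbscript.seperators.
RESERVED_WORDS = ["dim", "end", "function", "if", "set", "sub", "then"]
SPECIAL_SEPARATORS = set(";+-*/=()&,.<>\"'")


def extract_features(text, reserved_words=RESERVED_WORDS,
                     special=SPECIAL_SEPARATORS):
    """Return [separator ratio, reserved-word ratio] + normalized histogram."""
    # Ratio of specialized separators to normal (white space) separators.
    n_special = sum(1 for ch in text if ch in special)
    n_white = sum(1 for ch in text if ch.isspace())
    separator_ratio = n_special / n_white if n_white else 0.0

    # Turn every special separator into a space, then split into words.
    # VBScript is case-insensitive, so words are compared in lower case.
    table = str.maketrans({ch: " " for ch in special})
    words = text.translate(table).lower().split()

    # Ratio of reserved words to all other words.
    reserved_set = set(reserved_words)
    counts = Counter(w for w in words if w in reserved_set)
    n_reserved = sum(counts.values())
    n_other = len(words) - n_reserved
    reserved_ratio = n_reserved / n_other if n_other else 0.0

    # Histogram of reserved words, normalized by the reserved-word total.
    histogram = [counts[w] / n_reserved if n_reserved else 0.0
                 for w in reserved_words]
    return [separator_ratio, reserved_ratio] + histogram


sample = "Dim x\nSet x = 1\nIf x Then y = x + 2"
features = extract_features(sample)
```

The resulting list has a fixed length for a given word list, which is what makes it usable as the input vector of a neural network.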

To train a neural network you need example files to work with. At this stage I have 25,000 of them. I do not intend to share these files, for several reasons. The most important is that I want to encourage people who find that ScriptId made a mistake to send me their files, knowing that I will not share them with anyone. If you want to train the neural net yourself, I can provide the data files containing all the statistics of the files. Nothing in those data files allows you to recreate the original files, but they contain enough information to train the neural network.

Neural networks are very complex curve-fitting algorithms. They allow you to take a small amount of data and extrapolate from it to other data. This allows me to take a few VBScript files and use them to identify other VBScript files. Those of you who have used curve fitting in the past will know that the technique is only as good as your sample points. It can make mistakes. My last test showed ScriptId to be 96.3% accurate on totally unknown files. Obviously I want to improve on that, so I need to add misidentified files to the data set so that additional training can improve identification.
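To make the curve-fitting idea concrete, here is a minimal sketch of training on a few sample points and then measuring accuracy on files the model has never seen. It does not use libneural or the real ScriptId network: a single logistic neuron stands in for the network, the function names are hypothetical, and the two-value feature vectors (separator ratio, reserved-word ratio) are made-up numbers chosen only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(samples, labels, epochs=2000, rate=0.5):
    """Fit one logistic neuron with plain stochastic gradient descent."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            err = y - score(weights, bias, x)
            weights = [w + rate * err * xi for w, xi in zip(weights, x)]
            bias += rate * err
    return weights, bias


def accuracy(weights, bias, samples, labels):
    """Fraction of samples whose thresholded output matches the label."""
    hits = sum((score(weights, bias, x) >= 0.5) == bool(y)
               for x, y in zip(samples, labels))
    return hits / len(samples)


# Made-up training points: 1 = script, 0 = ordinary text.
train_x = [[0.4, 0.5], [0.3, 0.6], [0.5, 0.3],
           [0.05, 0.0], [0.1, 0.05], [0.02, 0.01]]
train_y = [1, 1, 1, 0, 0, 0]

# Made-up "unknown" files that were not part of the training set.
test_x = [[0.4, 0.45], [0.06, 0.02]]
test_y = [1, 0]

weights, bias = train(train_x, train_y)
held_out_accuracy = accuracy(weights, bias, test_x, test_y)
```

An unknown file that lands far from every training point is where such a fit goes wrong, which is why adding misidentified files to the training set matters.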

License

Lesser GPL

Operating Systems

Currently there is support for the following operating systems:

  • Linux

Download

There are three types of download available for this project: the executables and libraries, the neural network weights, and the data files. They are packaged separately because the data files and neural network weights can, and most likely will, change without any need for the executables to change. You need to download at least the executables and the neural network weights.

It is important to regularly update the neural network weights package.

The README contains a lot of useful information. Please read it.

The CHANGELOG is available.

I have started an FAQ. Please submit more questions as you think of them.

The data files needed to train the neural network are not currently on this page. Please contact me if you need them. They are 68MB in size.

Version  Release           File                              Notes
0.0.3    1 March 2003      Source .tar.bz2                   Major Bug Fix
                           Executable binary .rpm            Built on Mandrake Linux 9.0
                           Source RPM                        Major Bug Fix
                           Neural Network Weights RPM        2003/03/01 08:48
                           Neural Network Weights .tar.bz2   2003/03/01 08:48
0.0.2    26 February 2003  Source .tar.bz2                   Added detection of text files
                           Executable binary .rpm            Built on Mandrake Linux 9.0
                           Source RPM                        Added detection of text files
                           Neural Network Weights RPM        2003/02/26 16:21
                           Neural Network Weights .tar.bz2   2003/02/26 16:21
0.0.1    22 February 2003  Source .tar.bz2                   First Release
                           Executable binary .rpm            Built on Mandrake Linux 9.0
                           Source RPM                        First Release
                           Neural Network Weights RPM        2003/02/22 08:22
                           Neural Network Weights .tar.bz2   2003/02/22 08:22

Links

  • libneural, which formed the basis of the neural network