Scriptid is a program and a library that can be used to determine whether a given text file contains code in a specified programming language.
The current release can tell whether or not a file contains VBScript. It should be possible to extend this to any number of other languages.
This started from the idea that every programming language has unique words, and a unique usage of those words, relative to any other programming language and to a normal document like this one. I started by building a list of reserved words, which is stored in vbscript.words. I also collected all the symbols that can separate reserved words, such as ";" and "+", and placed them in vbscript.seperators. I use these two files to parse the target file and determine the ratio of specialized separators to normal separators such as white space. I also calculate the ratio of reserved words to other words and generate a histogram of reserved words, normalized by the total number of reserved words. These statistics are used to train a neural network, and the result is a neural net that can identify VBScript files.
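The statistics described above can be sketched roughly as follows. This is only an illustration, not scriptid's actual code: the function name `extract_features`, the short word and separator lists, and the choice to express the separator ratio as a share of all separators (so it stays between 0 and 1) are all assumptions made for the example. The real lists live in vbscript.words and vbscript.seperators.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the contents of vbscript.words and
# vbscript.seperators; the real files are much longer.
RESERVED_WORDS = ["dim", "end", "function", "if", "then"]
SPECIAL_SEPARATORS = [";", "+", "(", ")", "=", ","]


def extract_features(text, reserved=RESERVED_WORDS, separators=SPECIAL_SEPARATORS):
    """Return [separator ratio, reserved-word ratio, histogram...] for text."""
    sep_set = set(separators)
    reserved_set = set(reserved)

    # Specialized separators versus normal (white space) separators.
    # Assumption: the ratio is taken against all separators to keep it in [0, 1].
    special = sum(1 for ch in text if ch in sep_set)
    whitespace = sum(1 for ch in text if ch.isspace())
    total_separators = special + whitespace
    separator_ratio = special / total_separators if total_separators else 0.0

    # Split the text into words on any run of separators or white space.
    pattern = "[" + "".join(re.escape(s) for s in separators) + r"\s]+"
    words = [w.lower() for w in re.split(pattern, text) if w]

    # Reserved words relative to all words.
    counts = Counter(w for w in words if w in reserved_set)
    reserved_total = sum(counts.values())
    reserved_ratio = reserved_total / len(words) if words else 0.0

    # Histogram of reserved words, normalized by the number of reserved words.
    histogram = [
        counts[w] / reserved_total if reserved_total else 0.0 for w in reserved
    ]
    return [separator_ratio, reserved_ratio] + histogram


if __name__ == "__main__":
    sample = "Dim x\nIf x = 1 Then\nEnd If"
    print(extract_features(sample))
```

Each file is reduced to a fixed-length list of numbers, which is the kind of input a neural network needs. None of the original text can be rebuilt from that list.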
To train a neural network you need example files to work with. At this stage I have 25,000 of them. I do not intend to share these files. There are several reasons, the most important being that I want to encourage people who find that scriptid made a mistake to share their files with me, knowing that I will not share them with anyone else. If you want to train the neural net yourself, I can provide the data files containing all the statistics of the files. Nothing in those files will allow you to recreate the original files, but they do contain enough information to train the neural network.
Neural networks are very complex curve-fitting algorithms. They allow you to take a small amount of data and extrapolate from it to other data. This allows me to take a few VBScript files and use them to identify other VBScript files. Any of you who have used curve fitting in the past will know that the technique is only as good as your sample points. It can make mistakes. My last test showed scriptid to be 96.3% accurate on totally unknown files. Obviously I want to improve that, so I need to add misidentified files to the data set so that identification can be improved by additional learning.
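To make the curve-fitting idea concrete, here is a minimal sketch that fits a single logistic neuron to a handful of feature vectors and then checks it on samples it has not seen. It is not scriptid's network, whose architecture is not described here. The numbers are synthetic values invented purely for illustration, and the names `train`, `predict` and `accuracy` are hypothetical.

```python
import math

# Synthetic, made-up feature vectors: [separator ratio, reserved-word ratio].
# Label 1 = script, label 0 = ordinary text. These are NOT real scriptid data.
TRAIN_X = [[0.30, 0.60], [0.25, 0.50], [0.35, 0.55],
           [0.02, 0.05], [0.05, 0.10], [0.03, 0.02]]
TRAIN_Y = [1, 1, 1, 0, 0, 0]

# Samples that were not used for training.
TEST_X = [[0.30, 0.55], [0.03, 0.055]]
TEST_Y = [1, 0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=2000, rate=0.5):
    """Fit one logistic neuron by stochastic gradient descent."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            out = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = out - y  # gradient of the log loss with respect to z
            weights = [w - rate * err * xi for w, xi in zip(weights, x)]
            bias -= rate * err
    return weights, bias


def predict(model, x):
    """Return True when the neuron classifies x as a script."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5


def accuracy(model, samples, labels):
    """Fraction of samples whose prediction matches the label."""
    hits = sum(predict(model, x) == bool(y) for x, y in zip(samples, labels))
    return hits / len(samples)


if __name__ == "__main__":
    model = train(TRAIN_X, TRAIN_Y)
    print("accuracy on unseen samples:", accuracy(model, TEST_X, TEST_Y))
```

The toy data is cleanly separable, so the fit is easy. Real files overlap far more, which is why misidentified files are worth adding to the training set.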
There are three types of download available for this project: the executables and libraries, the neural network weights, and the data files. They are packaged separately because the data files and neural network weights can, and most likely will, change without any need for the executables to change. You need to download at least the executables and the neural network weights.
It is important to regularly update the neural network weights package.
The README contains a lot of useful information. Please read it.
The CHANGELOG is available.
I have started an FAQ. Please submit more questions as you think of them.
The data files needed to train the neural network are not currently on this page. Please contact me if you need them. They are 68MB in size.
| Version | Release | File | Notes |
|---|---|---|---|
| 0.0.3 | 1 March 2003 | Source .tar.bz2 | Major Bug Fix |
| | | Executable binary .rpm | Built on Mandrake Linux 9.0 |
| | | Source RPM | Major Bug Fix |
| | | Neural Network Weights RPM | 2003/03/01 08:48 |
| | | Neural Network Weights .tar.bz2 | 2003/03/01 08:48 |
| 0.0.2 | 26 February 2003 | Source .tar.bz2 | Added detection of text files |
| | | Executable binary .rpm | Built on Mandrake Linux 9.0 |
| | | Source RPM | Added detection of text files |
| | | Neural Network Weights RPM | 2003/02/26 16:21 |
| | | Neural Network Weights .tar.bz2 | 2003/02/26 16:21 |
| 0.0.1 | 22 February 2003 | Source .tar.bz2 | First Release |
| | | Executable binary .rpm | Built on Mandrake Linux 9.0 |
| | | Source RPM | First Release |
| | | Neural Network Weights RPM | 2003/02/22 08:22 |
| | | Neural Network Weights .tar.bz2 | 2003/02/22 08:22 |