
Zardoz: a lightweight WAF, based on Pseudo-Bayes machine learning.

Zardoz is a small WAF that aims to filter out HTTP calls which are well known to end in an HTTP error. It behaves like a reverse proxy, running as a frontend: it intercepts the calls, forwards them when needed, and learns how the server reacts from the status code.

After a while, the Bayesian classifier is able to tell a "good" HTTP call from a bad one, based on the header contents.
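
Conceptually, the learning loop works like this: tokenize the request headers, record the tokens together with the backend's status code, and later score new requests against those counts. The following is a minimal, self-contained sketch of that idea in Go — it is an illustration of the concept, not Zardoz's actual engine (which credits github.com/jbrukh/bayesian; all names here are made up):

```go
package main

import (
	"fmt"
	"strings"
)

// counts holds how often each header token was seen in good vs. bad requests.
type counts struct{ good, bad float64 }

type classifier struct {
	tokens map[string]*counts
	nGood  float64
	nBad   float64
}

func newClassifier() *classifier {
	return &classifier{tokens: map[string]*counts{}}
}

// learn records the tokens of one request together with its outcome,
// derived from the backend's status code (>= 400 means "bad").
func (c *classifier) learn(tokens []string, status int) {
	bad := status >= 400
	if bad {
		c.nBad++
	} else {
		c.nGood++
	}
	for _, t := range tokens {
		ct, ok := c.tokens[t]
		if !ok {
			ct = &counts{}
			c.tokens[t] = ct
		}
		if bad {
			ct.bad++
		} else {
			ct.good++
		}
	}
}

// score returns normalized per-class scores in [0,1] for a set of tokens.
func (c *classifier) score(tokens []string) (good, bad float64) {
	good, bad = 1, 1
	for _, t := range tokens {
		if ct, ok := c.tokens[t]; ok {
			// Laplace smoothing so unseen combinations never zero out.
			good *= (ct.good + 1) / (c.nGood + 2)
			bad *= (ct.bad + 1) / (c.nBad + 2)
		}
	}
	sum := good + bad
	return good / sum, bad / sum
}

func tokenize(header string) []string {
	return strings.Fields(strings.ToLower(header))
}

func main() {
	c := newClassifier()
	c.learn(tokenize("GET /index.html curl/7.1"), 200)
	c.learn(tokenize("GET /wp-admin/setup.php sqlmap"), 404)
	good, bad := c.score(tokenize("GET /wp-admin/setup.php sqlmap"))
	// bad should clearly outweigh good for the previously-seen scan.
	fmt.Printf("good=%.2f bad=%.2f\n", good, bad)
}
```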

It is designed to consume little memory and CPU, so you don't need powerful servers to keep it running, nor does it introduce high latency on the web server.

STATUS:

This is just an experiment I'm doing with Pseudo-Bayes classifiers. It works pretty well with my blog. Run it in production at your own risk.

Compiling:

Requirements:

  • golang >= 1.12.9

build:

```shell
git clone https://git.keinpfusch.net/LowEel/zardoz
cd zardoz
go build
```

Starting:

Zardoz has no configuration file; it is configured entirely through environment variables.

In a Dockerfile, this maps to:

```dockerfile
ENV REVERSEURL http://10.0.1.1:3000
ENV PROXYPORT :17000
ENV TRIGGER 0.6
ENV SENIORITY 1025
ENV DEBUG false
ENV DUMPFILE /somewhere/bayes.txt
ENV REFRESHTIME 24h
```

Using a bash script, this means something like:

```shell
export REVERSEURL=http://10.0.1.1:3000
export PROXYPORT=":17000"
export TRIGGER="0.6"
export SENIORITY="1025"
export DEBUG="true"
export DUMPFILE="/somewhere/bayes.txt"
export REFRESHTIME="24h"
./zardoz
```

Understanding Configuration:

REVERSEURL is the server Zardoz will act as a reverse proxy for. This maps to the IP and port of the server you want to protect.

PROXYPORT is the IP and port where Zardoz will listen. If you want Zardoz to listen on all interfaces, just write something like ":17000", meaning it will listen on all interfaces at port 17000.
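
Together, these two settings describe a standard Go reverse proxy. A minimal stand-in using only the standard library might look like this (the addresses are the example values from above; this is a sketch, not Zardoz's handler code):

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// newProxy builds a reverse proxy forwarding every request to target,
// which is the role REVERSEURL plays for Zardoz.
func newProxy(target string) (*httputil.ReverseProxy, error) {
	u, err := url.Parse(target)
	if err != nil {
		return nil, err
	}
	return httputil.NewSingleHostReverseProxy(u), nil
}

func main() {
	proxy, err := newProxy("http://10.0.1.1:3000") // REVERSEURL
	if err != nil {
		log.Fatal(err)
	}
	// PROXYPORT ":17000" binds port 17000 on all interfaces.
	log.Fatal(http.ListenAndServe(":17000", proxy))
}
```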

TRIGGER: this is one of the trickiest parts. We can describe the behavior of Zardoz in quadrants, like:

|                            | BAD > GOOD  | BAD < GOOD |
|----------------------------|-------------|------------|
| \|GOOD - BAD\| > TRIGGER   | BLOCK       | PASS       |
| \|GOOD - BAD\| <= TRIGGER  | BLOCK+LEARN | PASS+LEARN |

The value of TRIGGER can range from 0 to 1, like "0.5" or "0.6". The difference between BLOCK without learning and BLOCK with learning is execution time. From the user's point of view nothing changes (the request is blocked either way), but in the BLOCK+LEARN case the machine also tries to learn the lesson.

Basically, if the GOOD and BAD scores are very far apart, the likelihood is very high, so BLOCK and PASS are applied strictly.

If the likelihood is less than TRIGGER, we aren't sure the prediction is good, so Zardoz still executes the PASS or BLOCK, but it also waits for the backend's response and learns from it. To summarize, the concept is "likelihood", which makes the difference between an action and the same action + LEARN.
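
The quadrant table above boils down to two independent decisions. A minimal sketch of that logic, assuming normalized GOOD/BAD scores (this is a reconstruction of the documented behavior, not Zardoz's actual code):

```go
package main

import (
	"fmt"
	"math"
)

// decide implements the quadrant table: the action (block or pass) comes
// from whichever score is higher, and learning is enabled whenever the
// scores are too close together, i.e. |good - bad| <= trigger.
func decide(good, bad, trigger float64) (block, learn bool) {
	block = bad > good
	learn = math.Abs(good-bad) <= trigger
	return block, learn
}

func main() {
	// With TRIGGER=0.6: scores 0.9 vs 0.1 are far apart, so act strictly.
	fmt.Println(decide(0.9, 0.1, 0.6)) // confident: PASS, no learning
	// Scores 0.4 vs 0.6 are too close: act, but also learn from the response.
	fmt.Println(decide(0.4, 0.6, 0.6)) // uncertain: BLOCK+LEARN
}
```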

Personally, I've had good results with the trigger at 0.6: it doesn't disturb users much, and at the same time it has filtered tons of malicious scans.

SENIORITY: since Zardoz learns what is good for your web server, it takes time to gain seniority. Starting Zardoz empty and letting it decide immediately would generate terrible behavior, because of false positives and false negatives. Besides, at the beginning Zardoz is supposed to ALWAYS learn.

The "SENIORITY" parameter is the number of requests Zardoz will keep in "PASS+LEARN" mode before filtering starts. During this time, it learns from real traffic and blocks no traffic until the seniority threshold is reached. If you set it to 1025, it will learn from 1025 requests and only then start actually filtering. The right number depends on many factors: if you serve a lot of pages and a lot of content, I suggest increasing it.
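
The seniority gate itself is just a request counter. A tiny illustrative sketch (names are made up, not Zardoz's code):

```go
package main

import "fmt"

// gate counts requests and keeps the proxy in PASS+LEARN mode until the
// configured SENIORITY threshold has been reached.
type gate struct {
	seen      int
	seniority int
}

// mature counts this request and reports whether filtering should be active.
func (g *gate) mature() bool {
	g.seen++
	return g.seen > g.seniority
}

func main() {
	g := &gate{seniority: 3}
	for i := 0; i < 5; i++ {
		// The first 3 requests are PASS+LEARN only; then filtering starts.
		fmt.Println(g.mature())
	}
}
```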

DUMPFILE

This is where you want the dumpfile to be saved. Useful with Docker volumes.

REFRESHTIME

Interval at which spurious records are refreshed. Some strings end up classified both as good and bad ("meh"). To optimize and keep the daemon small, they are cleaned from time to time. REFRESHTIME is the time between cleanings, expressed in Go time.Duration syntax, like "24h" or "12h30m" (note that Go's duration syntax has no days unit, so "1d" is not valid).

TROUBLESHOOTING:

If DEBUG is set to "false" or not set, Zardoz will periodically dump the sparse matrix describing the whole Bayesian learning state into a file named bayes.json. This contains the weighted matrix of calls and classes. If Zardoz is not behaving as you expected, you may take a look at this file; the format is a classic sparse matrix. WARNING: this file may contain cookies or other sensitive headers.

DEBUG: if set to "true", Zardoz will create a "logs" folder and log what happens, together with the dump of the sparse matrix. If set to "false" or not set, only the sparse matrix will be available on disk for post-mortem analysis.

CREDIT

Credits for the Bayesian implementation go to Jake Brukhman: https://github.com/jbrukh/bayesian

TODO:

  • Loading Bayesian data from file.
  • Better logging.
  • Configurable block message.
  • Usage statistics/metrics sent to InfluxDB/Prometheus/whatever.