Block all outgoing traffic except Tor.

While doing some software development recently I wanted to be certain that the program was only making outbound connections over Tor and not leaking in unexpected ways, e.g. via DNS. Google mostly failed me, so I rolled my own.

Cutting to the chase, here’s the solution I came up with for Ubuntu:


sudo ufw reset
sudo ufw allow out 9001/tcp
sudo ufw allow out 9030/tcp
sudo ufw deny out 1:65535/tcp
sudo ufw deny out 1:65535/udp
sudo ufw enable

Usage:
Save this to ufw_onlytor.sh. Run it, and all outbound TCP/UDP connections will be blocked except ports 9001 and 9030, which are standard Tor ports. To disable, run sudo ufw disable. If you have an existing ufw firewall setup, you should back it up first.
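
To sanity-check that the rules took effect, you can inspect the active rule list (a rough check; ufw's output format varies a bit by version):

sudo ufw status verbose    # should list ALLOW OUT for 9001/tcp and 9030/tcp, followed by the two DENY OUT rules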

Notes:

  • Tor also uses ports 80 and 443. I left these blocked because they are common "leakage" ports. It might take longer for your Tor node to find peers at startup when these ports are not available.
  • Your client connection to Tor's SOCKS proxy normally takes place over localhost:9050 and is unaffected by these firewall rules (see the quick check below).
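
A quick way to confirm that application traffic still flows through the SOCKS proxy while direct connections stay blocked (a hedged sketch; assumes Tor is running locally and curl is installed):

curl --socks5-hostname 127.0.0.1:9050 https://check.torproject.org/api/ip   # via Tor: should succeed and report IsTor
curl --max-time 10 https://check.torproject.org/api/ip                      # direct: should fail, blocked by the firewall
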
Posted in post

SQL select to find UTF-8 chars (non-ASCII) in a PostgreSQL database.

I wanted to quickly find strings in a database with UTF-8 chars in them to ensure they are being input and displayed properly.

I came up with the following two statements that seem to work equivalently:

select col from tablename where col ~ '[^[:ascii:]]';

and

select col from tablename where col ~ '[\x80-\x10FFFF]';

Similar variations will likely work in other database systems as well.
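
If you just want a quick count from the shell, something like the following should work (database, table, and column names here are placeholders):

psql -d mydb -c "select count(*) from tablename where col ~ '[^[:ascii:]]';"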

Posted in post

Apache Pretty URLs with ?foo=bar parameters.

I recently had the problem that I needed to have pretty URLs with GET parameters of the form:

http://my.site.com/article/article1?page=3

http://my.site.com/article/article2?page=5

Where article1 and article2 would each map to the same underlying PHP script, processor.php, and pass it parameters ?article=<id>&page=<pgnum>. I found many postings on the net demonstrating how to achieve pretty URLs in Apache configuration like so:


RewriteRule ^/article/(.*)/?$ %{DOCUMENT_ROOT}/processor.php?article=$1

Or similar variations of that.

So I figured that to pass the remainder of the query string something like this should work:


RewriteRule ^/article/(.*)/?\??(.*)$ %{DOCUMENT_ROOT}/processor.php?article=$1&$2

Where we would append everything after ? in the original query string and place it after & in our new one.

But alas, no.

mod_rewrite wanted nothing to do with that. No matter what regex I tried, my ?page=3 argument was not getting through to my PHP script.

Fortunately, mod_rewrite does offer a solution. Just append [QSA]. Like so:


RewriteRule ^/article/(.*)/? %{DOCUMENT_ROOT}/processor.php?article=$1 [QSA]

The mod_rewrite docs have this to say on the matter:

Modifying the Query String

By default, the query string is passed through unchanged. You can, however, create URLs in the substitution string containing a query string part. Simply use a question mark inside the substitution string to indicate that the following text should be re-injected into the query string. When you want to erase an existing query string, end the substitution string with just a question mark. To combine new and old query strings, use the [QSA] flag.

So by including a ? in my original substitution rule I was effectively erasing and replacing the existing query string. But QSA brings back the other parameters, and my problem is solved.
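
To confirm the fix, a quick test from the shell (with a hypothetical processor.php that just dumps $_GET) might look like this:

curl -s "http://my.site.com/article/article1?page=3"
# processor.php should now receive both article=article1 and page=3 in $_GET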

Posted in post

automate mysql audit tables with cdc_audit. auto generate, populate, and sync.

Audit tables in a database are useful for several purposes.

For data warehousing purposes, we are primarily interested in the change data capture (CDC) aspect. Setup of a trigger-based approach to CDC can be automated for any or all tables and avoids the need to change any existing tables or the applications that depend on them. This approach also enables us to efficiently identify only the rows that have changed since our last ETL sync, and it captures deleted rows, something that can be tricky to do with timestamps in the original table.

Introducing cdc_audit

cdc_audit is a little tool I wrote that presently contains two scripts:

  • cdc_audit_gen_mysql.php : auto-generates audit tables and triggers for all or selected tables in a DB.
  • cdc_audit_sync_mysql.php : syncs diffs in audit table(s) to target CSV file(s) that can be used in Hadoop, map-r, etc.

A nice feature of cdc_audit is that it simplifies dealing with a pesky MySQL trigger limitation. In MySQL, only one trigger may be defined at a time for a given table + event, e.g. only one AFTER INSERT may be defined. So if we already have a trigger defined for a table that we want to audit, we have a problem. cdc_audit solves the problem by automatically incorporating the old trigger statement into the new trigger, so the existing trigger logic is preserved, but we also get the new logic to insert changed rows into the audit table.
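
For illustration only, here is the general shape of such an audit trigger (this is not the exact DDL that cdc_audit generates; the customer table, audit columns, and database name are made up):

mysql -u root mydb <<'SQL'
-- assumes a source table customer(id, name) and an audit table customer_audit
DROP TRIGGER IF EXISTS customer_after_update;
DELIMITER //
CREATE TRIGGER customer_after_update AFTER UPDATE ON customer
FOR EACH ROW
BEGIN
  -- any pre-existing AFTER UPDATE logic would be merged in here
  INSERT INTO customer_audit (id, name, audit_event, audit_ts)
  VALUES (NEW.id, NEW.name, 'update', NOW());
END//
DELIMITER ;
SQL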

Another feature is incremental audit table wipe. There are a few audit table patterns, and the pattern used by cdc_audit is performant and simple but can use a lot of disk space because each changed row is duplicated for every update to the source table, even if only a single column changed. For CDC purposes, our primary goal is to identify and sync the changed rows efficiently; we may not care much or at all about the audit history for any purposes beyond that. And for large tables, we do care about disk space. Incremental audit table wipe is an optional feature that can delete all but the last row in the audit table. Keeping the last row enables the sync process to continue working. Rows are deleted incrementally, 100 rows at a time, to avoid putting too much load on the database and preventing concurrent sessions from writing to the audit table.

Features

  • Automates generation of audit tables, one audit table per source table.
  • Automates generation of triggers to populate the audit tables.
  • Automates syncing of new rows in audit tables to .csv files.
  • Fast triggers: only one INSERT and zero SELECTs per trigger execution.
  • Reads the MySQL information_schema to automatically determine tables and columns.
  • Can generate tables + triggers for all database tables, or a specified list.
  • Can sync audit tables for all database tables, or a specified list.
  • Retains pre-existing trigger logic, if any, when generating AFTER triggers.
  • Sync script option to delete all but the last audit row, to keep the source DB small.
Posted in data warehousing, mysql, php, post, tools

HOWTO: Setting up shard-query framework environment


1 How to set up a Shard-Query environment for scale-out, partition-based parallel database systems
1.1 INSTALLING SHARD QUERY
1.2 CHECKING ENVIRONMENT
1.3 PERFORMING A TEST

How to set up a Shard-Query environment for scale-out, partition-based parallel database systems

ShardQuery is a PHP class which is intended to make working with a partitioned dataset easier. It offers an easy-to-use PHP class interface for querying the dataset.
Code here: http://code.google.com/p/shard-query/

Shard-Query definition from the author:

What is it?  ShardQuery is a PHP class which is intended to make working with a partitioned dataset easier. It offers an easy to use PHP class interface for querying the dataset. Also included is run_query.php, which is an example application for the ShardQuery class.  Shard-Query can be used transparently, and includes an experimental LUA script for MySQL proxy.
Key features:
  • ParallelPipelining: an MPP distributed query engine runs fragments of queries in parallel, combining the results at the end. Like map/reduce, except it speaks SQL directly.
  • Access many shards in parallel.
  • QueryRouting: sends queries only to the shard containing the requested data.
  • ConditionPushdown: aggregation, joins and filtering are always performed at the shard level, which fully distributes the work.
  • Gearman workers: PHP is not threaded, so Gearman (Net_Gearman) is leveraged instead.
  • Parallel loader and pre-splitter: can load or preprocess delimited files with ease. The overhead of looking up/creating new shard keys is spread over coprocesses.

1. INSTALLING SHARD QUERY

1) Check your PHP configuration

** PHP 5.3.3
[root@localhost php-5.3.3]# cat config.nice | grep -v \#
'./configure' \
'--enable-json' \
'--with-pdo-mysql=mysqlnd' \
'--with-mysqli=mysqlnd' \
'--with-mysql=mysqlnd' \
'--enable-sockets' \
'--enable-pcntl' \
"$@"

2) Install Net_Gearman libs

Net_Gearman: a Net_Gearman-based PHP worker script is included, as well as a function which requests a result set from each shard.

yum install php-pear
pear install channel://pear.php.net/Net_Gearman-0.2.3
pecl install json

3) Download/install SHARD QUERY

Check out the source from http://code.google.com/p/shard-query/source/checkout. You should copy the include/ and Net/ directories into your PHP include path; Net_Gearman will attempt to autoload the files in the Net/Gearman/Job folder. So, in your include path, run this:

svn co http://shard-query.googlecode.com/svn/trunk/include/ <php include path>/include/

4) Install GEARMAN SERVER:

To install, first set up the Yum repository on your system, then install the gearmand daemon and the PECL extension:

rpm -Uvh http://repo.webtatic.com/yum/centos/5/latest.rpm
yum install gearmand php-pecl-gearman --enablerepo=webtatic

Remember to add the Net_Gearman lib and the shard-query include/ folder to your PHP include path. If you can't edit your php.ini, add this to the script: set_include_path('.:/usr/share/pear');
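
To double-check what the PHP CLI will actually search (a small sketch; adjust the path for your setup):

php -r 'echo get_include_path() . PHP_EOL;'             # show the current include path
php -d include_path=".:/usr/share/pear" run_query.php   # or override it for a single invocation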

2. CHECKING ENVIRONMENT

1) Start services

[root@jamesjara ~]# /etc/init.d/mysqld start
[root@jamesjara ~]# /etc/init.d/gearmand start

Default ports: gearmand 0.0.0.0:4730, mysqld 0.0.0.0:3306
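
To confirm both daemons are actually listening (a quick sketch; netstat may require the net-tools package):

netstat -tlnp | grep -E ':4730|:3306'    # expect gearmand on 4730 and mysqld on 3306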

2) Start workers

About the workers: Gearman::Worker is a worker class for the Gearman distributed job system, providing a framework for receiving and serving jobs from a Gearman server. To debug errors, check workers/worker.log. Common problems: the Gearman library is not visible to the worker PHP script; Gearman must be installed as a PEAR lib, and the shard-query gearman scripts must be present in the workers directory. Or, more simply, add Net_Gearman and the shard-query include folder to your PHP include path.

./start_workers 1    # the argument is the number of workers
Note: check workers.log if there is any problem.

3) Execute a query test

_example1]$ ./run_query.php

If you see an "Unknown database" error, that's okay; it just means you have not created any DB yet.


3. PERFORMING A TEST

FILTER: day 2000-01-01

Please remember to configure your shards ini file with your own database settings.

database=localhost shard1=ontime shard2=ontime2

1. Run the script populate_db.php to populate the DB with random data. populate_db will add, for each year (2000 to 2012), all months, and for each month all 31 days: about 4320 rows. The script also creates the schema and tables, and will output the DB shard data for the shards ini config.
2. Manually execute the following queries to create duplicate data.

USE ontime;
insert into ontime values('ontime.james','2000-01-01',0);
insert into ontime values('ontime.james','2000-01-01',0);
insert into ontime values('ontime.james','2000-01-01',0);
insert into ontime values('ontime.james','2000-01-01',0);
insert into ontime values('ontime.uniquevalues.jamesjara','2000-01-01',0);
insert into ontime values('ontime2.james','2000-01-01',0);  # in this case ontime2.james is NOT in ontime2; it's just to test duplicates in different shards

DATA IN THE TABLE ONTIME FROM SHARD1
use ontime;
select scheme,FlightDate,number from ontime where FlightDate BETWEEN '2000-01-01' and '2000-01-01';
+---------------------+------------+--------+
| scheme              | FlightDate | number |
+---------------------+------------+--------+
| ontime.james        | 2000-01-01 |      0 |  DUPLICATE IN THIS SHARD
| ontime.james        | 2000-01-01 |      0 |  DUPLICATE IN THIS SHARD, 2 times
| ontime.james        | 2000-01-01 |      0 |  DUPLICATE IN THIS SHARD, 3 times
| ontime.james        | 2000-01-01 |      0 |  DUPLICATE IN THIS SHARD, 4 times
| ontime.uniquevalues | 2000-01-01 |      0 |  UNIQUE ROW
| ontime2.james       | 2000-01-01 |      0 |  DUPLICATED IN SHARD2
+---------------------+------------+--------+

DATA IN THE TABLE ONTIME FROM SHARD2
use ontime2;
select scheme,FlightDate,number from ontime where FlightDate BETWEEN '2000-01-01' and '2000-01-01';
+---------------+------------+--------+
| scheme        | FlightDate | number |
+---------------+------------+--------+
| ontime2.james | 2000-01-01 |      0 |  DUPLICATED IN SHARD1
+---------------+------------+--------+

As you can see, we now have duplicate rows within shard1 as well as rows spread across the different shards, so let's run some queries.

1. RUN

[websites@james _example3]$ ./run_query.php

2. SQ INPUT QUERY:

select scheme,FlightDate,number from ontime where FlightDate BETWEEN '2000-01-01' and '2000-01-01';
The SQ input query is in the file test1.sql.

3. OUTPUT/RESULTS:

To see the output, open the file output.html; remember to re-run run_query.php to update the file.

$ firefox `pwd`/output.html
or, for a remote host:
lynx -print output.html

RESULT:

+---------------------+------------+--------+
| scheme              | FlightDate | number |
+---------------------+------------+--------+
| ontime2.james       | 2000-01-01 |      0 |  DUPLICATED 2 times: 1 time in SHARD1 (ontime) and 1 time in SHARD2 (ontime2)
| ontime.james        | 2000-01-01 |      0 |  DUPLICATED 4 times, only in SHARD1 (ontime)
| ontime.uniquevalues | 2000-01-01 |      0 |  UNIQUE ROW IN SHARD1 (ontime)
+---------------------+------------+--------+

Thanks for reading. If you want more updates, follow me at @jamesjara.

Posted in post

Installing guide for Hadoop, Hive, Derby on Linux CentOS


Hi all. Here at OSC we have been testing different technologies for big data: SQL and NoSQL solutions.

So I have this super simple installation guide; in the future I will talk about each program in more depth.

  • INSTALLING HADOOP ON CentOS 6
  • INSTALLING HIVE ON CentOS 6
  • INSTALLING DERBY ON CentOS 6

hadoop-0.20.203.0rc1

This is the guide for installing the Hadoop ecosystem. It is fairly long, so please follow it step by step.

INSTALLATION OF HADOOP

1. Installing Java
yum install sun-java6-jdk

2. Adding a dedicated user for Hadoop
This will add the user hdoopuser and the group hdoopgroup to your local machine.
/usr/sbin/useradd hdoopuser
groupadd hdoopgroup
usermod -a -G hdoopgroup hdoopuser

3. Configuring SSH
su - hdoopuser        # log in as hdoopuser
ssh-keygen -t rsa -P ""    # generate a key without a passphrase
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys    # enable the new key
chmod 0600 $HOME/.ssh/authorized_keys    # restrict permissions on the authorized_keys file
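
To verify the passwordless login works before continuing (a quick check, assuming sshd is running locally):

ssh localhost 'echo ssh ok'    # should print "ssh ok" without asking for a password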

4. Disabling IPv6
sed -i 's/^\(NETWORKING_IPV6\s*=\s*\).*$/\1no/' /etc/sysconfig/network    # or simply set NETWORKING_IPV6=no by hand

5. Installation/Conf/startup of Hadoop
mkdir /hadoop
chown -R hdoopuser /hadoop
cd /hadoop/
wget http://mirrors.abdicar.com/Apache-HTTP-Server//hadoop/common/stable/hadoop-0.20.203.0rc1.tar.gz
tar -xvzf hadoop-0.20.203.0rc1.tar.gz
ln -s /hadoop/hadoop-0.20.203.0rc1/ /hadoop/hadoop
cd /hadoop/hadoop
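
At this point you can sanity-check the unpacked tree (a small sketch; the reported version should match the tarball you downloaded):

bin/hadoop version    # should report Hadoop 0.20.203.0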

Configuring Hadoop

1)
vim conf/core-site.xml
#Add the following inside the <configuration> tag
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000/</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
2)
vim conf/hdfs-site.xml
#Add the following inside the <configuration> tag
<property>
<name>dfs.name.dir</name>
<value>/hadoop/hdfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/hadoop/hdfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
3)
vim conf/mapred-site.xml
#Add the following inside the <configuration> tag
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
4)
vim conf/hadoop-env.sh
export JAVA_HOME=/opt/jre/
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
5)
Format the nodes
su - hdoopuser
cd /hadoop/hadoop
bin/hadoop namenode -format
6) Start Hadoop
bin/start-all.sh
Notes: HTTP consoles of Hadoop:
http://localhost:50030/ for the JobTracker
http://localhost:50070/ for the NameNode
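
Once the daemons are up, a minimal HDFS smoke test might look like this (run as hdoopuser from /hadoop/hadoop; paths are arbitrary):

bin/hadoop fs -mkdir /smoke-test     # create a test directory in HDFS
bin/hadoop fs -ls /                  # it should show up in the listing
bin/hadoop fs -rmr /smoke-test       # clean up (-rmr is the 0.20-era recursive delete)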

6. Installing, configuring and starting Hive/Derby

cd /hadoop
wget http://mirrors.ucr.ac.cr/apache//hive/stable/hive-0.8.1-bin.tar.gz
tar -xvzf hive-0.8.1-bin.tar.gz
ln -s /hadoop/hive-0.8.1-bin/ /hadoop/hive
export HADOOP_HOME=/hadoop/hadoop/
cd /hadoop/hive
mv conf/hive-default.xml.template conf/hive-site.xml
Testing hive
bin/hive
> show tables;
Installing the Derby metastore
cd /hadoop
wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz
tar -xzf db-derby-10.4.2.0-bin.tar.gz
ln -s db-derby-10.4.2.0-bin derby
mkdir derby/data
export DERBY_INSTALL=/hadoop/derby/
export DERBY_HOME=/hadoop/derby/
export HADOOP=/hadoop/hadoop/bin/hadoop

vim /hadoop/hadoop/bin/start-dfs.sh
#add to the file start-dfs.sh the next 2 lines
cd /hadoop/derby/data
nohup /hadoop/derby/bin/startNetworkServer -h 0.0.0.0 &

vim /hadoop/hadoop/bin/start-all.sh
#add to the file start-all.sh the next 2 lines
cd /hadoop/derby/data
nohup /hadoop/derby/bin/startNetworkServer -h 0.0.0.0 &

 

Configuring Hive

Installing the web panel for Hive: search and replace in the config as follows.

vim /hadoop/hive/conf/hive-site.xml
#search for “javax.jdo.option.ConnectionURL” and edit like the following
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
HTTP CONSOLE OF HIVE
bin/hive --service hwi &
URL: http://localhost:9999/
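
Before firing up Hive you can also check that the Derby network server is reachable (a sketch; nc comes from the nc/nmap-ncat package):

nc -z localhost 1527 && echo "Derby metastore reachable"    # port 1527 matches the JDBC URL above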

#create new file
vim /hadoop/hive/conf/jpox.properties
#add the following
javax.jdo.PersistenceManagerFactoryClass=org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema=false
org.jpox.validateTables=false
org.jpox.validateColumns=false
org.jpox.validateConstraints=false
org.jpox.storeManagerType=rdbms
org.jpox.autoCreateSchema=true
org.jpox.autoStartMechanismMode=checked
org.jpox.transactionIsolation=read_committed
javax.jdo.option.DetachAllOnCommit=true
javax.jdo.option.NontransactionalRead=true
javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL=jdbc:derby://localhost:1527/metastore_db;create=true
javax.jdo.option.ConnectionUserName=APP
javax.jdo.option.ConnectionPassword=mine
#now copy derby jar sources to Hive lib
cp /hadoop/derby/lib/derbyclient.jar /hadoop/hive/lib
cp /hadoop/derby/lib/derbytools.jar  /hadoop/hive/lib

HTTP CONSOLE OF HIVE
http://localhost:9999/hwi/ for the hive

7. START CLUSTER

/hadoop/hadoop/bin/start-all.sh
/hadoop/hive/bin/hive --service hwi &   #hwi=webpanel

8. FOR NEXT TIME AND ALWAYS: create a bash profile

vi /etc/profile
export JAVA_HOME=/opt/jre/
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export HADOOP_HOME=/hadoop/hadoop/
export DERBY_INSTALL=/hadoop/derby/
export DERBY_HOME=/hadoop/derby/
export HADOOP=/hadoop/hadoop/bin/hadoop

 

Running the whole Hadoop + Hive ecosystem

PANELS:
http://localhost:50030/ for the JobTracker
http://localhost:50060/ for the TaskTracker
http://localhost:50070/ for the NameNode
http://localhost:9999/hwi/ for Hive

 

Thanks for reading, @jamesjara, O.S.C.

Posted in post

Mysql Cluster Guide

What is mysql cluster?

Web definition:
  • MySQL Cluster is a technology which provides shared-nothing clustering capabilities for the MySQL database management system. It was first included in the production release of MySQL 4.1 in November 2004. … en.wikipedia.org/wiki/MySQL_Cluster

 

[Figure: MySQL Cluster flow]

This guide assumes that you already know about MySQL Cluster, at least how it works, so I will give you some very useful tips about it.

1) How to convert an existing schema to MySQL Cluster

For example, how to convert an InnoDB schema to a MySQL Cluster schema; the MySQL Cluster storage engine is called NDBCLUSTER.

Importing a schema is as simple as changing the InnoDB engine to the NDBCLUSTER engine.
cat schema.james.sql | sed 's/InnoDB/NDBCLUSTER/gi' > ndb_schema.james.sql    # migrating the DDL is as simple as switching the table engine to NDBCLUSTER
cat schema.jamesjara.sql | sed 's/InnoDB/NDBCLUSTER/gi' > ndb_schema.jamesjara.sql

mysql -u root -e 'create database james'
mysql -u root -e 'create database jamesjara'
mysql -u root  james < ndb_schema.james.sql
mysql -u root  jamesjara < ndb_schema.jamesjara.sql
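
To verify that the imported tables actually landed on the NDB engine (a quick check; assumes this mysqld is connected to the cluster):

mysql -u root james -e "SELECT table_name, engine FROM information_schema.tables WHERE table_schema='james';"    # engine should read ndbcluster for each table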

2) Scaling across multiple machines with MySQL Cluster

First, I recommend reading the following to understand data distribution 100%.

The data and indexes must fit into memory. Each cluster has node groups; each node group holds a fragment of the data, and each node group has a number of replicas. So a cluster with 2 node groups of 4 replicas each has the data split in half, with 4 machines redundantly storing one half of the data and another 4 redundantly storing the other half. Therefore, the data doesn't have to fit entirely into the RAM of one node, but it does have to fit in (RAM of 1 node) * (number of node groups). For example, with 16 GB of RAM per data node and 2 node groups, roughly 32 GB of data can be stored, before accounting for indexes and overhead.

(From the blog) Tables are 'horizontally fragmented' into table fragments, each containing a disjoint subset of the rows of the table. The union of rows in all table fragments is the set of rows in the table. Rows are always identified by their primary key. Tables with no primary key are given a hidden primary key by MySQLD.

3) How to get the database status of a MySQL Cluster ecosystem

DATABASE STATUS

  • TABLES MEMORY    :    SHOW TABLE STATUS LIKE '%';    -- in mysqld
  • NODES MEMORY    :    ndb_mgm -e 'ALL REPORT MEMORYUSAGE'    # in ndb_mgm

To get the exact status of the MySQL Cluster tables, we use:

SELECT  TABLE_NAME,ROW_FORMAT,TABLE_ROWS, AVG_ROW_LENGTH, DATA_LENGTH,MAX_DATA_LENGTH, INDEX_LENGTH,DATA_FREE, AUTO_INCREMENT  FROM information_schema.tables WHERE table_schema = DATABASE();

Also, with the command ALL REPORT MEMORYUSAGE, we can check how the data is spread.

4) Benchmarking in MySQL Cluster

Executing a performance test, or any other kind of test, is as simple as:

10 threads, each thread looping 10000 times, with 1000 rows in table visitor:
Command: bencher -s/var/lib/mysql/mysql.sock -t10 -l10000 -d visitor -q "select * from visitor"

Or something more complex:

10 threads, each thread looping 1000 times, with 1000 rows in table visitor:
Command: bencher -s/var/lib/mysql/mysql.sock -t10 -l1000 -d visitor -q "Select c.* from visitor_a c join visitor a on c.count_request = a.count_request join visitor_accesso b on c.count_requested = b.request_dura_secs where a.count_reest between 0 and 15000 and b.http between 777 and 7777"

Well, that's all for now. These are only a few topics, but there is more; in my next post I will talk about:

  • auto-sharding, rebalancing, internal programs of MySQL Cluster, nice & useful commands, and more.

Thanks for reading,

@JamesJara

Posted in post