Pythian Group

Love Your Data

Defining Digital Transformation

Wed, 2016-05-11 15:33


Terminology is important—and it’s particularly important for us to define terms that are central to what we do. So when it comes to the subject of digital transformation, what exactly are we talking about?


In speaking with clients and industry thought leaders, I’ve come to realize that the term “digital transformation” has a different meaning to different people. For a term that is so widely used — and that, on its surface, seems pretty straightforward — the range of interpretation is remarkable. It’s a bit like when we say “I’ll do it later.”  “Later” to one person means “before the sun goes down today.” “Later” to another person means “sometime in the future”, and it could mean days or weeks in their mind. “Later” to a third person can mean “I have no plans to do it, and this is my way of telling you nicely.”


Because the term is so essential to the work we do for our clients, I thought it would be helpful to define what digital transformation means to us here at Pythian. There’s so much we can say on the topic, so I plan to follow up with a series of articles about how I’ve seen it implemented, or worse, not implemented or even embraced as a concept.


To start, “digital transformation” is about technology. I know that to some people it isn’t, but I disagree. These days, you can’t transform your business without technology. It’s not about which technology you choose, as much as it’s about how to use it. Even more specifically, we’ve found that the businesses that are achieving positive transformation are using technology to capitalize on data. I have yet to see a single transformation project that didn’t use data as a major component of its success.


Let’s look at the term “transformation.” This equates to change, but it doesn’t mean change for its own sake. The change we’re talking about has to benefit the business. However, the factor that can make or break successful change is people. Their attitudes, preconceptions, and ideas almost always have to be aligned with the change for successful transformation to occur. People need to get behind the initiative, people have to fund it, people have to develop it, and people have to support it once it’s developed. And we all know that getting people to change can be more difficult than developing any new technology. In short, the transformative capabilities inherent in technology can only be realized when coupled with the willingness to embrace change.

Why Digital Transformation?

Why is the concept of digital transformation important in the first place? At Pythian, we believe that it’s about using technology and data to change your business for the better. What do we mean when we say “for the better”? Therein lies the controversy.  “For the better” means different things to different people depending on their company’s key objectives.


“For the better” can mean:

  • Becoming more efficient to drive costs down so your profitability can improve
  • Reducing mistakes and improving your reputation, or the quality of your product
  • Differentiating your product to get ahead of the competition
  • Doing what you do, only faster than your competitors
  • Creating new revenue streams
  • Improving the customer experience. This is a big one, so I will dedicate an entire blog post to exploring exactly what it means.


Digital transformation is the key to achieving any one, or all of these benefits, and knowing your objectives and priorities will help you shape your digital transformation initiative. So to start, focus less on what digital transformation is, and more on what you want the outcome of a transformation to be.


Categories: DBA Blogs

GoldenGate 12.2 Big Data Adapters: part 4 – HBASE

Wed, 2016-05-11 11:51

This is the next post in my series about Oracle GoldenGate Big Data adapters. Here is a list of all posts in the series:

  1. GoldenGate 12.2 Big Data Adapters: part 1 – HDFS
  2. GoldenGate 12.2 Big Data Adapters: part 2 – Flume
  3. GoldenGate 12.2 Big Data Adapters: part 3 – Kafka
  4. GoldenGate 12.2 Big Data Adapters: part 4 – HBASE

In this post I am going to explore the HBase adapter for GoldenGate. Let’s start by recalling what we know about HBase. Apache HBase is a non-relational, distributed database modelled after Google’s Bigtable. It provides read and write access to data and runs on top of Hadoop and HDFS.

So, what does that tell us? First, we can write and change the data. Second, we need to remember that it is a non-relational database, which is a somewhat different approach to data compared with traditional relational databases. You can think of HBase as a key-value store. We are not going deep into HBase architecture and internals here, since our main task is to test the Oracle GoldenGate adapter and see how it works. Our configuration has an Oracle database as the source, with a GoldenGate extract, and a target system where we have Oracle GoldenGate for Big Data.

There is more information about setting up the source and target in the first post in the series, about the HDFS adapter. The source-side replication has already been configured and started. We have an initial trail file for data initialization and trails for the ongoing replication. We capture changes for all tables in the ggtest schema on the Oracle database.
Now we need to prepare our target site. Let’s start with HBase. I used pseudo-distributed mode for my tests, where a fully-distributed mode runs on a single host. That is not acceptable for any production configuration, but it will suffice for our tests. On the same box I have HDFS to serve as the main storage. The Oracle documentation for the adapter states that HBase is supported from version 1.0.x. In my first attempt I tried to use HBase version 1.0.0 (Cloudera 5.6), but it didn’t work: I got errors in GoldenGate and my replicat abended.
Here is the error:

2016-03-29 11:51:31  ERROR   OGG-15051  Oracle GoldenGate Delivery, irhbase.prm:  Java or JNI exception:
java.lang.NoSuchMethodError: org.apache.hadoop.hbase.HTableDescriptor.addFamily(Lorg/apache/hadoop/hbase/HColumnDescriptor;)Lorg/apache/hadoop/hbase/HTableDescriptor;.
2016-03-29 11:51:31  ERROR   OGG-01668  Oracle GoldenGate Delivery, irhbase.prm:  PROCESS ABENDING.

So, I installed another version of HBase, and version 1.1.4 worked just fine. I used a simple, standard HBase configuration for pseudo-distributed mode, where the region server is on the same host as the master and hbase.rootdir points to the local HDFS.
Here is an example of the configuration:

<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:8020/user/oracle/hbase</value>
  </property>
</configuration>
[root@sandbox conf]# cat regionservers
localhost
[root@sandbox conf]#

As soon as we have HBase set up and running, we can switch our attention to GoldenGate. We already have a trail file with the initial load. Now we need to prepare our configuration files for the initial and ongoing replication. Let’s go to our GoldenGate for Big Data home directory and prepare everything. First, we need the hbase.props file copied from the $OGG_HOME/AdapterExamples/big-data/hbase directory to $OGG_HOME/dirprm. I left everything as it was in the original file, changing only the gg.classpath parameter to point to my configuration files and libraries for HBase.
Here is the resulting configuration file:

[oracle@sandbox oggbd]$ cat dirprm/hbase.props

gg.handlerlist=hbase

gg.handler.hbase.type=hbase
gg.handler.hbase.hBaseColumnFamilyName=cf
gg.handler.hbase.keyValueDelimiter=CDATA[=]
gg.handler.hbase.keyValuePairDelimiter=CDATA[,]
gg.handler.hbase.encoding=UTF-8
gg.handler.hbase.pkUpdateHandling=abend
gg.handler.hbase.nullValueRepresentation=CDATA[NULL]
gg.handler.hbase.authType=none
gg.handler.hbase.includeTokens=false

gg.handler.hbase.mode=tx

goldengate.userexit.timestamp=utc
goldengate.userexit.writers=javawriter
javawriter.stats.display=TRUE
javawriter.stats.full=TRUE

gg.log=log4j
gg.log.level=INFO

gg.report.time=30sec

gg.classpath=/u01/hbase/lib/*:/u01/hbase/conf:/usr/lib/hadoop/client/*

javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar

Second, we have to prepare a parameter file for our initial load. I used a simple file with the minimum of parameters.

[oracle@sandbox oggbd]$ cat dirprm/irhbase.prm
-- passive REPLICAT irhbase
-- Trail file for this example is located in "./dirdat/initld" file
-- Command to add REPLICAT
-- run replicat irhbase:
-- ./replicat paramfile dirprm/irhbase.prm reportfile dirrpt/irhbase.rpt
SPECIALRUN
END RUNTIME
EXTFILE /u01/oggbd/dirdat/initld
TARGETDB LIBFILE libggjava.so SET property=dirprm/hbase.props
REPORTCOUNT EVERY 1 MINUTES, RATE
GROUPTRANSOPS 10000
MAP GGTEST.*, TARGET BDTEST.*;

With that configuration file in place we can run the replicat in passive mode from the command line and see the result.
Here is the initial state of HBase:

hbase(main):001:0> version
1.1.4, r14c0e77956f9bb4c6edf0378474264843e4a82c3, Wed Mar 16 21:18:26 PDT 2016

hbase(main):001:0> list
TABLE
0 row(s) in 0.3340 seconds

=> []
hbase(main):002:0>

Running the replicat:

[oracle@sandbox oggbd]$ ./replicat paramfile dirprm/irhbase.prm reportfile dirrpt/irhbase.rpt
[oracle@sandbox oggbd]$

Now we have two tables in HBase:

hbase(main):002:0> list
TABLE
BDTEST:TEST_TAB_1
BDTEST:TEST_TAB_2
2 row(s) in 0.3680 seconds

=> ["BDTEST:TEST_TAB_1", "BDTEST:TEST_TAB_2"]
hbase(main):003:0>

Let’s have a look at the structure and contents of the tables:


hbase(main):004:0> describe 'BDTEST:TEST_TAB_1'
Table BDTEST:TEST_TAB_1 is ENABLED
BDTEST:TEST_TAB_1
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MI
N_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.2090 seconds

hbase(main):005:0> scan 'BDTEST:TEST_TAB_1'
ROW                                            COLUMN+CELL
 1                                             column=cf:ACC_DATE, timestamp=1459269153102, value=2014-01-22:12:14:30
 1                                             column=cf:PK_ID, timestamp=1459269153102, value=1
 1                                             column=cf:RND_STR, timestamp=1459269153102, value=371O62FX
 1                                             column=cf:RND_STR_1, timestamp=1459269153102, value=RJ68QYM5
 1                                             column=cf:USE_DATE, timestamp=1459269153102, value=2014-01-24:19:09:20
 2                                             column=cf:ACC_DATE, timestamp=1459269153102, value=2014-05-11:05:23:23
 2                                             column=cf:PK_ID, timestamp=1459269153102, value=2
 2                                             column=cf:RND_STR, timestamp=1459269153102, value=371O62FX
 2                                             column=cf:RND_STR_1, timestamp=1459269153102, value=HW82LI73
 2                                             column=cf:USE_DATE, timestamp=1459269153102, value=2014-01-24:19:09:20
 3                                             column=cf:ACC_DATE, timestamp=1459269153102, value=2014-01-22:12:14:30
 3                                             column=cf:PK_ID, timestamp=1459269153102, value=3
 3                                             column=cf:RND_STR, timestamp=1459269153102, value=RXZT5VUN
 3                                             column=cf:RND_STR_1, timestamp=1459269153102, value=RJ68QYM5
 3                                             column=cf:USE_DATE, timestamp=1459269153102, value=2013-09-04:23:32:56
 4                                             column=cf:ACC_DATE, timestamp=1459269153102, value=2014-05-11:05:23:23
 4                                             column=cf:PK_ID, timestamp=1459269153102, value=4
 4                                             column=cf:RND_STR, timestamp=1459269153102, value=RXZT5VUN
 4                                             column=cf:RND_STR_1, timestamp=1459269153102, value=HW82LI73
 4                                             column=cf:USE_DATE, timestamp=1459269153102, value=2013-09-04:23:32:56
4 row(s) in 0.1630 seconds

hbase(main):006:0> scan 'BDTEST:TEST_TAB_2'
ROW                                            COLUMN+CELL
 7|IJWQRO7T|2013-07-07:08:13:52                column=cf:ACC_DATE, timestamp=1459269153132, value=2013-07-07:08:13:52
 7|IJWQRO7T|2013-07-07:08:13:52                column=cf:PK_ID, timestamp=1459269153132, value=7
 7|IJWQRO7T|2013-07-07:08:13:52                column=cf:RND_STR_1, timestamp=1459269153132, value=IJWQRO7T
1 row(s) in 0.0390 seconds

hbase(main):007:0>

Everything looks good to me. We have the structure and records as expected. Let’s move forward and set up the ongoing replication.
I have created a parameter file for my replicat using the initial load parameters as a basis:

[oracle@sandbox oggbd]$ cat dirprm/rhbase.prm
REPLICAT rhbase
-- Trail file for this example is located in "dirdat/or" directory
-- Command to add REPLICAT
-- add replicat rhbase, exttrail dirdat/or
TARGETDB LIBFILE libggjava.so SET property=dirprm/hbase.props
REPORTCOUNT EVERY 1 MINUTES, RATE
GROUPTRANSOPS 10000
MAP ggtest.*, TARGET bdtest.*;

Let’s check our trail files and start the replicat from the latest trail file. By default, a replicat would look for a trail with sequence number 0, but since the purging policy on my GoldenGate deletes old files, I need to tell the replicat exactly where to start.

[oracle@sandbox oggbd]$ ll dirdat/
total 4940
-rw-r-----. 1 oracle oinstall    3028 Feb 16 14:17 initld
-rw-r-----. 1 oracle oinstall 2015199 Mar 24 13:07 or000043
-rw-r-----. 1 oracle oinstall 2015229 Mar 24 13:08 or000044
-rw-r-----. 1 oracle oinstall 1018490 Mar 24 13:09 or000045
[oracle@sandbox oggbd]$ ggsci

Oracle GoldenGate Command Interpreter
Version 12.2.0.1.0 OGGCORE_12.2.0.1.0_PLATFORMS_151101.1925.2
Linux, x64, 64bit (optimized), Generic on Nov 10 2015 16:18:12
Operating system character set identified as UTF-8.

Copyright (C) 1995, 2015, Oracle and/or its affiliates. All rights reserved.



GGSCI (sandbox.localdomain) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING


GGSCI (sandbox.localdomain) 2> add replicat rhbase, exttrail dirdat/or,EXTSEQNO 45
REPLICAT added.


GGSCI (sandbox.localdomain) 3> start replicat rhbase

Sending START request to MANAGER ...
REPLICAT RHBASE starting


GGSCI (sandbox.localdomain) 4> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING
REPLICAT    RUNNING     RHBASE      00:00:00      00:00:06


GGSCI (sandbox.localdomain) 5> info rhbase

REPLICAT   RHBASE    Last Started 2016-03-29 12:56   Status RUNNING
Checkpoint Lag       00:00:00 (updated 00:00:08 ago)
Process ID           27277
Log Read Checkpoint  File dirdat/or000045
                     2016-03-24 13:09:02.000274  RBA 1018490


GGSCI (sandbox.localdomain) 6>

I inserted a number of rows into test_tab_1 on the Oracle side and all of them were successfully replicated to HBase.
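The exact load script is not shown here; a hypothetical generator for rows of this shape (the data types and random-value functions are my assumptions) could look like this:

-- Hypothetical row generator for ggtest.test_tab_1, mimicking the
-- 8-character random strings and dates seen in the scans above
insert into ggtest.test_tab_1 (pk_id, rnd_str, use_date, rnd_str_1, acc_date)
select rownum + 1000,
       dbms_random.string('X', 8),
       sysdate - dbms_random.value(1, 1000),
       dbms_random.string('X', 8),
       sysdate - dbms_random.value(1, 1000)
  from dual
connect by level <= 3000;

commit;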

hbase(main):015:0> count 'BDTEST:TEST_TAB_1'
Current count: 1000, row: 1005694
Current count: 2000, row: 442
Current count: 3000, row: 6333
3473 row(s) in 1.0810 seconds

=> 3473
hbase(main):016:0>

Let’s have a closer look at test_tab_1 and test_tab_2:

hbase(main):005:0> scan 'BDTEST:TEST_TAB_1'
ROW                                            COLUMN+CELL
 1                                             column=cf:ACC_DATE, timestamp=1459269153102, value=2014-01-22:12:14:30
 1                                             column=cf:PK_ID, timestamp=1459269153102, value=1
 1                                             column=cf:RND_STR, timestamp=1459269153102, value=371O62FX
 1                                             column=cf:RND_STR_1, timestamp=1459269153102, value=RJ68QYM5
 1                                             column=cf:USE_DATE, timestamp=1459269153102, value=2014-01-24:19:09:20
 2                                             column=cf:ACC_DATE, timestamp=1459269153102, value=2014-05-11:05:23:23
 2                                             column=cf:PK_ID, timestamp=1459269153102, value=2
 2                                             column=cf:RND_STR, timestamp=1459269153102, value=371O62FX
 2                                             column=cf:RND_STR_1, timestamp=1459269153102, value=HW82LI73
 2                                             column=cf:USE_DATE, timestamp=1459269153102, value=2014-01-24:19:09:20
..............................................

hbase(main):006:0> scan 'BDTEST:TEST_TAB_2'
ROW                                            COLUMN+CELL
 7|IJWQRO7T|2013-07-07:08:13:52                column=cf:ACC_DATE, timestamp=1459269153132, value=2013-07-07:08:13:52
 7|IJWQRO7T|2013-07-07:08:13:52                column=cf:PK_ID, timestamp=1459269153132, value=7
 7|IJWQRO7T|2013-07-07:08:13:52                column=cf:RND_STR_1, timestamp=1459269153132, value=IJWQRO7T
1 row(s) in 0.0390 seconds

hbase(main):007:0>

You can see that the row identifier for test_tab_1 is the value of pk_id, while for test_tab_2 it is a concatenation of the values of all columns. Why is that? The difference is in the constraints on the tables. Since we don’t have a primary key or unique index on test_tab_2, it uses all columns as the key value. We can try to add a constraint and see the result.
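For reference, here is a hypothetical reconstruction of the two source tables (the data types are my assumption, since the post does not show the DDL); the missing key on test_tab_2 is what forces the concatenated row key:

create table ggtest.test_tab_1 (
  pk_id     number primary key,
  rnd_str   varchar2(8),
  use_date  date,
  rnd_str_1 varchar2(8),
  acc_date  date
);

create table ggtest.test_tab_2 (
  pk_id     number,          -- no primary key and no unique index
  rnd_str_1 varchar2(8),
  acc_date  date
);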

select * from dba_constraints where owner='GGTEST' and table_name='TEST_TAB_2';

no rows selected

alter table ggtest.test_tab_2 add constraint pk_test_tab_2 primary key (pk_id);

Table altered.

insert into ggtest.test_tab_2 values(9,'PK_TEST',sysdate,null);

1 row created.

commit;

Commit complete.

orcl>

And let’s compare with the result in HBase:

hbase(main):012:0> scan 'BDTEST:TEST_TAB_2'
ROW                                           COLUMN+CELL
 7|IJWQRO7T|2013-07-07:08:13:52               column=cf:ACC_DATE, timestamp=1459275116849, value=2013-07-07:08:13:52
 7|IJWQRO7T|2013-07-07:08:13:52               column=cf:PK_ID, timestamp=1459275116849, value=7
 7|IJWQRO7T|2013-07-07:08:13:52               column=cf:RND_STR_1, timestamp=1459275116849, value=IJWQRO7T
 8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER   column=cf:ACC_DATE, timestamp=1459278884047, value=2016-03-29:15:14:37
 8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER   column=cf:PK_ID, timestamp=1459278884047, value=8
 8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER   column=cf:RND_STR_1, timestamp=1459278884047, value=TEST_INS1
 8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER   column=cf:TEST_COL, timestamp=1459278884047, value=TEST_ALTER
 9                                            column=cf:ACC_DATE, timestamp=1462473865704, value=2016-05-05:14:44:19
 9                                            column=cf:PK_ID, timestamp=1462473865704, value=9
 9                                            column=cf:RND_STR_1, timestamp=1462473865704, value=PK_TEST
 9                                            column=cf:TEST_COL, timestamp=1462473865704, value=NULL
3 row(s) in 0.0550 seconds

hbase(main):013:0>

The behaviour is fully dynamic: the row identifier column changed on the fly. Will it work with a unique index? Yes, it will:


delete from ggtest.test_tab_2 where pk_id=9;

1 row deleted.

alter table ggtest.test_tab_2 drop constraint pk_test_tab_2;

Table altered.

create unique index ggtest.ux_test_tab_2 on ggtest.test_tab_2 (pk_id);

Index created.

insert into ggtest.test_tab_2 values(10,'UX_TEST',sysdate,null);

1 row created.

commit;

Here is the newly inserted row.

hbase(main):017:0> scan 'BDTEST:TEST_TAB_2'
ROW                                           COLUMN+CELL
 10                                           column=cf:ACC_DATE, timestamp=1462474389145, value=2016-05-05:14:53:03
 10                                           column=cf:PK_ID, timestamp=1462474389145, value=10
 10                                           column=cf:RND_STR_1, timestamp=1462474389145, value=UX_TEST
 10                                           column=cf:TEST_COL, timestamp=1462474389145, value=NULL
 7|IJWQRO7T|2013-07-07:08:13:52               column=cf:ACC_DATE, timestamp=1459275116849, value=2013-07-07:08:13:52
 7|IJWQRO7T|2013-07-07:08:13:52               column=cf:PK_ID, timestamp=1459275116849, value=7

But simply creating a non-unique index on the source will not make any difference; it won’t change anything. So, if we need a key to be identified for a table, we have to have at least a unique constraint or a unique index. Of course, this is just the default behavior for schema replication, and we may use KEYCOLS to identify keys for particular tables.

Interestingly, if we change the table structure it affects all newly inserted rows, but it does not change existing rows, even if we update some of their values. It works this way as long as you have a unique identifier and it was not changed by your DDL operation.
Here is an example. We have a column TEST_COL in the table test_tab_2. Let’s drop the column and update a row. Keep in mind that our key column is PK_ID and we are not modifying the key.

alter table ggtest.test_tab_2 drop column TEST_COL;

Table altered.

update ggtest.test_tab_2 set rnd_str_1='TEST_COL' where pk_id=9;

1 row updated.

commit;

In HBASE we can see the same set of columns:

hbase(main):030:0> scan 'BDTEST:TEST_TAB_2'
ROW                                           COLUMN+CELL
 9                                            column=cf:ACC_DATE, timestamp=1462477581440, value=2016-05-05:15:46:13
 9                                            column=cf:PK_ID, timestamp=1462477794597, value=9
 9                                            column=cf:RND_STR_1, timestamp=1462477794597, value=TEST_COL
 9                                            column=cf:TEST_COL, timestamp=1462477581440, value=NULL
1 row(s) in 0.0200 seconds

We still have the dropped column TEST_COL even though we’ve updated the row.
But if we insert any new row, it will have the new set of columns:

insert into ggtest.test_tab_2 values(10,'TEST_COL',sysdate);

1 row created.

commit;

Commit complete.

And in HBASE:

hbase(main):031:0> scan 'BDTEST:TEST_TAB_2'
ROW                                           COLUMN+CELL
 10                                           column=cf:ACC_DATE, timestamp=1462477860649, value=2016-05-05:15:50:55
 10                                           column=cf:PK_ID, timestamp=1462477860649, value=10
 10                                           column=cf:RND_STR_1, timestamp=1462477860649, value=TEST_COL
 9                                            column=cf:ACC_DATE, timestamp=1462477581440, value=2016-05-05:15:46:13
 9                                            column=cf:PK_ID, timestamp=1462477794597, value=9
 9                                            column=cf:RND_STR_1, timestamp=1462477794597, value=TEST_COL
 9                                            column=cf:TEST_COL, timestamp=1462477581440, value=NULL
2 row(s) in 0.0340 seconds

And, as in all other cases, a truncate on the source table is not replicated to the target; the operation is simply ignored. You have to truncate the table in HBase yourself to keep the data in sync. If you insert data again, the data in HBase will be “updated”, but other rows will not be deleted. It is more like a “merge” operation.
Here is an example:

truncate table ggtest.test_tab_2;

Table truncated.

insert into ggtest.test_tab_2 values(10,'TEST_COL2',sysdate);

1 row created.

commit;

Commit complete.

select * from ggtest.test_tab_2;

	   PK_ID RND_STR_1  ACC_DATE
---------------- ---------- -----------------
	      10 TEST_COL2  05/05/16 16:01:20

orcl>

HBASE:
hbase(main):033:0> scan 'BDTEST:TEST_TAB_2'
ROW                                           COLUMN+CELL
 10                                           column=cf:ACC_DATE, timestamp=1462478485067, value=2016-05-05:16:01:20
 10                                           column=cf:PK_ID, timestamp=1462478485067, value=10
 10                                           column=cf:RND_STR_1, timestamp=1462478485067, value=TEST_COL2
 9                                            column=cf:ACC_DATE, timestamp=1462477581440, value=2016-05-05:15:46:13
 9                                            column=cf:PK_ID, timestamp=1462477794597, value=9
 9                                            column=cf:RND_STR_1, timestamp=1462477794597, value=TEST_COL
 9                                            column=cf:TEST_COL, timestamp=1462477581440, value=NULL
2 row(s) in 0.0300 seconds

hbase(main):034:0>

I spent some time testing performance and found that the main bottleneck was my Oracle source rather than GoldenGate and HBase. I was able to sustain a transaction rate of up to 60 DML operations per second, and my Oracle DB started to struggle to keep pace because of waits on commit. HBase and the replicat were absolutely fine. I also checked how it handles big transactions and inserted about 2 billion rows in one transaction. It worked fine. Of course, this doesn’t prove that your production configuration will be free of performance issues, and to conduct real performance tests I would need a much bigger environment.
In addition, I noticed one more minor error in the Oracle documentation for the adapter, related to the “keyValuePairDelimiter” parameter. In the documentation it is replaced by “keyValueDelimiter”. It is just a small typo: “keyValueDelimiter” appears twice, the first time correctly, and the second time in the place where “keyValuePairDelimiter” is supposed to be. Here is the link.

In summary, I can say that despite some minor issues, the adapters and GoldenGate for Big Data showed quite a mature state and readiness for real work. I think it is good, robust technology and, hopefully, its development will continue to improve with new releases. I am looking forward to using it in a real production environment with a significant workload. In the following posts I will try to test different DDL operations and maybe some other data types. Stay tuned.

Categories: DBA Blogs

Learn how to optimize text searches in SQL Server 2014 by using Full-Text Search – Part 1

Tue, 2016-05-10 09:50

In this article, we’ll cover the functionality available in SQL Server 2014 for textual searches, known as Full-Text Search, along with its installation and implementation. Additionally, we will see how to develop textual searches using the predicates CONTAINS, CONTAINSTABLE, FREETEXT and FREETEXTTABLE, and how to use the FILESTREAM feature to improve the searching and storage of binary data in tables.

Searching based on words and phrases is one of the main features of web search tools such as Google, and of digital document management systems. To perform these searches efficiently, many developers create highly complex applications that still do not have the intelligence needed to find terms and phrases in the columns that store text and digital documents in database tables.

What the vast majority of these professionals don’t know is that SQL Server has an advanced tool for textual searches: Full-Text Search (FTS).

FTS has been present in SQL Server since version 7, and with it textual searches can be performed both in columns that store characters and in columns that store documents (for example, Office documents and PDFs) in their native form.

With options like searches for words and phrases, recognition of different languages, derivation of words (for example: play, played and playing), the possibility of developing a thesaurus, the creation of ranked results, and the elimination of stopwords from a search, FTS becomes a powerful tool for textual searches. The main factors driving the use of textual searches are:

  • Databases are increasingly used as repositories of digital documents;
  • The cost of storing information has dropped considerably, enabling the storage of gigabytes, terabytes and even petabytes;
  • New types of digital documents are constantly being created, and the requirements for their storage and subsequent searching are becoming larger and more complex;
  • Developers need a robust and reliable interface for performing intelligent textual searches.

FTS has great advantages over other alternatives for textual search, such as the LIKE operator. The main tasks you can perform with FTS are:

  • Textual searches based on linguistics. A linguistic search is based on words or phrases in a particular language, taking into consideration verb conjugation, derived words, accents, and other features. Unlike the LIKE predicate, FTS uses an efficient indexing structure to perform the textual search;
  • Automatic removal of the stopwords present in a textual search. Stopwords are words that don’t add anything to the result of the search, such as from, to, the, a, an;
  • Assigning weights to the searched terms, making certain words more important than others within the same textual search;
  • Generation of ranked results, allowing a better view of the documents that are most relevant to the search carried out;
  • Indexing and searching of the most diverse types of digital documents. With FTS you can carry out searches in text files, spreadsheets, ZIP files, and others.

In this article we will describe the architecture of FTS, its installation and configuration, the main T-SQL commands used in textual searches, the use of FTS in conjunction with FILESTREAM, and also some techniques to optimize searches through FTS.

FTS architecture

The FTS architecture has several components working in conjunction with the SQL Server query processor to perform textual searches efficiently. Figure 1 illustrates the major components of the FTS architecture. Let’s look at some of them:

  • Query client: The client application sends the textual queries to the SQL Server query processor. It is the responsibility of the client application to ensure that the textual queries are written correctly, following the FTS syntax;
  • SQL Server process (sqlservr.exe): The SQL Server process contains the query processor and also the FTS engine, which compiles and executes the textual queries. The integration of the FTS engine into the SQL Server process offers a significant performance boost because it allows the query processor to build much more efficient execution plans for textual searches;
  • SQL Server query processor: The query processor has multiple subcomponents that are responsible for validating the syntax, compiling, generating execution plans and executing the SQL queries;
  • Full-Text Engine: When the SQL Server query processor receives an FTS query, it forwards the request to the FTS engine. The FTS engine is responsible for validating the FTS query syntax, checking the full-text index, and then working together with the SQL Server query processor to return the textual search results;
  • Indexer: The indexer works in conjunction with other components to populate the full-text index;
  • Full-Text Index: The full-text index contains the most relevant words and their respective positions within the columns included in the index;
  • Stoplist: A stoplist is a list of stopwords for textual searches. The indexer queries the stoplist during the indexing process and during the execution of textual searches to eliminate the words that don’t add value to the search. SQL Server 2014 stores the stoplists within the database itself, which makes their administration easier;
  • Thesaurus: The thesaurus is an XML file (stored externally to the database) in which you can define a list of synonyms to be used in textual searches. The thesaurus must be based on the language that will be used in the search. The full-text engine reads the thesaurus file at query execution time to check for synonyms that can increase the quality and comprehensiveness of the search;
  • Filter daemon host (fdhost.exe): Responsible for managing the filtering, word breaker and stemmer processes;
  • SQL Full-Text Filter Daemon Launcher (fdlauncher.exe): The process that starts the filter daemon host (fdhost.exe) when the full-text engine needs to use any of the processes it manages.


Figure 1. Architecture of FTS.

To better understand the process of creating, using and maintaining the structure of full-text indexes, you must also know the meaning of some important concepts. They are:

  • Term: The word, phrase or character used in the textual search;
  • Full-Text Catalog: A group of full-text indexes;
  • Word breaker: The process that finds the word boundaries in a sentence, based on the grammar rules of the language selected when the full-text index is created;
  • Token: A word, phrase or character defined by the word breaker;
  • Stemmer: The process that generates the different verb forms of a word, based on the grammar rules of the language selected when the full-text index is created;
  • Filter: The component responsible for extracting textual information from documents stored with the data type varbinary(max) and sending this information to the word breaker process.

Indexing process

The indexing process is responsible for the initial population of a full-text index and for updating that index when data modifications occur on the columns indexed by FTS. This process of initializing and updating the full-text index is called a crawl.

When the crawl process is started, the FTS component known as the protocol handler accesses the data in the table being indexed and begins loading the existing content of that table into memory, a step also known as streaming. To access the data stored on disk, the protocol handler allows FTS to communicate with the storage engine. After streaming ends, the filter daemon host process performs the data filtering and initiates the word breaker and stemmer processes to fill in the full-text index.

During the indexing process the stoplist is queried to remove stopwords, so that the structure of the full-text index is filled only with words that are meaningful for textual searches. The last step of the indexing process is known as a master merge, in which all the indexed words are grouped into a single full-text index.
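As a side note, stoplists themselves are managed with plain T-SQL; here is a minimal sketch (the stoplist, word and table names are assumptions, not taken from the article):

-- Create a stoplist based on the system stoplist and add a custom stopword
CREATE FULLTEXT STOPLIST myStoplist FROM SYSTEM STOPLIST;

ALTER FULLTEXT STOPLIST myStoplist ADD 'lorem' LANGUAGE 'English';

-- Associate the stoplist with an existing full-text index
ALTER FULLTEXT INDEX ON dbo.Documents SET STOPLIST myStoplist;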

Although the indexing process requires high I/O consumption, it does not need to block the data being indexed. However, a query performed using a full-text index during the indexing process can return an incomplete result.

Full-Text query processing

Full-text query processing uses the same word and phrase delimiters that were defined by the word breaker during the indexing process. You can also use additional components, such as the stemmer and the thesaurus, depending on the full-text predicates (CONTAINS or FREETEXT) used in the textual search. The use of full-text predicates will be discussed later in this series.

The stemmer process generates inflectional forms of the searched words. For example, a search on the term “play” also searches for the terms “played” and “playing”, in addition to the term “play” itself.

Through rules created in the thesaurus file you can use synonyms to replace or expand the searched terms. For example, when performing a textual search using the term “Ruby”, the full-text engine can replace it with the synonym “red”, or else automatically expand the search to also consider the terms “red”, “wine” and “scarlet” alongside “Ruby”.
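To make the two behaviours concrete, here is a hedged sketch of such queries (the dbo.Documents table and its DocContent column are assumptions used only for illustration):

-- Inflectional forms produced by the stemmer (play, plays, played, playing)
SELECT DocId, DocTitle
FROM dbo.Documents
WHERE CONTAINS(DocContent, 'FORMSOF(INFLECTIONAL, play)');

-- Synonyms taken from the thesaurus file for the search language
SELECT DocId, DocTitle
FROM dbo.Documents
WHERE CONTAINS(DocContent, 'FORMSOF(THESAURUS, Ruby)');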

After processing the full-text query, the full-text engine provides information to the SQL Server query processor that assists in creating an execution plan optimized for the textual search. Because the full-text engine and the query processor are tightly integrated (both are components of the SQL Server process), textual searches can be executed in a much more optimized way.

In the next post of this 4-part series, we will learn how to install FTS and how to use it. Stay tuned!

Categories: DBA Blogs

Did You Know: Oracle EBS R12.2 #1 – Managing OHS

Mon, 2016-05-09 15:21

For a long time now I’ve wanted to start a blog series outlining the not-so-obvious things that have changed in the new Oracle E-Business Suite R12.2. Here comes the new “Did You Know” series specifically for Oracle E-Business Suite! Let’s start this series with Apache, aka Oracle HTTP Server.

People are already aware that OHS 10g/OC4J has been replaced with OHS 11g/WebLogic in R12.2.x. On the surface it looks like a simple change, but a lot has changed under the hood. In Oracle EBS 11i/R12.1, one could change the Apache port by just updating the Context XML file and running autoconfig.

In R12.2, we have to change OHS settings like the web port by logging into the EM console URL and then running the $AD_TOP/bin/adSyncContext.pl script to sync the OHS config parameters back to the Context XML file. For now, this script is only needed for the OHS configuration. Any changes for WebLogic also need to be made in the WebLogic console URL, not in the Context XML file, but those changes are automatically propagated to the XML file by the adRegisterWLSListeners.pl script.

You can get more details of the procedures by reviewing the MOS notes below.

  • E-Business Suite 12.2 Detailed Steps To Change The R12.2 Default Port To 80 (Doc ID 2072420.1)
  • Managing Configuration of Oracle HTTP Server and Web Application Services in Oracle E-Business Suite Release 12.2 (Doc ID 1905593.1)
Categories: DBA Blogs

MySQL InnoDB’s Full Text Search overview

Fri, 2016-05-06 12:56

NOTE: If you want to read and play with the interactive application, please go to the ShinnyApps article. It has been developed using Shiny/R in order to allow you to see the effects of the algorithms.

Thanks to Valerie Parham-Thompson at Pythian and Daniel Prince at Oracle.

The Github repository contains the code to generate and load the data, and also the Shiny/R code.

Some initial thoughts

A couple of weeks ago one of our customers came up with a question regarding FTS on the InnoDB engine. Although the question is not answered in the current article, it led me to the conclusion that FTS is sometimes misunderstood.

The point of this article is to show dynamically how the search algorithms work, using non-fictional data (the data sources were downloaded from the Gutenberg project) within an easy interface (please see the bottom of the ShinnyApps post).

In order to show the effect of field sizes on the query expansion algorithm, you will see two main tables (bookContent and bookContentByLine), both containing the same books in different layouts: by line and by paragraph. You’ll see the noise generated by the QUERY EXPANSION algorithm when the phrases are too large.

For the sake of simplicity, in this article we won’t go through the FTS parsers. That is possible material for a future post.

Why do I consider FTS sometimes misunderstood?

FTS is a technology that can be used for many purposes, not only simple searches. Generally, FTS engines are deployed as a service for web or document searches, which usually calls for technologies like Solr, ElasticSearch or Sphinx. However, certain business rules require complex searches, and having such a feature inside the RDBMS can be a win.

An RDBMS isn’t a good place for a massive amount of FTS queries if you aren’t using any of the join capabilities it offers, or its ACID properties.

As I said above, FTS in an RDBMS is totally acceptable if you are using at least one main RDBMS feature required by your business model.

Action!

To start showing the effects of the algorithms, the following example searches for the word ‘country’ using query expansion. This means that we are not looking only for exact matches, but also for the entries that appear the most when an exact match has been found.

In the SELECT clause you’ll see two FTS expressions, using NATURAL LANGUAGE mode with query expansion and BOOLEAN mode respectively.

View the code on Gist.
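The Gist embed does not render here; a hypothetical reconstruction of that query (the content column name is my assumption, and the table is the by-line one) would be:

SELECT bookid,
       content,
       MATCH(content) AGAINST ('country' WITH QUERY EXPANSION) AS QERank,
       MATCH(content) AGAINST ('country' IN BOOLEAN MODE)      AS BoolRank
  FROM bookContentByLine
 WHERE MATCH(content) AGAINST ('country' WITH QUERY EXPANSION)
 ORDER BY QERank DESC
 LIMIT 10;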

The noise generated by the query expansion is expected and described in the official documentation here.

The interesting case is the following row, which has two exact occurrences (you can see the positions 1 and 63) and yet does not get the highest rank when using query expansion. Remember, this is expected.


Text: "country districts. As Lucca had five gates, he divided his own country"
bookid: 1232
pos: 1,63
QERank: 80
BoolRank: 14

This is even worse when using large sentences. In the example below you will see the same query against the table that stores data by paragraph. The boolean rank places some of the entries way above the others; the query expansion, however, puts at the top records that do not necessarily have many exact matches.

View the code on Gist.

Query expansion is useful when you intend to find entries that contain words which frequently appear together with the search term. Having large text fields increases the probability of having more words that appear together with the search term. In the case of the bookContent table (the by-paragraph table), the average field size is 443.1163 characters.

The INNODB_FT_INDEX_TABLE

There is a way to play with the contents of the FTS indexes. As you may have noticed in the previous examples, I used the set global innodb_ft_aux_table = 'ftslab/bookContent'; statement, which loads the index content into memory for easy querying.

If you use RDS, the innodb_ft_aux_table option is not available, as it is GLOBAL and requires the SUPER privilege.

For example, you can easily get the most frequent tokens:

View the code on Gist.
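This Gist is not shown here either; a hypothetical equivalent against the loaded index table would be something like:

SET GLOBAL innodb_ft_aux_table = 'ftslab/bookContent';

SELECT word, COUNT(DISTINCT doc_id) AS docs
  FROM information_schema.INNODB_FT_INDEX_TABLE
 GROUP BY word
 ORDER BY docs DESC
 LIMIT 20;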

We can query the index contents with a simple SQL statement like the following:

View the code on Gist.

In the examples shown before, there is no intention to compare the rank scores, as they are based on different algorithms. The idea is to show that QUERY EXPANSION can produce undesired results in some cases due to its mechanism.

Building custom stopwords

This probably isn’t very useful information, as most of these words appear too frequently and are modal verbs, adverbs, pronouns, determiners, etc. It could be the case that you are not interested in indexing those words. If that’s the case, you can add them as stopwords in your own stopwords table, especially if you are more interested in boolean searches and can afford to lose some of the language’s expressions.

We can build a custom stopwords table based on our current data:

View the code on Gist.

Let’s build our stopwords table using both default and new entries and keeping the alphabetical order:

View the code on Gist.
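A hypothetical version of that Gist (the ftslab.my_stopwords name and the extra words are assumptions) could look like this:

-- InnoDB expects the stopword table to be InnoDB with a single VARCHAR column named "value"
CREATE TABLE ftslab.my_stopwords (value VARCHAR(30)) ENGINE=InnoDB;

-- Default stopwords plus our own candidates, kept in alphabetical order
INSERT INTO ftslab.my_stopwords (value)
SELECT value FROM information_schema.INNODB_FT_DEFAULT_STOPWORD
UNION
SELECT 'said'
UNION
SELECT 'upon'
ORDER BY 1;

-- Point InnoDB at the new table; FULLTEXT indexes must be rebuilt to pick it up
SET GLOBAL innodb_ft_server_stopword_table = 'ftslab/my_stopwords';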

The idea behind choosing our own stopwords is to measure how much index space we save by filtering out those words that are extremely frequent and don’t add necessary meaning to the search. This topic could be covered in a separate blog post.

Going ahead on choosing stop words

The full article is amazingly interesting. In brief, it says that the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on (rank-frequency distribution is an inverse relation).

Considerations and recommendations

– Use QUERY EXPANSION only if you are interested in searching relations over exact matches. Remember that the field size is crucial when using it.
– FTS is not the best fit for exact string matches in single columns. You don’t want to use FTS for searching emails, names or last names stored in a single column, for instance. For those, you’ll probably use other techniques such as reverse searches, the exact match operator (=) or hashing (CRC32 for emails or texts shorter than 255 characters).
– Keep your FTS indexes short. Do not add ALL the text columns. Parse the user’s search in your application first and adapt the query.
– If you are using BOOLEAN MODE, you can use the rank score to filter rows. MySQL is clever enough to optimize the FTS functions to avoid double execution. You can do this using something like: match(content,title) against ("first (third)") > 1. Generally, scores lower than 1 can be ignored when using boolean or natural language mode searches.
– `OPTIMIZE TABLE` does a rebuild of the table. To avoid this, set innodb_optimize_fulltext_only=1 in order to do incremental maintenance on the table.
– Recall that NATURAL LANGUAGE MODE does not interpret the operators the way BOOLEAN MODE does. This affects the ranking score (try +bad (thing), for instance).
– If you plan to order by rank, it is not necessary to specify an `ORDER BY` clause, as InnoDB does the ordering after retrieving the doc ids. Also, the behavior differs from the default in that it returns the heaviest ranks at the top (like an ORDER BY rank DESC).
– If you come from MyISAM’s FTS implementation, recall that the ranking scoring is different.
– Create the FULLTEXT index after the data is loaded (InnoDB bulk load). When restoring FTS backups, you will probably hit the “ERROR 182 (HY000) at line nn: Invalid InnoDB FTS Doc ID”.
– Try to avoid using more than one FTS expression in the WHERE clause. Keep in mind that this affects the order of the results and consumes a considerable amount of CPU. InnoDB orders by the latest expression in the WHERE clause. WL#7123.
– Also, leaving the rank information out of the projection (the SELECT clause) and using other aggregations like count(*) will make use of the “no ranking” FT hints. The LIMIT hint won’t be used if an ORDER BY and the MATCH clause in the projection are invoked explicitly.

View the code on Gist.

– If you plan to use the FTS_DOC_ID column with the AUTO_INCREMENT option, keep in mind that there is a limitation regarding it: you must declare the column with a single-column PRIMARY KEY constraint or a UNIQUE index, and its data type is restricted to `bigint unsigned`. For example:

View the code on Gist.
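A hedged sketch of such a table definition (every name other than FTS_DOC_ID and FTS_DOC_ID_INDEX is an assumption):

CREATE TABLE ftslab.bookContentCustom (
  FTS_DOC_ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  bookid     INT,
  content    TEXT,
  UNIQUE KEY FTS_DOC_ID_INDEX (FTS_DOC_ID),
  FULLTEXT KEY ftx_content (content)
) ENGINE=InnoDB;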

FT_QUERY_EXPANSION_LIMIT

This variable controls the number of top matches when using `WITH QUERY EXPANSION` (affects only MyISAM). Reference.

Bug 80347 – Invalid InnoDB FTS Doc ID


emanuel@3laptop ~/sandboxes/rsandbox_5_7_9 $ ./m dumpTest < full.dump
ERROR 182 (HY000) at line 73: Invalid InnoDB FTS Doc ID

emanuel@3laptop ~/sandboxes/rsandbox_5_7_9 $ ./m dumpTest < ddl.dump
emanuel@3laptop ~/sandboxes/rsandbox_5_7_9 $ ./m dumpTest < onlyData.dump
emanuel@3laptop ~/sandboxes/rsandbox_5_7_9 $ ./m dumpTest < full.dump
ERROR 182 (HY000) at line 73: Invalid InnoDB FTS Doc ID

mysqldump is not very clever if you use `FTS_DOC_ID`:


2016-02-13T22:11:53.125300Z 19 [ERROR] InnoDB: Doc ID 10002 is too big. Its difference with largest used Doc ID 1 cannot exceed or equal to 10000

It takes dumps without considering the restriction coded in `innobase/row/row0mysql.cc`:


Difference between Doc IDs are restricted within
4 bytes integer. See fts_get_encoded_len()

The fix for this is to back up the table in chunks of 10,000 documents.

Other useful links

Fine tuning
Performance
Maintenance: innodb_optimize_fulltext_only
Writing FTS parser plugins

Categories: DBA Blogs

InnoDB flushing and Linux I/O

Thu, 2016-05-05 12:06

Since the documentation is not very clear to me on the topic of InnoDB flushing in combination with Linux I/O (specifically the write system call), I decided to put together this article in hopes of shedding some light on the matter.

How Linux does I/O

By default, the write() system call returns after all data has been copied from the user space buffer into the kernel space buffers. There is no guarantee that the data has actually reached the physical storage.

The fsync() call is our friend here. This will block and return only after the data and metadata (e.g. file size, last update time) is completely transferred to the actual physical storage.

There is also fdatasync() which only guarantees the data portion will be transferred, so it should be faster.

There are a few options that we can specify at file open time, that modify the behaviour of write():

O_SYNC

In this case, the write() system call will still write data to kernel space buffers, but it will block until the data is actually transferred from the kernel space buffers to the physical storage. There is no need to call fsync() after.

O_DIRECT

This completely bypasses any kernel space buffers, but requires that the writes are the same size as the underlying storage block size (usually 512 bytes or 4k). By itself, it does not guarantee that the data is completely transferred to the device when the call returns.

O_SYNC + O_DIRECT

As stated above, we would need to use both options together to guarantee true synchronous I/O.

Relation with InnoDB flushing

Innodb_flush_method parameter controls which options will be used by MySQL when writing to files:

At the time of this writing, we have the following options:

NULL

This is the default value, and it is equivalent to specifying the fsync option.

fsync

Both data and redo log files will be opened without any special options, and fsync() will be used when the db needs to make sure the data is transferred to the underlying storage.

O_DSYNC

This one is confusing, as O_DSYNC is actually replaced with O_SYNC within the source code before calling open(). It is mentioned that this is due to problems on certain Unix versions. So O_SYNC will be used to open the log files, and no special options for the data files. This means fsync() needs to be used to flush the data files only.

O_DIRECT

Data files are opened with O_DIRECT. Log files are opened with no extra options. Some filesystems (e.g. XFS) do not guarantee metadata without the fsync() call, so it is still used as safety measure.

O_DIRECT_NO_FSYNC

InnoDB uses O_DIRECT during flushing I/O, but skips the fsync() system call afterwards. This can provide some performance benefits if you are using a filesystem that does not require the fsync() to sync metadata.

I am deliberately not mentioning the experimental options littlesync and nosync.

There is also an extra option in Percona Server:

ALL_O_DIRECT

It uses O_DIRECT to open the log files and data files and uses fsync() to flush both the data and the log files.

Which InnoDB flushing method should I use?

The general consensus is that if you have a battery-backed write cache or a fast I/O subsystem (e.g. SSDs), you should use the O_DIRECT method. However, it is better practice to run tests to determine which method provides better performance for each particular environment.
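As a quick sanity check, here is a minimal sketch; note that innodb_flush_method is not dynamic, so changing it means editing the configuration file and restarting the server:

-- Show which flush method the running server is using
SHOW GLOBAL VARIABLES LIKE 'innodb_flush_method';

-- To change it, set e.g. innodb_flush_method = O_DIRECT under [mysqld]
-- in my.cnf and restart mysqld.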


One downside of using O_DIRECT is that it requires the innodb-buffer-pool-size to be configured correctly. For example, if you accidentally leave your buffer pool size at the default value of 128M, but have 16G of RAM, the buffer pool contents will at least sit in the filesystem cache. This will not be true if you have O_DIRECT enabled (I would like to thank Morgan Tocker for his contribution regarding this section of the post).


Categories: DBA Blogs

Log Buffer #472: A Carnival of the Vanities for DBAs

Thu, 2016-05-05 09:14

This Log Buffer Edition takes into account blog posts from Oracle, SQL Server and MySQL.

Oracle:

Enterprise Manager Support Files 101- The EMOMS files

From time to time we see a complaint on OTN about the stats history tables being the largest objects in the SYSAUX tablespace and growing very quickly.

Delphix replication and push button cloud migration

PS360: A Utility to Extract and Present PeopleSoft Configuration and Performance Data

Contemplating Upgrading to OBIEE 12c?

SQL Server:

Modifying the SQL Server Model System Database to Customize New Database Settings

Azure SQL Database Elastic Database Jobs

SQL Server Resource Governor

Add a Custom Index in Master Data Services 2016

Unified Approach to Generating Documentation for PowerShell Cmdlets

MySQL:

Writing SQL that works on PostgreSQL, MySQL and SQLite

MariaDB MaxScale 1.4.2 GA is available for download

MariaDB ColumnStore, a new beginning

Planets9s – Watch the replay: Become a MongoDB DBA (if you’re really a MySQL user)

Upgrading to MySQL 5.7, focusing on temporal types

Categories: DBA Blogs

How to Deal with MetaData Lock

Thu, 2016-05-05 08:59
What is MetaData Lock?

MySQL uses metadata locking to manage concurrent access to database objects, and to ensure data consistency when performing modifications to the schema: DDL operations. Metadata locking applies not just to tables, but also to schemas and stored programs (procedures, functions, triggers, and scheduled events).

In this post I am going to cover metadata locks on tables and triggers, that are usually seen by DBAs during regular operations/maintenance.

Kindly refer to these four different connections to the MySQL instance:

Screen Shot 2016-04-19 at 2.58.52 pm


The screenshot shows that an uncommitted transaction can cause a metadata lock for an ALTER operation. The ALTER will not proceed until the transaction is committed or rolled back. What is worse, after the ALTER is issued, any queries against that table (even simple SELECT queries) will be blocked. If the ALTER is an online DDL operation, available in 5.6+, queries will proceed as soon as the ALTER begins.
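The same scenario can be reproduced from four sessions with a sketch like the one below (the table name t1 is an assumption, not taken from the screenshot):

-- Session 1: leave a transaction open with an uncommitted read
BEGIN;
SELECT * FROM t1 LIMIT 1;

-- Session 2: blocks, waiting for the metadata lock held by session 1
ALTER TABLE t1 ADD COLUMN c2 INT;

-- Session 3: even a simple SELECT now queues up behind the ALTER
SELECT COUNT(*) FROM t1;

-- Session 4: both waiting sessions show "Waiting for table metadata lock"
SHOW PROCESSLIST;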

Refer to this video tutorial on MySQL Metadata Locks for further context.

These days we have the DBAs’ favourite tool, “pt-online-schema-change” (osc). Let’s have a look at what will happen if we run osc instead of ALTER.

Screen Shot 2016-04-19 at 3.07.26 pm

OSC gets stuck on the metadata lock at the point of creating the triggers on the table.

Let’s jump to the second topic: how can we mitigate MDL issues?

Mitigating the MetaData Lock Issues

There are various solutions to tackling MDL:

  1. Appropriately setting the wait_timeout variable, which will kill stuck/sleeping threads after a certain time.
  2. Configuring pt-kill to get rid of stuck/sleeping threads.
  3. Fixing application code where transactions are not committed after performing DB queries.
How to kill Sleep Connections in RDS which are causing MDL

If you are on RDS and your MySQL has a bunch of Sleep threads and you don’t know which connection is causing the metadata lock, then you have to kill all the Sleep queries that have been in MySQL for more than a certain time. As we know, “kill thread_id” is not permitted in RDS, but you can use the query below to generate the exact statements to kill the Sleep threads.

Example Output:

mysql> SELECT CONCAT('CALL mysql.rds_kill ( ',id,')',';') FROM INFORMATION_SCHEMA.PROCESSLIST WHERE COMMAND='Sleep' AND TIME > 10 ;
+---------------------------------------------+
| CONCAT('CALL mysql.rds_kill ( ',id,')',';') |
+---------------------------------------------+
| CALL mysql.rds_kill ( 5740758); |
| CALL mysql.rds_kill ( 5740802); |
| CALL mysql.rds_kill ( 5740745); |
| CALL mysql.rds_kill ( 5740612); |
| CALL mysql.rds_kill ( 5740824); |
| CALL mysql.rds_kill ( 5740636); |
| CALL mysql.rds_kill ( 5740793); |
| CALL mysql.rds_kill ( 5740825); |
| CALL mysql.rds_kill ( 5740796); |
| CALL mysql.rds_kill ( 5740794); |
| CALL mysql.rds_kill ( 5740759); |
| CALL mysql.rds_kill ( 5740678); |
| CALL mysql.rds_kill ( 5740688); |
| CALL mysql.rds_kill ( 5740817); |
| CALL mysql.rds_kill ( 5740735); |
| CALL mysql.rds_kill ( 5740818); |
| CALL mysql.rds_kill ( 5740831); |
| CALL mysql.rds_kill ( 5740795); |
| CALL mysql.rds_kill ( 4926163); |
| CALL mysql.rds_kill ( 5740742); |
| CALL mysql.rds_kill ( 5740797); |
| CALL mysql.rds_kill ( 5740832); |
| CALL mysql.rds_kill ( 5740751); |
| CALL mysql.rds_kill ( 5740760); |
| CALL mysql.rds_kill ( 5740752); |
| CALL mysql.rds_kill ( 5740833); |
| CALL mysql.rds_kill ( 5740753); |
| CALL mysql.rds_kill ( 5740722); |
| CALL mysql.rds_kill ( 5740723); |
| CALL mysql.rds_kill ( 5740724); |
| CALL mysql.rds_kill ( 5740772); |
| CALL mysql.rds_kill ( 5740743); |
| CALL mysql.rds_kill ( 5740744); |
| CALL mysql.rds_kill ( 5740823); |
| CALL mysql.rds_kill ( 5740761); |
| CALL mysql.rds_kill ( 5740828); |
| CALL mysql.rds_kill ( 5740762); |
| CALL mysql.rds_kill ( 5740763); |
| CALL mysql.rds_kill ( 5740764); |
| CALL mysql.rds_kill ( 5740773); |
| CALL mysql.rds_kill ( 5740769); |
| CALL mysql.rds_kill ( 5740770); |
| CALL mysql.rds_kill ( 5740771); |
| CALL mysql.rds_kill ( 5740774); |
| CALL mysql.rds_kill ( 5740784); |
| CALL mysql.rds_kill ( 5740789); |
| CALL mysql.rds_kill ( 5740790); |
| CALL mysql.rds_kill ( 5740791); |
| CALL mysql.rds_kill ( 5740799); |
| CALL mysql.rds_kill ( 5740800); |
| CALL mysql.rds_kill ( 5740801); |
| CALL mysql.rds_kill ( 5740587); |
| CALL mysql.rds_kill ( 5740660); |
+---------------------------------------------+
53 rows in set (0.02 sec)
  1. Capture sql queries to kill Sleep threads

mysql -htest-server.us-west-2.rds.amazonaws.com. --skip-column-names -e 'SELECT CONCAT("CALL mysql.rds_kill ( ",id,")",";") FROM INFORMATION_SCHEMA.PROCESSLIST WHERE COMMAND="Sleep" AND TIME > 10' > kill_sleep_threads.sql

2. Execute the queries from the mysql prompt

mysql -htest-server.us-west-2.rds.amazonaws.com.

mysql> source kill_sleep_threads.sql
Improvements in MySQL 5.7 related to MDL

Generally, we would want to kill as few connections as possible. But the trouble with metadata locks prior to 5.7 is that there is no insight available into which threads are taking the metadata lock. In MySQL 5.7, there are several improvements in getting insight into metadata lock information.

The Performance Schema now exposes metadata lock information:

  • Locks that have been granted (shows which sessions own which current metadata locks)
  • Locks that have been requested but not yet granted (shows which sessions are waiting for which metadata locks).
  • Lock requests that have been killed by the deadlock detector or timed out and are waiting for the requesting session’s lock request to be discarded

This information enables you to understand metadata lock dependencies between sessions. You can see not only which lock a session is waiting for, but which session currently holds that lock.

The Performance Schema now also exposes table lock information that shows which table handles the server has open, how they are locked, and by which sessions.

To check who holds the metadata lock in MySQL 5.7, we have to enable the global_instrumentation consumer and the wait/lock/metadata/sql/mdl instrument.

Below is an example of enabling global_instrumentation and wait/lock/metadata/sql/mdl:

mysql> UPDATE performance_schema.setup_consumers SET ENABLED = 'YES' WHERE NAME = 'global_instrumentation';

Query OK, 0 rows affected (0.00 sec)

Rows matched: 1  Changed: 0  Warnings: 0

mysql> UPDATE performance_schema.setup_instruments SET ENABLED = 'YES' WHERE NAME = 'wait/lock/metadata/sql/mdl';

Query OK, 1 row affected (0.00 sec)

Rows matched: 1  Changed: 1  Warnings: 0

Once global_instrumentation and wait/lock/metadata/sql/mdl are enabled, the query below will show the lock status of the connections.

 

mysql> SELECT OBJECT_TYPE, OBJECT_SCHEMA, OBJECT_NAME, LOCK_TYPE, LOCK_STATUS, THREAD_ID, PROCESSLIST_ID, PROCESSLIST_INFO FROM performance_schema.metadata_locks INNER JOIN performance_schema.threads ON THREAD_ID = OWNER_THREAD_ID WHERE PROCESSLIST_ID <> CONNECTION_ID();
+-------------+---------------+-------------+---------------------+-------------+-----------+----------------+------------------------------------------+
| OBJECT_TYPE | OBJECT_SCHEMA | OBJECT_NAME | LOCK_TYPE | LOCK_STATUS | THREAD_ID | PROCESSLIST_ID | PROCESSLIST_INFO |
+-------------+---------------+-------------+---------------------+-------------+-----------+----------------+------------------------------------------+
| TABLE | sbtest | sbtest1 | SHARED_READ | GRANTED | 29 | 4 | NULL |
| GLOBAL | NULL | NULL | INTENTION_EXCLUSIVE | GRANTED | 30 | 5 | alter table sbtest1 add key idx_pad(pad) |
| SCHEMA | sbtest | NULL | INTENTION_EXCLUSIVE | GRANTED | 30 | 5 | alter table sbtest1 add key idx_pad(pad) |
| TABLE | sbtest | sbtest1 | SHARED_UPGRADABLE | GRANTED | 30 | 5 | alter table sbtest1 add key idx_pad(pad) |
| TABLE | sbtest | sbtest1 | EXCLUSIVE | PENDING | 30 | 5 | alter table sbtest1 add key idx_pad(pad) |
| TABLE | sbtest | sbtest1 | SHARED_READ | PENDING | 31 | 6 | select count(*) from sbtest1 |
+-------------+---------------+-------------+---------------------+-------------+-----------+----------------+------------------------------------------+
6 rows in set (0.00 sec)

Here the SHARED_READ lock held by PROCESSLIST_ID 4 is GRANTED, which blocks the EXCLUSIVE lock requested by the ALTER TABLE (PROCESSLIST_ID 5), which in turn leaves the subsequent SELECT (PROCESSLIST_ID 6) PENDING behind it.
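
To see directly which connection is blocking which, a self-join of performance_schema.metadata_locks along the lines below works; this is only a sketch built from the columns already shown above:

SELECT w.OBJECT_SCHEMA,
       w.OBJECT_NAME,
       tw.PROCESSLIST_ID   AS waiting_connection,
       tb.PROCESSLIST_ID   AS blocking_connection,
       tb.PROCESSLIST_INFO AS blocking_query
FROM performance_schema.metadata_locks w
JOIN performance_schema.metadata_locks b
  ON  w.OBJECT_SCHEMA = b.OBJECT_SCHEMA
  AND w.OBJECT_NAME   = b.OBJECT_NAME
  AND w.LOCK_STATUS   = 'PENDING'   -- the waiter
  AND b.LOCK_STATUS   = 'GRANTED'   -- the holder
  AND w.OWNER_THREAD_ID <> b.OWNER_THREAD_ID
JOIN performance_schema.threads tw ON tw.THREAD_ID = w.OWNER_THREAD_ID
JOIN performance_schema.threads tb ON tb.THREAD_ID = b.OWNER_THREAD_ID;

The bundled sys schema in 5.7 exposes similar information, but the self-join above relies only on the tables we already queried.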

Conclusion

As a best practice when running any DDL operation, even with the performance_schema improvements in 5.7, make sure to check the processlist for the presence of MDL waits and check SHOW ENGINE INNODB STATUS for long-running active transactions. Kill the DDL operation while resolving the MDL issue so as to prevent a query pileup. For a temporary fix, implement pt-kill or an appropriate wait_timeout. To solve the metadata lock issue for good, review and fix application code/scripts that leave transactions uncommitted.

Categories: DBA Blogs

Transparent Data Encryption for SQL Server in an Availability Group

Tue, 2016-05-03 13:24

With all the new features in SQL Server 2016 Always On, which you can read up on here, it’s easy to forget about Transparent Data Encryption (TDE). This blog post will focus on TDE.

TDE encrypts database files at rest. What this means is that your .MDF and .NDF files, and consequently your backups, will be encrypted, so you will not be able to detach the database files and restore them on another server unless that server has the same certificate that was used to encrypt the database.

In this blog post I am using SQL Server 2014 and will explain how to enable TDE on an existing availability group (AG) database.

  1. The first thing we need to check is if the server has a master encryption key on all replicas in the AG
USE MASTER
GO
SELECT * FROM sys.symmetric_keys
WHERE name = '##MS_DatabaseMasterKey##'

The screenshot below shows I don’t have a key, so I need to create one.

No Master Encryption Key

  2. Create a Database Master Encryption Key on each of the replicas in the AG. It is important to use a complex password

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'C&mpl£xP@$$Wrd'
GO

  3. Run the code in step 1 and this time you should see the below

Master Encryption Key

  4. Now we need to create a certificate to use for the encryption of the database on the primary replica. This can be accomplished by using the below

CREATE CERTIFICATE BackupEncryptionCert
WITH SUBJECT = 'SQL Server 2014 AdventureWorks2012 Encryption Certificate';
GO

  5. Validate the Certificate

SELECT name, pvt_key_encryption_type_desc, thumbprint FROM sys.certificates

Validate Encryption Key

The thumbprint will be useful because when a database is encrypted, it will indicate the thumbprint of the certificate used to encrypt the Database Encryption Key.  A single certificate can be used to encrypt more than one Database Encryption Key, but there can also be many certificates on a server, so the thumbprint will identify which server certificate is needed

  6. Next we need to back up the certificate on the Primary Replica

BACKUP CERTIFICATE BackupEncryptionCert
TO FILE = 'C:\BackupCertificates\BackupEncryptionCert.bak'
WITH PRIVATE KEY ( FILE = 'C:\BackupCertificates\BackupEncryptionCertKey.bak' ,
ENCRYPTION BY PASSWORD = 'Certi%yC&mpl£xP@$$Wrd')

Encryption Files

The BACKUP CERTIFICATE command will create two files.  The first file is the server certificate itself.  The second file is a “private key” file, protected by a password. Both files and the password will be used to restore the certificate onto other instances.

  7. The files created in step 6 need to be copied to each of the other replicas and created in SQL Server. After the files are copied, the below command can be used to create the certificates

CREATE CERTIFICATE BackupEncryptionCert
FROM FILE = 'C:\BackupCertificates\BackupEncryptionCert.bak'
WITH PRIVATE KEY (FILE = 'C:\BackupCertificates\BackupEncryptionCertKey.bak',
DECRYPTION BY PASSWORD = 'Certi%yC&mpl£xP@$$Wrd');

  8. That’s all the configuration needed for each instance; now we are ready to start encrypting the database. We need to tell SQL Server which encryption algorithm we want to use and which certificate to use. This can be done using the following code on the Primary Replica

Use Adventureworks2012
GO
CREATE DATABASE ENCRYPTION KEY
WITH ALGORITHM = AES_256
ENCRYPTION BY SERVER CERTIFICATE BackupEncryptionCert

  9. Finally, the last step is to enable TDE by executing the below command on the Primary Replica

ALTER DATABASE AdventureWorks2012 SET ENCRYPTION ON
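
To confirm that encryption has finished, you can query the sys.dm_database_encryption_keys DMV; an encryption_state of 3 means the database is fully encrypted (the database name below is just the one from this example):

SELECT DB_NAME(database_id) AS database_name,
       encryption_state,        -- 2 = encryption in progress, 3 = encrypted
       percent_complete
FROM sys.dm_database_encryption_keys
WHERE DB_NAME(database_id) = 'AdventureWorks2012';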

 

And that’s it, I hope you enjoyed this tutorial and found it informative. If you have any questions, please comment below.

Categories: DBA Blogs

Reserved words usage in MySQL

Mon, 2016-05-02 15:07

It is not uncommon to come across MySQL databases where reserved words are in use as identifiers for any kind of database objects.

Perhaps when the application schema was implemented, the words were not reserved yet, and they became reserved later on a subsequent MySQL release.

It is a good practice to check reserved words usage prior to doing any database upgrades, as any newly reserved keywords will cause syntax errors on the new version.

This is usually not a problem if proper quoting is used for referencing the objects, as described on the official manual page.

The actual steps to do this depend on the environment; for example, the following can be configured to tell Hibernate to escape identifiers:

<property name="hibernate.globally_quoted_identifiers" value="true"/>

This does not appear to be documented properly (there is an open bug unresolved at the time of this writing).

However, we cannot assume that all application code is properly escaped to deal with this kind of issue.

So what are the symptoms?

Error 1064 will be reported while trying to use a reserved word:

mysql> CREATE TABLE interval (begin INT, end INT);
ERROR 1064 (42000): You have an error in your SQL syntax ...
near 'interval (begin INT, end INT)'
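
For comparison, the same statement succeeds when the identifiers are quoted with backticks (or with double quotes when the ANSI_QUOTES SQL mode is enabled):

mysql> CREATE TABLE `interval` (`begin` INT, `end` INT);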
How can we check for reserved words?

The following procedure can help you find out if any particular MySQL version’s reserved words are in use:

  1. Using the list on the corresponding manual page, create a text file with one reserved word on each line
  2. Load data into a temporary table
     USE test;
    CREATE TABLE reserved_words (word VARCHAR(50));
    LOAD DATA INFILE 'reserved_words.txt' INTO TABLE test.reserved_words;
    
  3. Check for any column names using reserved keywords
    SELECT table_schema, table_name, column_name, ordinal_position 
    FROM information_schema.columns
    WHERE table_schema NOT IN ( 'mysql', 'information_schema', 'performance_schema' ) 
    AND column_name = ANY ( SELECT * FROM test.reserved_words ) 
    ORDER BY 1,2,4;
  4. Check for any table names using reserved keywords
    SELECT table_schema, table_name
    FROM information_schema.tables
    WHERE table_schema NOT IN ( 'mysql', 'information_schema', 'performance_schema' ) 
    AND table_name = ANY ( SELECT * FROM test.reserved_words );
  5. Check for any procedures or functions
    SELECT routine_schema, routine_name, routine_type
    FROM information_schema.routines
    WHERE routine_schema NOT IN ( 'mysql', 'information_schema', 'performance_schema' )
    AND routine_name = ANY ( SELECT * FROM test.reserved_words );

I hope this post helps you avoid one of the many issues you may encounter during the challenging task of database upgrades.

Categories: DBA Blogs

Leaving Behind the Limits of Binary Thinking for Full Inclusiveness

Fri, 2016-04-29 15:39

diversityaward

When Pythian became the first tech company in Canada to release gender-based metrics last November, we wanted to make a bold statement with the launch of the Pythia Program. And apparently it worked. We’ve already increased the number of female applicants by more than 10% over just one quarter. Our internal Pythia Index has also risen 3% from 56% to 59%. And just this week, Pythian’s CEO Paul Vallée received the WCT Diversity Champion award in recognition of his leadership and efforts to promote diversity in the workplace, and a more inclusive tech industry that promotes men and women from all backgrounds.

Despite a clear case for gender parity, and research confirming the financial return for companies, full inclusion is still ‘controversial’ to implement. A lot of this has to do with the unconscious associations we still have with male and female roles which are placed in opposition. This kind of binary thinking is rampant, especially in our social constructions of what constitutes masculinity and femininity.

When the Pythia Program was in its early stages, we actually noticed a lot of binary, either/or thinking was shaping our assumptions. Off/On. 0/1. We can do this OR that. We can empower women technical professionals OR talk to employees about unconscious bias. We can take a stand on gender diversity OR maintain good relationships with male colleagues. Wait a minute…why can’t we do both?

If we had continued to believe our choices were that limited, it would have seriously eroded any impetus to act on our values of gender equity and inclusiveness. It was time to reframe our thinking, and that’s when we stopped compromising. A bolder stance emerged when we did away with limited, binary thinking that was trapping us in false dichotomies.

Let’s look at this from a data perspective, because that’s what we love and do best.

Current computer chips store information in electrical circuits as binary bits, either in a state of 0 or 1, so there’s a finite amount of data that can be processed. Quantum bits, or ‘qubits’, however, can be in a state of 0, 1, or both at the same time, giving quantum computers mind-blowing processing power.

So if we apply this idea of ‘binary’ vs. ‘quantum’ into a human context, could we potentially become quantum thinkers? Quantum thinking would be holistic, and enable the mind to function at a greater level of complexity. It’s an unlimited approach that ‘either/or’ binary thinking simply does not permit. Wouldn’t it be more exciting to break away from these limitations and move to a higher, more innovative level? Things look different when this binary thinking is disrupted. Start by replacing either/or with ‘and’.

We can help achieve gender parity AND we can achieve diversity in other important areas. Pythian can be inclusive, people-focused AND financially strong. Men can be powerful leaders AND feminist.

There is one big exception, one area where it’s either/or: whether you support the status quo of tech’s current ‘bro culture’, or inclusive leadership that embraces the value of multiple perspectives. Those two states cannot co-exist.

As he accepted his award for Diversity Champion at the WCT Gala on April 27, Pythian CEO Paul Vallée made his position clear “To the women who are working hard in high tech, and who are marginalized by bro culture — which is a real problem, we are in the midst of a culture war — I salute you and keep fighting the good fight because we will prevail. To the male leaders that have taken sides in this battle, the Pythia Index will help you keep score, whether you’re on my team [fighting to end bro culture] or the opposite team.”

As Einstein said, “you can’t solve problems with the same thinking used to create them.” And lack of gender diversity in the tech industry is a problem Pythian wants to help solve.

Categories: DBA Blogs

A Practitioner’s Assessment: Digital Transformation

Thu, 2016-04-28 13:49

 

Rohinee Mohindroo is a guest blogger on Pythian Business Insights.

 

trans·for·ma·tion/ noun: a thorough or dramatic change in form or appearance

The digital transformation rage continues into 2016 with GE, AT&T, GM, Domino’s, Flex, and Starbucks, to name a few. So what’s the big deal?

Technical advances continue to progress at a rapid rate. Digital transformation simply refers to the rate at which the technological trends are embraced by an individual, organization or team.

Organizational culture and vocabulary are leading indicators of the digital transformation maturity level.

blogimagerohinee

Level 1: Business vs. Tech (us vs. them). Each party is fairly ignorant of the value and challenges of the other. Each blames the other for failures and takes credit for successes. Technology is viewed as a competency with a mandate to enable the business.

Level 2: Business and Tech (us and them). Each party is aware of the capability and challenges of the other. Credit for success is shared, failure is not discussed publicly or transparently. Almost everyone  is perceived to be technically literate with a desire to deliver business differentiation.

Level 3: Business is Tech (us). Notable awareness of the business model and technology capabilities and opportunities throughout the organization. Success is expected and failure is an opportunity. The organization is relentlessly focused on learning from customers and partners with a shared goal to continually re-define the business.

Which level best describes you or your organization? Please share what inhibits your organization from moving to the next level.

 

Categories: DBA Blogs

Log Buffer #471: A Carnival of the Vanities for DBAs

Thu, 2016-04-28 10:14

This Log Buffer Edition covers Oracle, SQL Server and MySQL blog posts of the week.

Oracle:

Improving PL/SQL performance in APEX

A utility to extract and present PeopleSoft Configuration and Performance Data

No, Oracle security vulnerabilities didn’t just get a whole lot worse this quarter.  Instead, Oracle updated the scoring metric used in the Critical Patch Updates (CPU) from CVSS v2 to CVSS v3.0 for the April 2016 CPU.  The Common Vulnerability Score System (CVSS) is a generally accepted method for scoring and rating security vulnerabilities.  CVSS is used by Oracle, Microsoft, Cisco, and other major software vendors.

Oracle Cloud – DBaaS instance down for no apparent reason

Using guaranteed restore points to navigate through time

SQL Server:

ANSI SQL with Analytic Functions on Snowflake DB

Exporting Azure Data Factory (ADF) into TFS Source Control

Getting started with Azure SQL Data Warehouse

Performance Surprises and Assumptions : DATEADD()

With the new security policy feature in SQL Server 2016 you can restrict write operations at the row level by defining a block predicate.

MySQL:

How to rename MySQL DB name by moving tables

MySQL 5.7 Introduces a JSON Data Type

Ubuntu 16.04 first stable distro with MySQL 5.7

MariaDB AWS Key Management Service (KMS) Encryption Plugin

MySQL Document Store versus Bug hunter

Categories: DBA Blogs

How to recover space from already deleted files

Wed, 2016-04-27 14:15

Wait, what? Deleted files are gone, right? Well, not so if they’re currently in use, with an open file handle by an application. In the Windows world, you just can’t touch it, but under Linux (if you’ve got sufficient permissions), you can!

Often in the Systems Administration, and Site Reliability Engineering world, we will encounter a disk space issue being reported, and there’s very little we can do to recover the space. Everything is critically important! We then check for deleted files and find massive amounts of space consumed when someone has previously deleted Catalina, Tomcat, or Weblogic log files while Java had them in use, and we can’t restart the processes to release the handles due to the critical nature of the service. Conundrum!

Here at Pythian, we Love Your Data, so I thought I’d share some of the ways we deal with situations like this.

How to recover

First, we grab a list of PIDs with files still open, but deleted. Then iterate over the open file handles, and null them.

PIDS=$(lsof | awk '/deleted/ { if ($7 > 0) { print $2 }; }' | uniq)
for PID in $PIDS; do ll /proc/$PID/fd | grep deleted; done

This could be scripted to automatically null all deleted files, but do so with great care.

Worked example

1. Locating deleted files:

[root@importantserver1 usr]# lsof | head -n 1 ; lsof | grep -i deleted
 COMMAND   PID   USER   FD  TYPE DEVICE SIZE/OFF NODE   NAME
 vmtoolsd  2573  root   7u  REG  253,0  9857     65005  /tmp/vmware-root/appLoader-2573.log (deleted)
 zabbix_ag 3091  zabbix 3wW REG  253,0  4        573271 /var/tmp/zabbix_agentd.pid (deleted)
 zabbix_ag 3093  zabbix 3w  REG  253,0  4        573271 /var/tmp/zabbix_agentd.pid (deleted)
 zabbix_ag 3094  zabbix 3w  REG  253,0  4        573271 /var/tmp/zabbix_agentd.pid (deleted)
 zabbix_ag 3095  zabbix 3w  REG  253,0  4        573271 /var/tmp/zabbix_agentd.pid (deleted)
 zabbix_ag 3096  zabbix 3w  REG  253,0  4        573271 /var/tmp/zabbix_agentd.pid (deleted)
 zabbix_ag 3097  zabbix 3w  REG  253,0  4        573271 /var/tmp/zabbix_agentd.pid (deleted)
 java      23938 tomcat 1w  REG  253,0  0        32155  /opt/log/tomcat/catalina.out (deleted)
 java      23938 tomcat 2w  REG  253,0  45322216 32155  /opt/log/tomcat/catalina.out (deleted)
 java      23938 tomcat 9w  REG  253,0  174      32133  /opt/log/tomcat/catalina.2015-01-17.log (deleted)
 java      23938 tomcat 10w REG  253,0  57408    32154  /opt/log/tomcat/localhost.2016-02-12.log (deleted)
 java      23938 tomcat 11w REG  253,0  0        32156  /opt/log/tomcat/manager.2014-12-09.log (deleted)
 java      23938 tomcat 12w REG  253,0  0        32157  /opt/log/tomcat/host-manager.2014-12-09.log (deleted)
 java      23938 tomcat 65w REG  253,0  363069   638386 /opt/log/archive/athena.log.20160105-09 (deleted)

2. Grab the PIDs:

[root@importantserver1 usr]# lsof | awk '/deleted/ { if ($7 > 0) { print $2 }; }' | uniq
 2573
 3091
 3093
 3094
 3095
 3096
 3097
 23938

Show the deleted files that each process still has open (and is consuming space):

[root@importantserver1 usr]# export PIDS=$(lsof | awk '/deleted/ { if ($7 > 0) { print $2 }; }' | uniq)
[root@importantserver1 usr]# for PID in $PIDS; do ll /proc/$PID/fd | grep deleted; done
 lrwx------ 1 root root 64 Mar 21 21:15 7 -> /tmp/vmware-root/appLoader-2573.log (deleted)
 l-wx------ 1 root root 64 Mar 21 21:15 3 -> /var/tmp/zabbix_agentd.pid (deleted)
 l-wx------ 1 root root 64 Mar 21 21:15 3 -> /var/tmp/zabbix_agentd.pid (deleted)
 l-wx------ 1 root root 64 Mar 21 21:15 3 -> /var/tmp/zabbix_agentd.pid (deleted)
 l-wx------ 1 root root 64 Mar 21 21:15 3 -> /var/tmp/zabbix_agentd.pid (deleted)
 l-wx------ 1 root root 64 Mar 21 21:15 3 -> /var/tmp/zabbix_agentd.pid (deleted)
 l-wx------ 1 root root 64 Mar 21 21:15 3 -> /var/tmp/zabbix_agentd.pid (deleted)
 l-wx------ 1 tomcat tomcat 64 Mar 21 21:15 1 -> /opt/log/tomcat/catalina.out (deleted)
 l-wx------ 1 tomcat tomcat 64 Mar 21 21:15 10 -> /opt/log/tomcat/localhost.2016-02-12.log (deleted)
 l-wx------ 1 tomcat tomcat 64 Mar 21 21:15 11 -> /opt/log/tomcat/manager.2014-12-09.log (deleted)
 l-wx------ 1 tomcat tomcat 64 Mar 21 21:15 12 -> /opt/log/tomcat/host-manager.2014-12-09.log (deleted)
 l-wx------ 1 tomcat tomcat 64 Mar 21 21:15 2 -> /opt/log/tomcat/catalina.out (deleted)
 l-wx------ 1 tomcat tomcat 64 Mar 21 21:15 65 -> /opt/log/archive/athena.log.20160105-09 (deleted)
 l-wx------ 1 tomcat tomcat 64 Mar 21 21:15 9 -> /opt/log/tomcat/catalina.2015-01-17.log (deleted)

Null the specific files (here, we target the catalina.out file):

[root@importantserver1 usr]# cat /dev/null > /proc/23938/fd/2
Alternative ending

Instead of deleting the contents to recover the space, you might be in the situation where you need to recover the contents of the deleted file. If the application still has the file descriptor open on it, you can then recover the entire file to another one (dd if=/proc/23938/fd/2 of=/tmp/my_new_file.log) – assuming you have the space to do it!

Conclusion

While it’s best not to get in the situation in the first place, you’ll sometimes find yourself cleaning up after someone else’s good intentions. Now, instead of trying to find a window of “least disruption” to the service, you can recover the situation nicely. Or, if the alternative solution is what you’re after, you’ve recovered a file that you thought was long since gone.

Categories: DBA Blogs

Deploy Docker containers using AWS Opsworks

Wed, 2016-04-27 13:51
Introduction

This post is about how to deploy Docker containers on AWS using Opsworks and Docker Compose.
For AWS and Docker, no introduction is required, so let’s quickly introduce Opsworks and Docker Compose.

Opsworks

Opsworks is a great tool provided by AWS, which runs Chef recipes on your Instances. If the instance is an AWS instance, you don’t pay anything for using Opsworks, but you can also manage instances outside of AWS with a flat cost just by installing the Agent and registering the instance on Opsworks.

Opsworks Instances type

We have three different types of instances on Opsworks:

1. 24x7x365
Run non-stop.

2. Time based
Run on a predefined schedule, such as work hours.

3. Load based
Scale up and down according to preconfigured metrics.

You can find more details here.

Custom JSON

Opsworks provides Chef data bags (variables to be used in your recipes) via Custom JSON, and that’s the key to this solution. We will manage everything just by changing a JSON file. This file can easily become part of your development pipeline.

Life cycle

Opsworks has five life cycles:
1. Setup
2. Configure
3. Deploy
4. Undeploy
5. Shutdown
We will use setup, deploy, and shutdown. You can find more details about Opsworks life cycle here.

Docker Compose

Docker Compose was originally developed under the Fig project. Nowadays, Fig is deprecated and docker-compose ships as part of the Docker toolset.
Using docker-compose, you can manage all containers and their attributes (links, shared volumes, etc.) on a Docker host. Docker-compose can only manage containers on the local host where it is deployed; it cannot orchestrate Docker containers between hosts.
All configuration is specified inside a YML file.

Chef recipes

Using Opsworks, you will manage all hosts with just one small Chef cookbook. All the magic is in translating the Custom JSON from Opsworks into the YML file used by docker-compose.
The cookbook will install all components (Docker, pip, and docker-compose), translate the Custom JSON to the YML file, and send commands to docker-compose.

Hands ON

Let’s stop talking and see things happen.

We can split it into five steps:

  1. Resources creation
    1. Opsworks Stack
        1. Log into your AWS account
        2. Go to Services -> Management Tools -> Opsworks
          Accessing Opsworks menu
        3. Click on Add stack (if you already have stacks on Opsworks) or Add your first stack (if it’s the first time you are creating stacks on opsworks)
        4. Select type Chef 12 stack
          Note: The Chef cookbook used in this example only supports Chef12
        5. Fill out stack information
          aws_opsworks_docker_image02
          Note:
          – You can use any name as stack name
          – Make sure VPC selected are properly configured
          – This solution supports Amazon Linux and Ubuntu
          – Repository URL https://bitbucket.org/tnache/opsworks-recipes.git
        6. Click on advanced if you want to change something. Changing “Use OpsWorks security groups” to No can be a good idea when you need to communicate with instances which are running outside of Opsworks
        7. Click on “Add stack”
    2. Opsworks layer
        1. Click on “Add a layer”
        2. Set Name, Short name and Security groups. I will use webserver

      Note:
      Use a simple name because we will use this name in the next steps.
      The name “web” is reserved for AWS internal use.

        1. Click on “Add layer”

      aws_opsworks_docker_image03

    3. Opsworks Instance
        1. Click on “Instances” on the left panel
        2. Click on “Add an instance”
        3. Select the size (instance type)
        4. Select the subnet
        5. Click on “Add instance”

      aws_opsworks_docker_image05

  2. Resources configuration
    1. Opsworks stack
        1. Click on “Stack” on the left panel
        2. Click on “Stack Settings”
        3. Click on “Edit”
        4. Find the Custom JSON field and paste the content of the file below

      custom_json_1

      1. Click on “Save”
    2. Opsworks layer
        1. Click on “Layers” on the left panel
        2. Click on “Recipes”
        3. Type docker-compose and press enter in the Setup field
        4. Type docker-compose::deploy and press enter in the Deploy field
        5. Type docker-compose::stop and press enter in the Shutdown field
        6. Click on “Save”

      aws_opsworks_docker_image04

  3. Start
    1. Start instance
        1. Click on start

      aws_opsworks_docker_image06

  4. Tests
    Note: Wait until the instance reaches the online state

      1. Open your browser and you should be able to see “It works!”
      2. Checking running containers

    aws_opsworks_docker_image07

  5. Management
      1. Change the Custom JSON to the file below (see Resources configuration => Opsworks stack)

    custom_json_2

      1. Click on “Deployments” on the left panel
      2. Click on “Run Command”
      3. Select “Execute Recipes” as “Command”
      4. Type “docker-compose::deploy” as “Recipes to execute”
      5. Click on “Execute Recipes”

    Note: Wait until the deployment finishes

      1. Checking running containers

    aws_opsworks_docker_image08

Categories: DBA Blogs

Percona Live Data Performance Conference 2016 Retrospective

Tue, 2016-04-26 08:39

 

Last week the annual Percona Live Data Performance Conference was held in Santa Clara, California. This conference is a great time to catch up with the industry, and be exposed to new tools and methods for managing MySQL and MongoDB.

Sessions

The highlights from this year’s sessions and tutorials centered around a few technologies:

  • The typical sessions for Galera Cluster and Performance Schema are always getting better, along with visualization techniques.
  • Oracle MySQL’s new Document Store blurs the line between RDBMS and NoSQL.
  • Facebook’s RocksDB is getting smaller and faster.
  • ProxySQL, the new proxy kid on the block, promises to address MySQL scalability issues.
  • If security is a concern, which it should be, Hashicorp’s Vault project would be something to look into for managing MySQL secrets or encrypting data in transit.
  • MongoDB was a hot topic as well, with a number of sessions addressing management of environments and design patterns.

I expect to see an influx of articles regarding ProxySQL and MySQL’s Document Store in the next few months.

Networking

The evenings were also great events for networking and socializing, giving attendees the chance to rub shoulders with some of the most successful ‘WebScale’ companies to hear stories from the trenches. Events included the Monday Community Networking Reception and Wednesday’s Game Night.

Thank you to all those who attended the Annual Community Dinner at Pedro’s organized by Pythian on Tuesday night! We had a blast and we hope you did as well.

Community Dinner At Pedro's

Thank you!

Pythian sponsored and provided a great range of sessions this year, and we want to thank all those who stopped by our booth or attended our sessions.

I’d like to give a huge shout-out to Percona for continuing to organize a high-quality MySQL user conference focused on solving some of the toughest technical issues that can be thrown at us, and an equal shout-out to the other sponsors and speakers that play a huge part in making this conference happen.

I am looking forward to what PerconaLive Europe will bring this fall, not to mention what we can expect next year when Percona Live Santa Clara rolls around again.

Categories: DBA Blogs

Improve Parsing and Query Performance – Fix Oracle’s Fixed Object Statistics

Mon, 2016-04-25 20:50

What do I mean by ‘fix’ the fixed object statistics?  Simply gather statistics to help the optimizer.

What are ‘fixed objects’?  Fixed objects are the x$ tables and associated indexes that data dictionary views are based on.  In this case we are interested in the objects that make up the v$sqlstats and v$sql_bind_capture views.

If you’ve never before collected statistics on Oracle fixed objects, you may be wondering why you should bother with it, as everything appears to be fine in your databases.

After seeing an example you may want to schedule a time to collect these statistics.

Searching for SQL

Quite recently I was looking for recently executed SQL, based on the most recently captured bind variables.

select  sql_id, sql_fulltext
from v$sqlstats
where sql_id in (
   select  distinct sql_id
   from (
      select sql_id, last_captured
      from (
         select sql_id, last_captured
         from V$SQL_BIND_CAPTURE
         order by last_captured desc nulls last
      )
      where rownum <= 20
   )
)

I ran the query and was distracted for a few moments.  When I next looked at the terminal session where this SQL was executing, no rows had yet been returned.

Thinking that maybe ‘SET PAUSE ON’ had been run, I pressed ENTER.  Nothing.

From another session I checked for waits in v$session_wait.  Nothing there either.  If the session is not returning rows, and not registering an event in v$session_wait, then it must be on CPU.

This didn’t seem an ideal situation, and so I stopped the query with CTRL-C.

The next step was to run the query on a smaller and not very busy 11.2.0.2 database.  This time I saw that rows were being returned, but very slowly.

So now it was time to trace the execution and find out what was going on.

alter session set tracefile_identifier='JKSTILL';

set linesize 200 trimspool on

alter session set events '10046 trace name context forever, level 12';

select  sql_id, sql_fulltext
from v$sqlstats
where sql_id in (
   select  distinct sql_id
   from (
      select sql_id, last_captured
      from (
         select sql_id, last_captured
         from V$SQL_BIND_CAPTURE
         order by last_captured desc nulls last
      )
      where rownum <= 20
   )
)
/

alter session set events '10046 trace name context off';

exit

Coming back to this several minutes later, the resulting trace file was processed with the Method R Profiler to find out just where the time was going.

 

[Method R profile: elapsed time by event]

 

The ‘SQL*Net message from client’ event can be ignored; most of that time was accumulated waiting for me to come back and exit sqlplus.  While the script example shows that the 10046 trace was turned off and the session exited, I had forgotten to include those two lines for this first run.

No matter, as the interesting bit is the next line, ‘CPU: FETCH dbcalls’.  More than 6 minutes was spent fetching a few rows, so clearly something was not quite right. The SQL plan in the profile showed what the problem was, as the execution plan was far less than optimal. The following is the execution plan from AWR data:

 

  1  select *
  2  from TABLE(
  3     dbms_xplan.display_awr(sql_id => :sqlidvar, plan_hash_value => 898242479, format => 'ALL ALLSTATS LAST')
  4*    )
sys@oravm1 SQL- /

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------------------
SQL_ID 4h7qfxa9t1ukz
--------------------
select  sql_id, sql_fulltext from v$sqlstats where sql_id in (  select
distinct sql_id         from (          select sql_id, last_captured            from (
   select sql_id, last_captured from V$SQL_BIND_CAPTURE order by
last_captured desc nulls last           )               where rownum <= 20      ) )

Plan hash value: 898242479

-------------------------------------------------------------------------------------------
| Id  | Operation                 | Name         | E-Rows |E-Bytes| Cost (%CPU)| E-Time   |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |              |        |       |     1 (100)|          |
|   1 |  FILTER                   |              |        |       |            |          |
|   2 |   FIXED TABLE FULL        | X$KKSSQLSTAT |      1 |  2023 |     0   (0)|          |
|   3 |   VIEW                    |              |      1 |     8 |     1 (100)| 00:00:01 |
|   4 |    COUNT STOPKEY          |              |        |       |            |          |
|   5 |     VIEW                  |              |      1 |     8 |     1 (100)| 00:00:01 |
|   6 |      SORT ORDER BY STOPKEY|              |      1 |    43 |     1 (100)| 00:00:01 |
|   7 |       FIXED TABLE FULL    | X$KQLFBC     |      1 |    43 |     0   (0)|          |
-------------------------------------------------------------------------------------------

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------

   1 - SEL$88122447
   2 - SEL$88122447 / X$KKSSQLSTAT@SEL$4
   3 - SEL$6        / from$_subquery$_002@SEL$5
   4 - SEL$6
   5 - SEL$FEF91251 / from$_subquery$_003@SEL$6
   6 - SEL$FEF91251
   7 - SEL$FEF91251 / X$KQLFBC@SEL$10

Note
-----
   - Warning: basic plan statistics not available. These are only collected when:
       * hint 'gather_plan_statistics' is used for the statement or
       * parameter 'statistics_level' is set to 'ALL', at session or system level


39 rows selected.

 

While useful, this plan is not giving much information about why this took so long.  If pressed I would just whip up a Bash and Awk one-liner  to parse the trace files and find out where this time was going.  In this case though I could just consult the Method R profile again.

 

[Method R profile: rows fetched from X$KQLFBC]

 

Yikes!  There were 106.3E6 rows returned from X$KQLFBC.

Collecting the Fixed Object Statistics

Rather than spend time analyzing this further, it seemed that here was a clear case for collecting statistics on fixed objects in the database.  The following SQL was run:

 

exec dbms_stats.gather_fixed_objects_stats
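
To verify that the statistics were gathered (a quick check, assuming you have access to the DBA views), the fixed tables appear in dba_tab_statistics with an object type of FIXED TABLE:

SELECT owner, table_name, last_analyzed
FROM   dba_tab_statistics
WHERE  object_type = 'FIXED TABLE'
AND    table_name IN ('X$KQLFBC', 'X$KKSSQLSTAT')
ORDER BY table_name;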

 

The next step was to rerun the query.  This time it ran so quickly I wondered if it had even worked.  As before, tracing had been enabled, and a profile generated from the trace. There was now quite an improvement seen in the execution plan:

 

[Method R profile: improved execution plan after gathering fixed object statistics]

 

The 0.0047 seconds required to return 442 rows from X$KQLFBC was quite a reduction from the previously seen time of nearly 396 seconds.

Why This Is Important

This issue came to light due to a custom query I was running. The optimizer will probably never run that same query, but it was clear that the fixed object statistics needed to be updated.

Now imagine your customers using your application; they may be waiting on the database for what seems like an eternity after pressing ENTER on a web form.  And what are they waiting on?  They may be waiting on the optimizer to evaluate a SQL statement and determine the best plan to use.  The reason for the waiting in this case would simply be that the DBA has not taken steps to ensure the optimizer has the correct information to effectively query the database’s own metadata.   Until the optimizer has the correct statistics, performance of query optimization will be sub-optimal.  In a busy system this may result in mutex waits suddenly showing as a top event in AWR reports.  Troubleshooting these waits can be difficult as there are many possible causes of them.

Do your customers, your database and yourself a favor – include updates of fixed table statistics in your regular database maintenance schedule, and avoid a possible source of performance problems.

Categories: DBA Blogs

When the default value is not the same as the default

Mon, 2016-04-25 11:39

I was working on a minor problem recently where controlfile autobackups were written to the wrong location during rman backups. Taking controlfile autobackups is generally a good idea, even if you configure controlfile backups yourself. Autobackups also include an spfile backup which, though not critical for a restore, is still convenient to have. And autobackups are taken not only after backups but, more importantly, every time you change the physical structure of your database, like adding or removing datafiles and tablespaces, which would make a restore with an older controlfile a lot harder.

What happened in this case was that the CONTROLFILE AUTOBACKUP FORMAT parameter was changed from the default ‘%F’ to the value ‘%F’. Yes, the values are the same. But setting a value and not leaving it at the default changed the behaviour of those autobackups. Where by default ‘%F’ means writing to the flash recovery area, explicitly setting the format parameter to ‘%F’ will save the autobackup to the folder $ORACLE_HOME/dbs/.

See for yourself. This shows an autobackup while the parameter is set to the default and as expected, the autobackup is written to the flash recovery area. So that is the correct location but the filename is a bit off. It should be c-DBID-YYYYMMDD-SERIAL.

RMAN> SHOW CONTROLFILE AUTOBACKUP FORMAT;

RMAN configuration parameters for database with db_unique_name CDB1 are:
CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '%F'; # default

RMAN> backup spfile;

Starting backup at 18-APR-16
using channel ORA_DISK_1
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
including current SPFILE in backup set
channel ORA_DISK_1: starting piece 1 at 18-APR-16
channel ORA_DISK_1: finished piece 1 at 18-APR-16
piece handle=/u01/app/oracle/fast_recovery_area/CDB1/backupset/2016_04_18/o1_mf_nnsnf_TAG20160418T172428_ckb62f38_.bkp tag=TAG20160418T172428 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
Finished backup at 18-APR-16

Starting Control File and SPFILE Autobackup at 18-APR-16
piece handle=/u01/app/oracle/fast_recovery_area/CDB1/autobackup/2016_04_18/o1_mf_s_909509070_ckb62gko_.bkp comment=NONE
Finished Control File and SPFILE Autobackup at 18-APR-16

Now we set the format string explicitly to ‘%F’ and observe that the autobackup is not written to the FRA but to $ORACLE_HOME/dbs. At least it has the filename we were expecting.

RMAN> CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '%F';

new RMAN configuration parameters:
CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '%F';
new RMAN configuration parameters are successfully stored

RMAN> backup spfile;

Starting backup at 18-APR-16
using channel ORA_DISK_1
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
including current SPFILE in backup set
channel ORA_DISK_1: starting piece 1 at 18-APR-16
channel ORA_DISK_1: finished piece 1 at 18-APR-16
piece handle=/u01/app/oracle/fast_recovery_area/CDB1/backupset/2016_04_18/o1_mf_nnsnf_TAG20160418T172447_ckb62z7f_.bkpx tag=TAG20160418T172447 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
Finished backup at 18-APR-16

Starting Control File and SPFILE Autobackup at 18-APR-16
piece handle=/u01/app/oracle/product/12.1.0.2/db_1/dbs/c-863887021-20160418-04 comment=NONE
Finished Control File and SPFILE Autobackup at 18-APR-16

RMAN> SHOW CONTROLFILE AUTOBACKUP FORMAT;

RMAN configuration parameters for database with db_unique_name CDB1 are:
CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '%F';

This is like Schrödinger’s parameter, where you can either get the correct location or the correct name, but not both. To be fair, not assigning the right name to the autobackup in the FRA does not matter much because the files will be found during a restore anyway.

At this point it is good to remember how to use CLEAR to reset a parameter to its default instead of just setting the default value.

RMAN> CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK CLEAR;

old RMAN configuration parameters:
CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '%F';
RMAN configuration parameters are successfully reset to default value

RMAN> SHOW CONTROLFILE AUTOBACKUP FORMAT;

RMAN configuration parameters for database with db_unique_name CDB1 are:
CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE SBT_TAPE TO '%F'; # default
CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '%F'; # default

I have tested this in versions 10g, 11g and 12.1.0.2 with the same result. The behaviour is also not unknown. In fact, bug 4248670 was logged against this in 2005 but has not been resolved so far. My Oracle Support does mention the above workaround of clearing the parameter in note 1305517.1 though.

Categories: DBA Blogs

MySQL Query Best Practices

Mon, 2016-04-25 11:30

You can get many results from a Google search for “MySQL Query Best Practices” or “MySQL Query Optimization.” The drawback is that too many rules can provide confusing or even conflicting advice. After doing some research and tests, I outlined the essential and important ones below:

1) Use proper data types

1.1) Use the smallest data types if possible

MySQL tries to load as much data as possible into memory (innodb-buffer-pool, key-buffer), so a smaller data type means more rows of data in memory, thus improving performance. Also, smaller data sizes reduce disk I/O.

1.2) Use Fixed-length Data Types if Possible

MySQL can calculate quickly the position of a fixed-length column in a specific row of a table.

With the flexible-length data type, the row size is not fixed, so every time it needs to do a seek, MySQL might consult the primary key index. However, the flexible-length data type can save data size, and the disk space required.

In practice, if the column data size varies a lot, then use a flexible-length data type (e.g., varchar); if the data length is short or length barely changes, then use a fixed data type.

1.3) Use not null unless there is reason not to

It is harder for MySQL to optimize queries that refer to nullable columns, because they make indexes, index statistics, and value comparisons more complicated. A nullable column uses more storage space and requires special processing inside MySQL.

When a nullable column is indexed, it requires an extra byte per entry and can even cause a fixed-size index (e.g., an index on a single integer column) to be converted to a variable-sized one in MyISAM.
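
Putting 1.1–1.3 together, here is a minimal sketch of a table definition; the table and columns are purely illustrative:

CREATE TABLE page_views (
  id         INT UNSIGNED NOT NULL AUTO_INCREMENT,  -- small, fixed-length numeric key
  user_id    INT UNSIGNED NOT NULL,
  country    CHAR(2)      NOT NULL,                 -- short, fixed-length code
  url        VARCHAR(255) NOT NULL,                 -- length varies a lot: flexible type
  viewed_at  DATETIME     NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;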

2) Use indexes smartly

2.1) Make primary keys short and on meaningful fields

A shorter primary key will benefit your queries, because the smaller your primary key, the smaller the index, and the fewer pages in the cache. In addition, a numeric type is preferred because numeric types are stored in a much more compact format than character formats, which also keeps the primary key shorter.

Another reason to keep the primary key short is that we usually use the primary key to join with other tables.

It is a good idea to use a primary key on a meaningful field, because MySQL uses a clustered index on the primary key. We often need only the info in the primary key, and especially when joining with other tables, MySQL will then only search in the index without reading from the data file on disk, which benefits performance. When you use a meaningful field as the primary key, make sure the uniqueness of the field won’t change; otherwise, if you ever have to change the primary key, it might affect all the tables using it as a foreign key.

2.2) Index on the search fields only when needed

Usually we add indexes on the fields that frequently show up in a where clause — that is the purpose of indexing. But while an index will benefit reads, it can make writes slower (inserting/updating), so index only when you need it and index smartly.

2.3) Index and use the same data types for join fields

MySQL can do joins on different data types, but the performance is poor as it has to convert from one type to the other for each row. Use the same data type for join fields when possible.

2.4) Use a composite index if your query has more than one field in the where clause

When the query needs to search on multiple columns of a table, it might be a good idea to create a compound index for those columns. This is because with composite index on multiple columns, the search will be able to narrow down the result set by the first column, then the second, and so on.

Please note that the order of the columns in the composite index affects the performance, so put the columns in the order of the efficiency of narrowing down the search.
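
As an illustration on a hypothetical orders table, here is a composite index together with a query that can use it and one that cannot:

-- Columns ordered by how well they narrow down the result set for typical queries
ALTER TABLE orders ADD INDEX idx_cust_status_date (customer_id, status, created_at);

-- Can use the index (matches the leftmost prefix of the composite index)
SELECT * FROM orders WHERE customer_id = 42 AND status = 'shipped';

-- Cannot use the index efficiently (the leftmost column is missing from the WHERE clause)
SELECT * FROM orders WHERE status = 'shipped';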

2.5) Covering index for most commonly used fields in results

In some cases, we can put all the required fields into an index (i.e., a covering index) with only some of the fields in the index used for searching and the others for data only. This way, MySQL only need to access the index and there is no need to search in another table.
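
Continuing with the hypothetical orders table, a covering index could look like the sketch below; EXPLAIN would show “Using index” because the query can be answered from the index alone:

-- customer_id is used for searching; status and total are carried along as data
ALTER TABLE orders ADD INDEX idx_cust_covering (customer_id, status, total);

SELECT status, total
FROM   orders
WHERE  customer_id = 42;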

2.6) Partial index for long strings or TEXT, BLOB data types by index on prefix

There is a size limitation for indexes (by default, 1000 for MyISAM, 767 for InnoDB). If the prefix part of the string already covers most of the unique values, it is good to just index the prefix part.
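
A sketch of a prefix index on a hypothetical email column, plus a quick way to check how selective a given prefix length would be:

-- How many distinct values does a 12-character prefix preserve? Closer to 1 is better.
SELECT COUNT(DISTINCT LEFT(email, 12)) / COUNT(*) AS prefix_selectivity
FROM customers;

-- Index only the first 12 characters of the column
ALTER TABLE customers ADD INDEX idx_email_prefix (email(12));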

2.7) Avoid over-indexing

Don’t index low-cardinality columns; MySQL will choose a full table scan rather than use the index if it estimates it would have to scan more than roughly 30% of the index.

If a field already exists in the first field of a composite index, you may not need an extra index on the single field. If it exists in a composite index but not in the leftmost field, you will usually need a separate index for that field only if required.

Bear in mind that indexing will benefit in reading data but there can be a cost for writing (inserting/updating), so index only when you need it and index smartly.

3) Others

3.1) Avoid SELECT *

There are many reasons to avoid select * from… queries. First, it can waste time to read all the fields if you don’t need all the columns. Second, even if you do need all columns, it is better to list all the field names, to make the query more readable. Finally, if you alter the table by adding/removing some fields, and your application uses select * queries, you can get unexpected results.

3.2) Prepared Statements

Prepared statements send the query text and the bound values to the server separately, which is great for protecting your application against SQL injection attacks.

When the same query is being used multiple times in your application, you can assign different values to the same prepared statement, yet MySQL will only have to parse it once.
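
Here is a server-side sketch of the same idea on a hypothetical users table; connectors such as JDBC or PDO expose the equivalent through their own prepared-statement APIs:

PREPARE find_user FROM 'SELECT id, name FROM users WHERE id = ?';
SET @uid = 42;
EXECUTE find_user USING @uid;
SET @uid = 43;
EXECUTE find_user USING @uid;   -- same statement, parsed only once
DEALLOCATE PREPARE find_user;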

3.3) If you want to check the existence of data, use EXISTS instead of SELECT COUNT

To check if data exists in a table, select exists (select * …) from the table will perform better than select count, since the first method returns a result as soon as it finds one row of the required data, while the second one has to count over the whole table/index.
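
For example, on a hypothetical orders table:

-- Stops as soon as one matching row is found
SELECT EXISTS (SELECT 1 FROM orders WHERE customer_id = 42) AS has_orders;

-- Has to count every matching row before it can return
SELECT COUNT(*) FROM orders WHERE customer_id = 42;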

3.4) Use select limit [number]

Select … limit [number] will return only the required number of rows. Including the limit keyword in your SQL queries can bring performance improvements.

3.5) Be careful with persistent connections

Persistent connections can reduce the overhead of re-creating connections to MySQL. When a persistent connection is created, it will stay open even after the script finishes running. The drawback is that it might run out of connections if there are too many connections remaining open but in sleep status.

3.6) Review your data and queries regularly

MySQL will choose the query plan based on the statistics of the data in the tables. When the data size changes, the query plan might change, and so it is important to check your queries regularly and to make optimizations accordingly. Check regularly by:

3.6.1) EXPLAIN your queries (see the example after this list)

3.6.2) Get suggestions with PROCEDURE ANALYSE()

3.6.3) Review slow queries
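
As a starting point for 3.6.1, an EXPLAIN against a hypothetical join shows which indexes are chosen and how many rows MySQL estimates it will examine:

EXPLAIN
SELECT o.id, o.total
FROM   orders o
JOIN   customers c ON c.id = o.customer_id
WHERE  c.email = 'someone@example.com';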

Categories: DBA Blogs

Proud to Work at Pythian, One of Canada’s Top 25 ICT Professional Services Companies

Fri, 2016-04-22 13:18

It’s only four months into 2016, and there’s a lot to be excited about. In addition to moving Pythian’s global headquarters in Ottawa, Canada to the hip and happening neighbourhood of Westboro, we’ve been receiving accolades for being one of Canada’s top ICT professional services companies, and a great place to work. Following are three reasons to be proud to work at Pythian.

In April Pythian was recognized as one of Canada’s Top 25 Canadian ICT Professional Services Companies on the prestigious Branham300 list. We also appeared on the Top 250 Canadian ICT Companies list for the second year in a row.

The Branham300 is the definitive listing of Canada’s top public and private ICT companies, as ranked by revenues. Not too many people can say that they work at a company that is one of the Top 25 ICT Professional Services Companies in Canada.

In February, our CEO Paul Vallée was named “Diversity Champion of the Year” by Women in Communications and Technology (WCT). In 2015 Pythian launched the Pythia Project, a corporate initiative designed to increase the percentage of talented women who work and thrive at Pythian, especially in tech roles. A new metric called the “Pythia Index” was also introduced. It measures the proportion of people in a business, or in a team, who are women leaders or report to a woman leader. Pythian was also the first Canadian tech company to release its gender stats, and invite other Canadian tech companies to join in the battle against “bro culture”. Stay tuned for more news on the Pythia program in the coming months.

And last, but not least, in March, Pythian was selected as one of Canada’s Top Small & Medium Employers for 2016. This award recognizes small and medium employers with exceptional workplaces and forward-thinking human resource policies. Everyone that works at Pythian is aware of the amazing benefits, but there is a hard working team that really goes the extra mile to make the company a great place to work. Thank you.

Clearly 2016 is off to a fantastic start! I’m looking forward to more good news to share.

Categories: DBA Blogs
