Skip navigation.

Kubilay Çilkara

Syndicate content
Database Systems is a blog about Databases, Oracle, Salesforce and Data IntegrationKubilay Tsil Kara
Updated: 1 hour 52 min ago

The Importance of Data Virtualization

Fri, 2015-09-04 11:37
It’s been a long time since the way data was stored and accessed has been addressed. We went from scrolls to books to mainframes and this last method hasn’t budged all that much over the last decade or so. This is despite the fact that we keep creating more and more information, which means that better ways for storing and finding it would make a world of difference. Fortunately, this is where data virtualization comes into play. If your company isn’t currently using this type of software, you’re missing out on a better way of leveraging your organization’s data.

Problems with Traditional Methods

No matter what line of work you’re in, your company is creating all kinds of information each and every business day. We’re not just talking about copy for your website or advertising materials either. Information is created with each and every transaction and just about any time you interact with a customer.

You’re also not just creating information in these moments. You need to access stored data too. If you don’t do this well—and many, many companies don’t—you’re going to end up with a lot of unnecessary problems. Of course, you’re also wasting a lot of money on all that info you’re creating but not using.

The main problem with traditional methods of data storage and retrieval is that they rely on movement and replication processes and intermediary servers and connectors for integrating via a point-to-point system. This worked for a long time. For a few companies, it may seem like it’s still working to some degree.

However, there’s a high cost to this kind of data movement process. If you’re still trying to shoulder it, chances are your overhead looks a lot worse than competitors that have moved on.That’s not the only problem worth addressing, though. There’s also the huge growth of data that’s continuing to head north. We’ve touched on this, but the problem is so prevalent that there’s a term for it throughout every industry: Big Data. Obviously, it refers to the fact that there is just so much information out there, but the fact this term exists also shows how much it affects all kinds of companies.

Big Data is also a statement about its creation. Its sheer proliferation is massive. Every year, the amount increases at an exponential rate. There’s just no way the old methods for storing and finding it will work.

Finally, customers expect that you’ll be able to find whatever information you need to meet their demands. Whether it’s actual information they want from you or it’s just information you need to carry out their transaction, most won’t understand why this isn’t possible. After all, they use Google and other search engines every day. If those platforms can find just about anything for free, why can’t your company do the same?

The Solution Is Data Virtualization

If all of the above sounded a bit grim, don’t worry. There’s a very simple solution waiting for you in data virtualization. This type of software overcomes the challenges that come with your data being stored all over your enterprise. You never again have to run a million different searches to collect it all or come up with some halfway decent workaround. Finally, there’s a type of platform made specifically for your company’s data gathering needs.This software isn’t just effective. It’s convenient too. You work through one, single interface and can have access to the entirety of your company’s data store. What it does is effectively separate the external interface from the implementation going on inside.

How It Works

Every data virtualization platform is going to have its own way of working, so it’s definitely worth researching this before putting your money behind any piece of software. However, the basic idea remains the same across most titles. With virtualization software, the data doesn’t have to be physically moved because this technology uses meta data to create a virtual perspective of the sources you need to get to. Doing so provides a faster and much more agile way of getting to and combining data from all the different sources available:·      Distributed ·      Mainframe ·      Cloud·      Big DataAs you can probably tell, this makes it much easier to get your hands on important information than the way you’re currently doing it.

Finding the Right Title for Your Company

Although they all serve the same purpose and the vast majority will follow the outline we just went through, there are still enough virtualization software titles out there that it’s worth thinking about what your best option will look like. As with any type of software, you don’t want to invest your money anywhere it’s not going to do the most good.The good news is that this software has been around long enough that there have been best practices established for how it should work. First, you want to look for a title that will actually transform the way your mainframe works by leveraging what it can do for every search. This is just a good use of your resources and gives you more bang for your buck as far as your mainframe is concerned. Software that uses it for virtualization purposes is going to work even better than a distribution-based application and won’t cost any more.

However, it’s also going to work a lot better for that price too. A lot of companies also love that this way of carrying out a search is more secure as well. The last thing you want is to pay for a piece of software that’s actually going to leave you worse off.

Secondly, although this may sound complex, you can and should find a solution that keeps things simple. Data integration can be straightforward with the method we just described without any need for redundant processes that would slow down your ability to scale up.

With data virtualization, there is no downside. Furthermore, it’s becoming more and more of a necessity. Companies are going to have to invest in this software as Big Data continues to get bigger.

Mike Miranda writes about enterprise software and covers products offered by software companies like Rocket Software.about topics such as Terminal Emulation, Legacy Modernization, Enterprise Search, Big Data and Enterprise Mobility.

Categories: DBA Blogs

Don’t Settle When It Comes to Enterprise Search Platforms

Fri, 2015-09-04 11:32
No company should try to operate for very long without an enterprise search platform. With so many options out there, though, you could be confused about which one is worth a spot in your organization’s budget. Let’s look at two very common workarounds some have tried, and then we will talk about why you must go with a reputable developer when you make your final decision.
Leveraging Web Search Engines
One way a lot of companies have handled their need for an enterprise search platform is to simply use one like Google or some other web engine. On paper, this may seem to make a lot of sense. After all, who doesn’t trust Google? Most of us use it every single day and would be lost without the search engine giant. If you can leverage its power for your company’s needs, that would seem like a no-brainer, right?
Unfortunately, if you try this, you’ll learn the hard way that Google leaves a lot to be desired when it comes to this type of an environment. That’s not to say it won’t work; it just won’t work well and definitely not nearly as well as you want for a search engine working your company’s internal systems. Google and other web search engines are fantastic at what they’re designed to do. They are not designed to work in an enterprise search environment though. For that, you need a true enterprise search platform.

The major problem is that web search engines are all about sheer volume, which makes a lot of sense once you think about it. When you want to find a pizza parlor in your city, you want to know about every single option and then some. Google’s ability to harvest this much information and present it quickly is one of the reasons it’s so popular. Any search engine that can’t deliver this kind of volume is going to alienate users and soon be left on the scrap heap.

What web search engines like Google don’t do well, though, is carry out deep, detail-oriented searches. Again, this makes sense. Google is driven largely by keywords and backlinks. These are “surface searches” that weren’t created to do some of the tasks you need an enterprise search platform form.

Your employees may do some very simple searches, but they probably also have some pretty demanding queries too. Enterprise search platforms are designed to handle these types of searches and drill down to the last detail in any file or piece of data they come across. If your employees aren’t able to do these types of searches on a regular basis, the search software you invested in will be a waste of money. Worse, it could land you with all kinds of other problems because your people will think they’re doing the best possible search they can and concluding that what they want can’t be found.

Also, don’t forget that your company stores information in different places. Yes, it may all be on the same server, but in the digital world, there are various “silos” that hold onto information. Each silo is its own environment with its own rules. When you try using a web search engine to look through all your company’s silos, what will most likely happen is that it will have to go through one at a time. This is far from ideal, to say the least.

If you have a good number of silos, your employees will most likely give up. They won’t want to walk the search engine from one silo to the next like they’re holding onto the leash of a bloodhound. The whole point of a search engine is that it’s supposed to cut down on the exhaustive amount of “manual” work you’d otherwise have to do to find the data you need.

Silos aren’t all the same, so you want a search program that can go in and out of the type you have without requiring the employee to reconfigure their query.

Open Source Software

Another very popular method of acquiring enterprise search software is to go with an open source title. Once again, on paper, this seems like a very logical route to take. For one thing, the software is free. You can’t beat that price, especially when it comes to an enterprise-level platform. This usually seems like an unbeatable value for small- and medium-sized businesses that don’t have a lot of leeway where their budget is concerned.

That’s just one benefit that proponents of open source search engines tout though. There’s also the fact that you can modify the software as you see fit. This gives the user the ability to basically mold the code into a result that will fit their company’s needs like a glove.

There are a couple problems though. The first is that you get what you pay for. If your open source software doesn’t deliver, you have no one to complain to. You can always do your research to find a title that others have given positive reviews to. At the end of the day, though, don’t expect much in the way of support. There are plenty of forums to go through to find free advice, but that’s not the type of thing most professionals want to wade through.

When you go with a true, professional version, your employees will never be far from help if something goes wrong. Most companies these days have email and phone lines you can use, but many also have chat boxes you can open up on their website. None of these will be available for your company if you go with open source software.

Also, you can definitely modify the search engine software, but this isn’t necessarily unique to open-source platforms. Professional search platforms can be modified a number of ways, allowing the user to streamline their software and fine-tune the result so they get relevant results again and again.This type of architecture, known as a pipeline, is becoming more and more the standard in this industry. Enterprise platforms come with all kinds of different search features, but that can also be a problem if they start getting in the way of one another. To ensure there are never too many cooks in the kitchen, pipeline architecture is used to line them all up, one in front of the other. By doing so, you’ll have a much easier time getting the search results you want, especially because you can just reconfigure these features as you see fit whenever you like.

Ongoing Updates Are Yours

One very important aspect of professional enterprise search platforms that is worth pointing out is that most developers are constantly putting out updates for their product. This is the same thing web search engines do, of course. Google, Yahoo and Bing all release upgrades constantly. The difference, however, is that enterprise platforms get upgrades that are specific to their purposes. While there are updates for open source software, expect sporadic results. The developer of your favourite title could give up and go on to another project, leaving you to look for someone else to continue creating great updates.

If you have a skilled developer who is familiar with open source search engines on your team, this may be an attractive option. Still, most will find this route is just too risky. Most of us also don’t have that kind of developer on staff and it wouldn’t be worth it to hire someone on specifically for this reason (it’d be much more affordable to just buy professional software). Also, remember that, even if you do have this kind of talent within your ranks, you’ll soon become completely beholden to them if you start trusting them with this kind of job. Having someone who is completely responsible for your search engine being able to work and not having someone else on staff who can offer support or replace them is not a good idea.

Scalability Is a Given
Every company understands how important scalability is. This is especially true when it comes to software though. The scalability of a program can really make or break its value. Either it will turn into a costly mistake that greatly holds your business back or it will become the type of agile asset you actually take for granted, it’s so helpful.

Open source platforms are only as scalable as their code allows, so if the person who first made it didn’t have your company’s needs in mind, you’ll be in trouble. Even if they did, you could run into a problem where you find out that scaling up actually reveals some issues you hadn’t encountered before. This is the exact kind of event you want to avoid at all costs.

Now that you realize the importance of going with a reputable developer, your next step is picking which one to choose. You definitely won’t lack for options these days, so just take your time to ensure you go with the best one for your business.
Mike Miranda writes about enterprise software and covers products offered by software companies like about topics such as Terminal Emulation, Legacy Modernization, Enterprise Search, Big Data and Enterprise Mobility.
Categories: DBA Blogs

Query existing HBase tables with SQL using Apache Phoenix

Thu, 2015-07-02 13:25
Spending a bit more time with Apache Phoenix and reading again my previous post I realised that you can use it to query existing HBase tables. That is NOT tables created using Apache Phoenix, but HBase - the columnar NoSQL database in Hadoop.

I think this is cool as it gives you the ability to use SQL on an HBase table.

To test this, let's say you login to HBase and you create an HBase table like this:

> create 'table2', {NAME=>'cf1', VERSIONS => 5}

The table2 is a simple table in HBase with one column family cf1 and now let's put some data to this HBase table.

> put 'table2', 'row1', 'cf1:column1', 'Hello SQL!'

then maybe add another row

> put 'table2', 'row4', 'cf1:column1', 'London'

Now, in Phoenix all you will have to do is create a database View for this table and query it with SQL. The database View will be read-only.  How cool is that, you don't even need to physically create the table or move the data to Phoenix or convert it, a database view will be sufficient and via Phoenix you can query the HBase table with SQL.

In Phoenix you create the view for the table2 using the same name. As you can see below the DDL used to create the view is case sensitive and if you created your HBase table name in lower case you will have to put the name in between double quotes.

So login to Phoenix and create the "table2" view like this:

> create view "table2" ( pk VARCHAR PRIMARY KEY, "cf1"."column1" VARCHAR );

And here is how you then query it in Phoenix:

SQL Query on Phoenix
Tremendous potential here, imagine all those existing HBase tables which now you can query with SQL. More, you can point your Business Intelligence tools and Reporting Tools and other tools which work with SQL and query HBase as if it was another SQL database.

A solution worth investigating further? It definitely got me blogging in the evenings again.

To find out more about Apache Phoenix visit their project page

Categories: DBA Blogs

Apache Phoenix, SQL is getting closer to Big Data

Tue, 2015-06-30 15:50

Here is a post about another project in the Big Data world, like Apache Hive from my previous post, enables you to do SQL on Big Data. It is called Apache Phoenix.

Phoenix is a bit different, a bit closer to my heart too, as I read the documentation on Apache Phoenix, the word 'algebra' and 'relational algebra' came across few times, and that mean only one thing, SQL! The use of the word algebra in the docs did give me a lot of confidence. SQL has closure, is based on a database systems model which has it's roots in logic and maths and especially a subset of algebra, The Set Theory.

Apache Phoenix is developed in Salesforce and is now one of the popular projects in Apache. Apache Phoenix is a SQL skin on top of HBase, the columnar (NoSQL) database of the Hadoop ecosystem, capable of storing very large tables and data and query them via 'scans'. HBase is part of the Hadoop ecosystem and the file system it uses is usually HDFS. Apache Phoenix is using JDBC on the client as a driver.

In the race to bring the easiest to use tools for Big Data, I think Apache Phoenix is very close. It is the SQL we know used since the 1970s. The Apache Phoenix team seems to be committed and willing to introduce all of the missing parts of SQL, including transaction processing with different isolation levels.  Making Phoenix a fully operational Relational Database layer on HBase. Have a look in their roadmap. The amount of current and suggested future SQL compatibility is remarkable, and this makes me take them really seriously.
  • Transactions
  • Cost-based Query Optimization! (Wow)
  • Joins
  • OLAP
  • Subqueries
  • Striving for full SQL-92 compliance
In addition to all this, it is also possible to turn an existing HBase table to an Apache Phoenix table using CREATE TABLE or even CREATE VIEW, the DDL statements that we know. How handy is that? Suddenly you can SQL enable your existing HBase database!
How to install and use Phoenix

The SQL skin can be installed to an existing Hadoop HBase installation very quickly. All you need to do is to download and extract the tarball. You can setup a standalone Hadoop environment, look at my previous blog post for that, and then install HBase and install Apache Phoenix
Once the Apache  Phoenix software is installed, then you can start it and query it with SQL like this.

From within the bin/ directory of Phoenix install directory run

$ ./  localhost

That will bring you to the phoenix prompt

0: jdbc:phoenix:localhost> select * from mytable;

Categories: DBA Blogs

Hive (HiveQL) SQL for Hadoop Big Data

Thu, 2015-06-25 13:30

In this  post I will share my experience with an Apache Hadoop component called Hive which enables you to do SQL on an Apache Hadoop Big Data cluster.

Being a great fun of SQL and relational databases, this was my opportunity to set up a mechanism where I could transfer some (a lot)  data from a relational database into Hadoop and query it with SQL. Not a very difficult thing to do these days, actually is very easy with Apache Hive!

Having access to a Hadoop cluster which has the Hive module installed on, is all you need. You can provision a Hadoop cluster yourself by downloading and installing it in pseudo mode on your own PC. Or you can run one in the cloud with Amazon AWS EMR in a pay-as-you-go fashion.

There are many ways of doing this, just Google it and you will be surprised how easy it is. It is easier than it sounds. Next find links for installing it on your own PC (Linux).  Just download and install Apache Hadoop and Hive from Apache Hadoop Downloads

You will need to download and install 3 things from the above link.

  • Hadoop (HDFS and Big Data Framework, the cluster)
  • Hive (data warehouse module)
  • Sqoop (data importer)
You will also need to put the connector of the database (Oracle, MySQL...) you want to extract data from in the */lib folder in your Sqoop installation. For example the MySQL JDBC connector can be downloaded from hereDon't expect loads of tinkering installing Apache Hadoop and Hive or Sqoop, just unzipping binary extracts and few line changes on some config files in directories, that's all. Is not a big deal, and is Free. There are tones of tutorials on internet on this, here is one I used from another blogger bogotobogo.

What is Hive?

Hive is Big Data SQL, the Data Warehouse in Hadoop. You can create tables, indexes, partition tables, use external tables, Views like in a relational database Data Warehouse. You can run SQL to do joins and to query the Hive tables in parallel using the MapReduce framework. It is actually quite fun to see your SQL queries translating to MapReduce jobs and run in parallel like parallel SQL queries we do on Oracle EE Data Warehouses and other databases. :0) The syntax looks very much like MySQL's SQL syntax.

Hive is NOT an OLTP transactional database, does not have transactions of INSERT, UPDATE, DELETE like in OLTP and doesn't conform to ANSI SQL and ACID properties of transactions.

Direct insert into Hive with Apache Sqoop:
After you have installed Hadoop and have hive setup and are able to login to it, you can use Sqoop - the data importer of Hadoop - like in the following command and directly import a table from MySQL via JDBC into Hive using MapReduce.
$  sqoop import -connect jdbc:mysql://mydatbasename -username kubilay -P -table mytablename --hive-import --hive-drop-import-delims --hive-database dbadb --num-mappers 16 --split-by id
Sqoop import options explained:
  •  -P will ask for the password
  • --hive-import which makes Sqoop to import data straight into hive table which it creates for you
  • --hive-drop-import-delims Drops \n\r, and \01 from string fields when importing to Hive. 
  • --hive-database tells it which database in Hive to import it to, otherwise it goes to the default database. 
  • --num-mappers number of parallel maps to run, like parallel processes / threads in SQL
  • --split-by  Column of the table used to split work units, like in partitioning key in database partitioning. 
The above command will import any MySQL table you give in place of mytablename into Hive using MapReduce from a MySQL database you specify.

Once you import the table then you can login to hive and run SQL to it like in any relational database. You can login to Hive in a properly configured system just by calling hive from command line like this:

$ hive

More Commands to list jobs:

Couple of other commands I found useful when I was experimenting with this:

List running Hadoop jobs

hadoop job -list

Kill running Hadoop jobs

hadoop job -kill job_1234567891011_1234

List particular table directories in HDFS

hadoop fs -ls mytablename

More resources & Links

Categories: DBA Blogs