I needed to write a new script that would run on a Red Hat Linux 6 virtual machine and connect to databases using SQL*Plus. I was going to write a bash shell script but decided to use Python instead, to see if I could do it with the Python that came with this version of Linux. I won't paste the entire script here, but the key was to run SQL*Plus from Python instead of a shell script. Here is a simple example showing how I did it:
$ cat test.py
import subprocess

"""
Example of running a sqlplus script from python 2.6.6.
"""

def run_sqlplus(sqlplus_script):
    """
    Run a sql command or group of commands against
    a database using sqlplus.
    """
    p = subprocess.Popen(['sqlplus', '/nolog'], stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    (stdout, stderr) = p.communicate(sqlplus_script)
    stdout_lines = stdout.split("\n")
    return stdout_lines

sqlplus_script = """
connect test/test
select * from dual;
exit
"""

sqlplus_output = run_sqlplus(sqlplus_script)

for line in sqlplus_output:
    print line
Here is the output:
$ python test.py

SQL*Plus: Release 12.1.0.2.0 Production on Fri Nov 4 15:44:30 2016

Copyright (c) 1982, 2014, Oracle. All rights reserved.

SQL> SQL> Connected.
SQL>
D
-
X

SQL> Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
The function run_sqlplus takes a multi-line string as input. This is the text of a SQL*Plus script. It runs sqlplus /nolog to get a SQL*Plus prompt and then passes the strings in as the lines typed at the prompt. The function returns SQL*Plus’s output as a list of strings.
Anyway, this is just an example. I built a real script using these concepts. In this case we are using Python 2.6.6 without cx_Oracle installed so I couldn’t connect directly to an Oracle database. Instead I just ran SQL*Plus from Python.
In this blog post I will explain the migration of the crypto key (aek.key) to the crypto keystore (lockbox) for an upgraded repository. During a customer engagement, I was involved in the migration of a repository hosted on a Content Server 6.7 SP2 to a new infrastructure running a Content Server 7.2 P05, where the lockbox was already used by other repositories. The prerequisite to achieve this migration is to have an already configured repository using the lockbox, which will help to re-encrypt the password of the database owner stored later in the dbpasswd.txt file. The repository whose crypto key we wanted to migrate was properly upgraded to Content Server 7.2 P05 using the Content Server Configuration Program, keeping the existing aek.key from the old system. Our goal was to align the cryptography mechanism to use the lockbox.
Please note that these instructions cannot be used if the file store is encrypted with Trusted Content Services.
We also recommend removing the LDAP objects and re-creating them later so the passwords can be properly re-encrypted and the DER and PEM files re-created under $DOCUMENTUM/dba/secure/ldapdb.
First of all, we need to check which encryption algorithm is currently used by retrieving the r_crypto_mode from the docbase config:
1> select r_crypto_mode from dm_docbase_config
2> GO
r_crypto_mode
----------------------------------------------------------------
3DES_RSA1024_SHA256

(1 row affected)
Before shutting down the repository we will retrieve a list of all users that are configured with an inline password. We will have to reset their password as the encryption mechanism will be changed.
Query to retrieve inline password users:
select user_login_name from dm_user where user_source = 'inline password';
We can now shut down the repository.

It's a best practice to back up the aek.key and the lockbox files located under the $DOCUMENTUM/dba/secure folder:
cd $DOCUMENTUM/dba
tar -czvf secure-bck-MigAek-$(date +%Y%m%d).tgz secure
It's also a good practice to back up all files that will be updated by this procedure:
cd $DOCUMENTUM/dba/config/repo1
cp server.ini server.ini.bck-MigAek-$(date +%Y%m%d)
cp dbpasswd.txt dbpasswd.txt.bck-MigAek-$(date +%Y%m%d)
We will now connect to the Database schema to remove all encryption values. During the next startup of the repository, encryption values will be properly regenerated automatically.
Once connected, you can issue the following SQL instructions:
update dm_docbase_config_s set i_crypto_key = ' ', i_ticket_crypto_key = ' ';
delete from dmi_vstamp_s where i_application = 'dm_docbase_config_crypto_key_init';
delete from dmi_vstamp_s where i_application = 'dm_docbase_config_ticket_crypto_key_init';
delete dm_sysobject_s where r_object_id = (select r_object_id from dm_public_key_certificate_s where key_type = 1);
delete dm_sysobject_r where r_object_id = (select r_object_id from dm_public_key_certificate_s where key_type = 1);
delete dm_public_key_certificate_s where key_type = 1;
delete dm_sysobject_s where r_object_id = (select r_object_id from dm_cryptographic_key_s where key_type = 1);
delete dm_sysobject_r where r_object_id = (select r_object_id from dm_cryptographic_key_s where key_type = 1);
delete dm_cryptographic_key_s where key_type = 1;
commit;
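If you want to double-check before disconnecting, a quick verification query can confirm that the crypto values are really gone. This is a hypothetical sanity check, not part of the original procedure; the first query should show the blank values set above and the second should return no rows:

-- Hypothetical verification after the cleanup above
select i_crypto_key, i_ticket_crypto_key from dm_docbase_config_s;

select i_application from dmi_vstamp_s
where i_application in ('dm_docbase_config_crypto_key_init',
                        'dm_docbase_config_ticket_crypto_key_init');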
The repository initialization file can now be updated to move from the crypto key (aek.key) to the crypto keystore (lockbox):
Inside this file, you should comment the following lines:
crypto_keystore=Local
crypto_mode=3DES_RSA1024_SHA256
crypto_keyname=aek.key
Then uncomment/add the following lines:
crypto_mode = AES256_RSA1024_SHA256
crypto_keystore = Local
crypto_lockbox = lockbox.lb
crypto_keyname = CSaek
Please make sure that the crypto_mode in the first section is the same as the one retrieved before stopping the repository. The crypto_mode of the second section (the new one) can contain the value that you want. The strongest mode available is "AES256_RSA1024_SHA256", so that's probably the value that you will want to set. In our case, we faced an issue because of the old repository and we had to keep the same crypto_mode as the original, which was "3DES_RSA1024_SHA256". Therefore, at the end, our server.ini file contained the following:
crypto_mode = 3DES_RSA1024_SHA256
crypto_keystore = Local
crypto_lockbox = lockbox.lb
crypto_keyname = CSaek
Once this is done, we need to encrypt the password of the database owner of the repository repo1. But how can we do this since the repository isn't started? Simple: we need an already available repository on this Content Server (it's a prerequisite, actually) which is already configured to use the lockbox. We used the global registry for that purpose:
> iapi gr_dbi
Please enter a user (dmadmin):
Please enter password for dmadmin:

EMC Documentum iapi - Interactive API interface
(c) Copyright EMC Corp., 1992 - 2015
All rights reserved.
Client Library Release 7.2.0050.0084

Connecting to Server using docbase gr_dbi
[DM_SESSION_I_SESSION_START]info: "Session 010f583d80095f8a started for user dmadmin."
Connected to Documentum Server running Release 7.2.0050.0214 Linux64.Oracle
Session id is s0
API> initcrypto,c
...
OK
API> encrypttext,c,xxxx
...
DM_ENCR_TEXT_V2=AAAAEJpJA5bVkJGghYFqrik3kxJ0gaWIRNvhVmZA586sBuFx7NqKnbKSpdUYf/BvJgn10OQpoZXL1T7Y2L+RmwZRUPkWqsv139zbU7u2vw7UYKX3
We can now set the encrypted password in the dbpasswd.txt file of the repository repo1 (the file backed up earlier under $DOCUMENTUM/dba/config/repo1) by replacing its content with the DM_ENCR_TEXT_V2 value generated above.

Once this has been done, the repository is ready to be started.
Check the log files and validate that the repository has been properly started. Something similar should be displayed:
2016-09-05T09:11:28.907641 21003 0000000000000000 [DM_SERVER_I_START_SERVER]info: "Docbase repo1 opened"
2016-09-05T09:11:28.907790 21003 0000000000000000 [DM_SERVER_I_SERVER]info: "Setting exception handlers to catch all interrupts"
2016-09-05T09:11:28.907820 21003 0000000000000000 [DM_SERVER_I_START]info: "Starting server using service name: repo1"
2016-09-05T09:11:29.402450 21003 0000000000000000 [DM_SERVER_I_LAUNCH_MTHDSVR]info: "Launching Method Server succeeded."
2016-09-05T09:11:29.420505 21003 0000000000000000 [DM_SERVER_I_LISTENING]info: "The server is listening on network address (Service Name: repo1, Host Name: content_server_01 :V4 IP)"
2016-09-05T09:11:29.420579 21003 0000000000000000 [DM_SERVER_I_IPV6_DISABLED]info: "The server can not listen on IPv6 address because the operating system does not support IPv6"
2016-09-05T09:11:29.426518 21003 0000000000000000 [DM_SERVER_I_LISTENING]info: "The server is listening on network address (Service Name: repo1_s, Host Name: content_server_01:V4 IP)"
2016-09-05T09:11:29.429969 21003 0000000000000000 [DM_SERVER_I_IPV6_DISABLED]info: "The server can not listen on IPv6 address because the operating system does not support IPv6"
2016-09-05T09:11:31.162134 21003 0000000000000000 [DM_WORKFLOW_I_AGENT_START]info: "Workflow agent master (pid : 21060, session 010095b980000007) is started sucessfully."
IsProcessAlive: Process ID 0 is not > 0
2016-09-05T09:11:31.163095 21003 0000000000000000 [DM_WORKFLOW_I_AGENT_START]info: "Workflow agent worker (pid : 21061, session 010095b98000000a) is started sucessfully."
IsProcessAlive: Process ID 0 is not > 0
2016-09-05T09:11:32.165584 21003 0000000000000000 [DM_WORKFLOW_I_AGENT_START]info: "Workflow agent worker (pid : 21074, session 010095b98000000b) is started sucessfully."
IsProcessAlive: Process ID 0 is not > 0
2016-09-05T09:11:33.167476 21003 0000000000000000 [DM_WORKFLOW_I_AGENT_START]info: "Workflow agent worker (pid : 21087, session 010095b98000000c) is started sucessfully."
2016-09-05T09:11:34.167673 21003 0000000000000000 [DM_SERVER_I_START]info: "Sending Initial Docbroker check-point "
2016-09-05T09:11:34.188560 21003 0000000000000000 [DM_MQ_I_DAEMON_START]info: "Message queue daemon (pid : 21105, session 010095b980000456) is started sucessfully."
2016-09-05T09:11:34.488373 21102 010095b980000003 [DM_DOCBROKER_I_PROJECTING]info: "Sending information to Docbroker located on host (content_server_01) with port (1490). Information: (Config(repo1), Proximity(1), Status(Open), Dormancy Status(Active))."
Mon Sep 05 09:12:28 2016 [INFORMATION] [AGENTEXEC 21546] Detected during program initialization: Version: 7.2.0050.0214 Linux64
Mon Sep 05 09:12:32 2016 [INFORMATION] [AGENTEXEC 21546] Detected during program initialization: Agent Exec connected to server repo1: [DM_SESSION_I_SESSION_START]info: "Session 010095b980502d02 started for user dmadmin."
The migration of the encryption key to the encryption keystore is nearly complete. Two tasks remain: resetting the passwords of the internal Documentum users such as dm_bof_registry, dmc_wdk_preferences_owner and dmc_wdk_presets_owner, and resetting the passwords of the application-specific users (the list created just before stopping the repository), but these will not be described in this blog.
As said before, if an LDAP configuration is required, it is better to remove the LDAP objects before the migration and recreate them now that the repository is running properly with the new encryption mechanisms.
All the steps and information to migrate the encryption mechanisms to the lockbox have been provided and I hope that it can be useful for you.
I would like to share with you an interesting issue you may face while using SQL Server AlwaysOn availability groups and secondary read-only replicas. For those who use secondary read-only replicas as reporting servers, keep reading this blog post: it is about update statistics behavior on the secondary replicas, and as you may know, cardinality estimation accuracy is an important part of query performance in this case.
So a couple of days ago, I had an interesting discussion with one of my French MVP friends about an availability group issue he faced at a customer site, related to the above topic. Without going into details yet, he was surprised to see differences between the primary replica and one secondary in the last update statistic dates as well as in the rows sampled value for specific statistics. The environment concerned runs SQL Server 2014 SP2.
First of all, let's say that having different update statistic dates between a primary and a secondary is part of a normal process. Indeed, changes related to statistic objects are replicated over time from the primary to the secondaries, but it is not uncommon to encounter situations where data on the primary is modified until the update statistic threshold is reached, and auto update statistics is then triggered by SQL Server on the secondary replica by the Reporting queries. What is more surprising is that this behavior noticed by my friend was exacerbated in his specific context with large tables.
Let's demonstrate with an example. I was able to easily reproduce the issue on my environment, but I was also interested in testing the behavior of different versions of SQL Server in response to this specific issue. In fact, I tried to figure out whether the problem concerns only a specific build of SQL Server (SQL Server 2014 SP2 in this case) or all SQL Server versions.
Let's use the AdventureWorks2012 database with the bigTransactionHistory table, which contains roughly 34 million rows (343910073 rows).
The statistic information you will see later in this blog post came from the secondaries, each running a specific SQL Server version (respectively 2012, 2014 and 2016), by using the following T-SQL script on each secondary replica.
use AdventureWorks2012;

select
    object_name(s.object_id) as table_name,
    s.name as stat_name,
    s.is_temporary,
    ds.last_updated,
    ds.modification_counter,
    ds.rows,
    ds.rows_sampled,
    CAST(ds.modification_counter * 100. / ds.rows AS DECIMAL(5,2)) AS modif_ratio,
    ds.steps
from sys.stats as s (nolock)
cross apply sys.dm_db_stats_properties(s.object_id, s.stats_id) as ds
where s.object_id = object_id('dbo.bigTransactionHistory');
go
Let's begin with the starting scenario, where I inserted approximately 20% of the initial data into the bigTransactionHistory table, as you may notice below. During the test we will focus only on the idx_cl_big_TransactionHistoryTransactionDate statistic related to the clustered index on the bigTransactionHistory table.
Let's then update the idx_cl_big_TransactionHistoryTransactionDate statistic with FULLSCAN from the primary replica. This operation may be part of a maintenance plan on a daily / monthly basis depending on your scenario. Here is a picture of the statistic information from each secondary:
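For reference, the FULLSCAN update executed on the primary looks like the following sketch (the table and statistic names come from this demo; the exact maintenance command is not shown in the original post):

-- Run on the primary replica: rebuild the statistic by scanning every row
use AdventureWorks2012;
go
update statistics dbo.bigTransactionHistory (idx_cl_big_TransactionHistoryTransactionDate) with fullscan;
go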
Looking at this first statistics output from the secondaries, we may notice that the modification counter from the sys.dm_db_stats_properties DMF did not drop to zero. To be honest, I expected a value equal to 0 here. In addition, executing a Reporting query from each secondary did not have any effect in this case. The Reporting query is as follows; it is designed to specifically use the idx_cl_big_TransactionHistoryTransactionDate statistic.
use AdventureWorks2012;

select count(*)
from dbo.bigTransactionHistory
where TransactionDate between '20060101' and '20080101';
go
Let's keep going by inserting a new bunch of data (approximately 10% more). After running a new update statistics operation with FULLSCAN from the primary (let's say we are again in the situation where a maintenance plan comes into play), here is the corresponding statistic information output from each secondary:
As expected, the modification rows counter value increased up to 24%, but once again we may notice that running update statistics with FULLSCAN on the primary does not correctly reset the modification rows counter on the secondary, regardless of the SQL Server version. Let's run the Reporting query from each secondary and have a look at the statistic information output:
Well, it seems that some update statistics work came into play, but surprisingly we get different results between versions. Let's focus first on SQL Server 2012 (first line), where an auto update statistics operation was triggered by SQL Server. Thus the last_updated column value was updated, the concerned statistic is now defined as temporary on the concerned secondary replica, and the rows sampled value is different from the previous step (423928 vs 45774517). This is because SQL Server used a default sampling algorithm in this case, which does not correspond to the one used in the previous step with the FULLSCAN method.
If we then move quickly to the last two lines (respectively SQL Server 2014 and 2016), we may notice only one change, which concerns the is_temporary column, and no changes concerning either the last update statistic date, the modification counter value or the sampled rows. At this point, I'm not sure I understand the reason. Is it a metadata issue? Is it normal behavior? Well, I will come back to update this section if I get further information.
Let’s continue by performing the same previous tasks (insert a bunch of data and then update statistics with FULLSCAN from the primary). The statistic output from each secondary is as follows:
I'm sure you are beginning to understand what's happening here. The update statistics with FULLSCAN from the primary replica seems to never reset the modification counter on the secondaries. This is a big problem, because if we execute the Reporting query on each secondary we now have a good chance of quickly invalidating what has been done by the update statistics operation with FULLSCAN from the primary. In our context, the main concern is the sampled rows value, which can lead to inaccurate cardinality estimations. Let's run the Reporting query again from each secondary and have a look at the corresponding statistic information output:
Got it! This time, each concerned statistic switched to temporary and the modification counter was reset to 0. Note also that a default sampling was used in this case, in contrast to the previous picture that concerned the update statistics with FULLSCAN. And if I keep going in this way again and again, I will face the same behavior every time, for all versions of SQL Server since 2012.
Let's summarize the situation: in common scenarios, we would probably not pay attention to this specific issue, because cardinality estimation would not be affected and the statistics concerned by Reporting queries on the secondaries cover tables that are smaller by orders of magnitude. But my friend encountered a specific scenario with large tables where the issue is magnified.
As a workaround, you may consider excluding the concerned statistic(s) from the auto update statistics algorithm (NORECOMPUTE option). You may also vote for the Connect item opened by my friend if it makes sense for you!
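As a sketch of that workaround (reusing the statistic from this demo; adapt the names to your own schema), NORECOMPUTE can be specified directly when the maintenance plan updates the statistic on the primary:

-- FULLSCAN update that also excludes the statistic from auto update statistics
use AdventureWorks2012;
go
update statistics dbo.bigTransactionHistory (idx_cl_big_TransactionHistoryTransactionDate)
with fullscan, norecompute;
go

Bear in mind that NORECOMPUTE disables automatic updates of that statistic everywhere, so the scheduled update on the primary becomes the only thing keeping it fresh.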
Hope it helps!
A Guest Post by VP Des Cahill, head evangelist for Oracle's CX Cloud Suite
After completing (and recovering from!) my first CX Central at Oracle OpenWorld, I made my way to Moscone Center in San Francisco one more time to check out Dreamforce 2016.
This time I wouldn’t be worried about my slides or prepping for analyst briefings or a video interview with the Cube. Instead, it would be an opportunity to see how the competition runs its event and to listen for the key messages it is sending to the market.
Salesforce always does a good job with event production.
And this year was no different. The DJ playing funk on the first floor of Moscone set a positive tone from the moment I walked in to register. The conference followed a National Parks theme and key customers were identified as “trailblazers.” This theme was carried out to the nth degree in signage, props, t-shirts, cutesy mascots, slide templates, and other production elements.
Salesforce galvanized its presenters around a single, overarching technology message: AI.
I’m always impressed by Salesforce's ability to prominently weave an overall message into the fabric of the event. The key message this year was artificial intelligence, in the form of Einstein, the umbrella brand for Salesforce AI efforts. And yes, there was even an Einstein mascot roaming around during the keynotes.
Cute as the Einstein mascot was, he couldn’t make the message compelling for me.
The AI message fell flat for three reasons:
1. In some cases, Einstein was simply a re-brand of an old feature. If any feature even remotely smacked of AI, Salesforce slapped the Einstein brand on it.
As a CMO, it’s hard for me to remember a time when marketing automation software wasn’t capable of lead scoring. But it must be new to Salesforce Marketing Cloud, because Salesforce suddenly branded it as Einstein. What? It didn’t have lead scoring before?
Same with Salesforce Commerce Cloud: Salesforce just bought DemandWare in June, but apparently it was built with Einstein even before it was acquired. Because yes, the shopping recommendation engine is powered by Einstein. (And the shopping demo I saw was nothing new.) This transparent attempt to elevate the Einstein brand only served to diminish its value.
You can’t fool customers, Salesforce.
2. There isn’t a lot of intelligence behind Einstein. Intelligence is the ability to acquire and apply knowledge, and Salesforce isn’t providing any new knowledge to customers. Unlike Oracle, Salesforce can only apply its AI to the data you’ve collected in Salesforce apps. What if you just started using its apps? What if you’re a smaller customer without a lot of data?
With Oracle Data Cloud and Adaptive Intelligent Apps, customers can augment existing customer data with rich, third-party data, in order to drive really intelligent offers, actions, and decisions. Einstein cannot do this.
3. Einstein faces a bit of a hype backlash in achieving widespread adoption by Salesforce customers. One analyst I spoke with noted that most Salesforce customers are still uneasy about embracing last year’s Dreamforce hype—Lightning. That same feeling applies to Einstein, and Salesforce didn’t do itself any favors with “everything but the kitchen sink” positioning.
So, what do you get when you take a bunch of existing parts from across your platform, slap them together, and relabel them as something new?
Frankenstein—not Einstein. And it is really scary.
As a newcomer to ODI I enjoy hearing from the more seasoned veterans about common situations they encounter on projects. One of these recurring situations (especially if the company has a very small dev team) is the lack of security. I will not discuss how Oracle improved security by using public/private key pairs for Cloud services, external hackers or any of the buzz words the media likes to toss about. But, I will share with you an easy way to create profiles in ODI to setup a more secure work environment.
Generally speaking, security is neglected because admins, operators or users are not aware of how to set it up, or they find it too limiting and tedious to deal with. Other times you might see the exact opposite, where someone has it so locked down you have to request project permissions on the hour just to get work done (Pro-tip: never let control freaks set up or manage security! Just kidding. Maybe.)
Prior to starting any security profile setups, make sure to sit down and really put some thought into the types of profiles you want to create. Think about the different types of work being done in ODI (developer, operator, etc.) and what level of permission someone may require. Review the built-in generic profiles here. Keep in mind that you will need to set up security in each environment (Dev, Test, QA, Prod and any others you might use) that you want to connect to. No security setup 'automatically' transfers over to other environments, and not all users require access to each environment.
In this tutorial we will take into consideration the following users:
- DI Projects Developer - Level I
- Senior BI Project Manager
We will set up the Security Profile (access) for each user and connect it to the appropriate User for the DEV environment.
NOTE: This tutorial is specific to ODI internal password storage and authentication, not external authentication.
The first step is to decide what type of security profile (access) each user will need.
- DI Projects Developer - Level I: Entry level DI developer. Should be able to develop in select projects only. Should also have 'view-all' access across the environment
- Senior BI Project Manager: Full access to all related tasks in Designer, Operator and Topology. Might also have ability to edit or create new users.
- Consultant: Brought in to assist in developing mappings and to load new data from a recently acquired company.

Ok, now we can begin the setups.
- In a work environment you will log in using an ADMIN or Security Profile that has the credentials to create security profiles. In our example you are assumed to be logged in as Admin, Training or Supervisor and have the correct access to set the profiles up.
- Navigate to the Security tab in ODI.
- Expand the 'Users' accordion and click 'New User' to open up the properties window
- Input Jane's information and assign her a password (abc123) by clicking 'Enter Password'. Make sure that Jane's password will expire in 6 months, forcing her to change it for security purposes. Click 'OK'
- On the same properties window, click the 'Authorization' tab on the top left side. We are granting very limited access because her projects, as an entry level developer, are limited. Click 'Projects' and allow her access to all methods within 'Projects'. Select all methods (use the checkmark in the top left of each object) from the following, clicking SAVE after each object group's methods have been selected:
If you do not see it, go to the very top menu and click Window > ODI Security Navigator (seen below)
Now we will create the User logins that will be linked to each profile.
Most of you will already have User logins, just double-click the specific 'User Profile' when logged in under the ADMIN or full access account to edit permissions for the User.
Create a login for Jane Maine, our Level I ETL Developer. Assign her the profiles that will allow strictly regulated access to projects and models in the Designer Navigator but give her view access for everything (review the profile descriptions). We see that in order to limit her development access in DEV to specific projects and models in the Designer Navigator, we must use the non-generic profiles NG_DESIGNER and NG_VERSION_ADMIN. We must also include CONNECT so she has the option to connect to DEV.
Fast Review: An object is a representation of a design-time or run-time artifact handled through Oracle Data Integrator. Examples of objects include agents, projects, models, data stores, scenarios, mappings, and even repositories. An instance is a particular occurrence of an object. For example, the Datawarehouse project is an instance of the Project object. A method is an action that can be performed on an object, such as edit or delete. Generic profiles allow access to all methods of all instances of an object. Non-generic profiles are not authorized for all methods on the instances; an admin must grant rights on the methods for each instance.
Best Practice: Always go in to your account and change any temporary password. See the video on how to do that, here.
Your screen should now look like this (description is optional - I always add them in):
- Load Plan
- Procedure Command
- Scenario Variable
Your Security Navigator should look similar to this:
Pro Tip: If you aren't sure your security settings are correct, after your new user/profile is saved, expand the 'Objects' and/or 'Instances' (orange boxes on the screenshots above) under the Users name and see what is available.
Now we need to create the Consultant's general User and profile. The Consultant's password does not need to expire, since we will let the account expire after a month.
In this situation, the consultant will need nearly full access but not total access to everything. Check all of the generic profiles EXCEPT version admin. Select the NG VERSION ADMIN to allow selective version access. Your screen should look similar to below.
Now we test our user settings. Disconnect ODI and log in using each USER you created. Look at the limitations for each user.
There are so many options for creating secure Users and Profiles within ODI that allow the appropriate amount of access with maximum security - and in fact, it's fairly easy. You can block out high level access such as the entire Operator Navigator or Designer Navigator, all the way down to granular level security where you can block out very specific Methods and Objects associated with it.
A word to the wise: It is strongly suggested that you only use a generic SUPERVISOR or ADMIN account that has full ODI access for creating users, profiles, changing passwords, etc. Create your own personal user/profile to perform daily work. The reason for this is to know who specifically is doing what. If the user is assigned ADMIN (or something generic) then there is no way to tell who used the login.
Other suggested settings to try out: You can create Users and Profiles for admin purposes, including 'Designer Navigator only' access, 'Topology Navigator only' access, 'Operator Navigator only' access, and variations where you can only access the Designer Navigator and Topology Navigator, but not the Operator tab.
Yesterday a new requirement was raised within Rittman Mead: we wanted to analyse our blog stats and specifically understand the number of page views per author of blog posts published in 2016. The two sources of our data were:
- Google Analytics: a web analytics service provided by Google that tracks and reports website traffic.
- Ghost: our blogging platform, the data extracted from Ghost contains author and post information in JSON format.
My colleague Robin Moffatt already explained in his post how to connect DVD to Google Analytics and how to query JSON files with Apache Drill. Both sources are compatible with DVD, and in this blog post I'll explain my journey in the analysis with the tool to combine both sources.

Ghost JSON data preparation
Following Robin's blog I created two Apache Drill views on top of Ghost Data:
- vw_fact_posts: containing the post information
create or replace view vw_fact_posts as
select po.post.id id,
       po.post.uuid uuid,
       po.post.title title,
       po.post.slug slug,
       po.post.markdown markdown,
       po.post.published_by published_by,
       cast(po.post.published_at as date) published_at,
       po.post.updated_at updated_at,
       po.post.created_by created_by,
       cast(po.post.created_at as date) created_at,
       po.post.author_id author_id,
       po.post.meta_description meta_description,
       po.post.visibility visibility,
       po.post.`language` lan,
       po.post.status status
from (select flatten(ghost.db.data.posts) post from dfs.tmp.ghost) po;
- vw_dim_author: containing author data.
create or replace view vw_dim_author as
select author.u.id id,
       author.u.name name,
       author.u.slug slug,
       author.u.password pwd,
       author.u.email email,
       author.u.image image,
       author.u.status status,
       author.u.`language` lan,
       author.u.visibility visibility,
       author.u.last_login last_login,
       author.u.created_at created_at,
       author.u.updated_at updated_at,
       author.u.updated_by updated_by
from (select flatten(ghost.db.data.`users`) u from dfs.tmp.ghost) author;
The views are not strictly required for the purpose of the analysis, since Drill SQL can be directly injected in DVD; however, creating them has two advantages:
- the interface between DVD and Drill is cleaner, no complex sql has to be entered and debugged
- the views can be reused for other projects outside DVD if needed
Robin's post provided all the details needed to connect to Google Analytics, no need to add anything there. Apache Drill datasource setting is pretty easy - we just need to specify hostname and port where Drill is running along with the connection username and password.
Once the Drill connection is working I can start importing the views. I do it by selecting the myDrill connection, choosing the dfs.tmp database, selecting the view I want to import and clicking on Add All or selecting the columns.
When clicking on OK an "Unsupported SQL statements" error may be raised. This is due to the wrong usage of double quotation marks (") instead of the backtick (`) needed by Drill. I amended the error by clicking on Enter SQL and changing the Drill SQL as in the image below.
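As an illustration (a hypothetical snippet, not the exact SQL generated by DVD), the amendment simply replaces the double-quoted identifiers with the backticks Drill expects:

-- Rejected by Drill (double quotation marks):
-- select "id", "title", "lan" from "dfs.tmp"."vw_fact_posts"

-- Accepted by Drill (backticks):
select `id`, `title`, `lan` from dfs.tmp.`vw_fact_posts`;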
Having imported the two sources I can review the Measure/Attribute definitions as well as the aggregation methods. This is a crucial point since the associations are made automatically and could be wrong. In my case the author_id column was automatically declared as a Measure, which prevented me from using it in joins. I can fix the definitions by right clicking on the source, selecting Inspect and amending the Measure/Attribute definition.
Wrangling Google Analytics Data
DVD's Google Analytics connector exposes a set of pre-aggregated views of the tracking data. The Page Tracking view contains a summary of page views, entrances, exit rates and other KPIs at page level - exactly the information I was looking for.
I then started analysing Page Tracking data within DVD, and found some discrepancies in the data:
- we recently moved our blog from Wordpress to Ghost: all pages accessed since the rittmanmead.com website has been on Ghost carry the /blog/ prefix, which was not the case when we were still on Wordpress.
- the page name could contain a query string appendix.
- the Ghost slug contains the pure page name, without a folder prefix like /blog/2014/ and without starting and ending slashes.
The following is an example of data retrieved from Google Analytics, all for one actual blog post.
The wrangling of Google Analytics data was necessary, and DVD's Data Flow component suited my needs perfectly. I started by importing the Page Tracking data source, then added two columns (sketched just after this list):
- RemovingBlogFromPageName: in order to avoid the Ghost/Wordpress problem described above, with a case-when removing the initial /blog/ prefix
- PageNameClean: to remove extra appendixes such as the query string
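The following is a rough sketch of the logic behind those two formulas (hypothetical expressions in plain SQL syntax; the actual Google Analytics column name and the functions available in the Data Flow editor may differ):

-- Formula for RemovingBlogFromPageName: strip the leading /blog that only Ghost-era pages carry
case when Page like '/blog/%' then substr(Page, 6) else Page end

-- Formula for PageNameClean: drop any query string appendix after the '?'
case when instr(RemovingBlogFromPageName, '?') > 0
     then substr(RemovingBlogFromPageName, 1, instr(RemovingBlogFromPageName, '?') - 1)
     else RemovingBlogFromPageName
end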
I then added the Aggregate step to define the Measures/Attributes and aggregation types. Finally I saved the result locally.
The two columns were added in two consecutive Add Columns steps since the RemovingBlogFromPageName column is a dependency in the PageNameClean formula. In a single Add Columns step several columns can be added if they reference columns already existing in the dataset.

Creating the Project
In my previous post I linked several data sources with Data Flow; this is always possible, but in this case I tried a different approach: the link between Google Analytics and Ghost data (through Drill) was defined directly in the DVD Project itself.
The first step was to include the Drill datasources: I added vw_fact_posts initially and then vw_dim_author by right clicking and selecting Add Data Source.
Both sources are now visible in my project and the join between them is auto-generated, based on column name matching. In my case this was wrong, so I reviewed and changed it by right clicking, selecting Source Diagram and correcting the joining conditions.
Note that I set the Parameter "This connection will" to "Extend a dimension" since the author data is adding attributes to my post data.
The next step was adding the refined Google Analytics dataset that I stored locally with Data Flow. I can do it by right clicking, selecting Add Data Source and then GAAnalyticsDataCleansed. The new source was not immediately visible in my project; this was due to the lack of matching column names for the automatic join to work. I added the joining condition in the Source Diagram section.
Note that in this case the "This connection will" parameter is set to "Add Facts" since Google Analytics data contains the number of page views and other metrics.
After amending the joining condition I'm finally able to include any columns from my datasource in my project. Here we’ve met the requirement to see post metrics by author, all in DVD and through a very intuitive and accessible interface.