Feed aggregator

Find the number in list which is not present in Database column

Tom Kyte - Sat, 2017-01-14 03:46
I have one requirement where I need to find the first missing number in the List which is not there in Database(Oracle) column. Scenario is like this : Table 1: Link to table image is here :- https://i.stack.imgur.com/iXm6N.png...
Categories: DBA Blogs

Create linenumbers for sqlplus queries on the fly.

Tom Kyte - Sat, 2017-01-14 03:46
Is there a quick and easy way to create sequential line numbers on the fly (without Creating a Sequence or using PL*SQL) for a SQLplus query executed inside SQLplus? Example: Here is a live link to output: http://www.videofame.com/fragrance.asp ...
Categories: DBA Blogs

Oracle Apex applications on company DB

Tom Kyte - Sat, 2017-01-14 03:46
Hi, I am trying to create oracle apex applications for my company, the thing I am worried about is that the applications will be complex and would require a lot of space on the DB. Can you let me know what is the default DB size that is available if...
Categories: DBA Blogs

Announcing “Database Star Academy”

Complete IT Professional - Fri, 2017-01-13 20:03
All of the online courses I have created used to be on a separate site of mine called Software Developer Academy. These have now all been moved to a new section under my main site, called Database Star Academy. Read on to find out more. What Is Database Star Academy? It’s the name of a membership section on […]
Categories: Development

Fast-Start Failover for Maximum Protection in #Oracle 12c

The Oracle Instructor - Fri, 2017-01-13 11:40

Fast-Start Failover is supported with Maximum Protection in 12cR2. Also, multiple Observers can now monitor the same Data Guard configuration simultaneously. I will show both in this article, starting with a (Multitenant) Primary in Maximum Protection mode and two Standby Databases. It is still not recommended to have the highest protection mode configured with only one standby, so this is my starting point:

DGMGRL> show configuration;

Configuration - myconf

Protection Mode: MaxProtection
Members:
cdb1 - Primary database
cdb1sb - Physical standby database
cdb1sb2 - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS (status updated 57 seconds ago)

All three databases have flashback turned on. I want to have a setup like this in the end:

FSFO with Max Protection and 2 Observers
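
As a side note (not part of the original listing), the flashback status can be checked on each member with a quick query in SQL*Plus; it should return YES everywhere, since Flashback Database is what allows a failed database to be reinstated automatically later on:

select flashback_on from v$database;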

This is how it’s been configured:

DGMGRL> edit database cdb1 set property faststartfailovertarget='cdb1sb,cdb1sb2';
Property "faststartfailovertarget" updated
DGMGRL> edit database cdb1sb set property faststartfailovertarget='cdb1,cdb1sb2';
Property "faststartfailovertarget" updated
DGMGRL> edit database cdb1sb2 set property faststartfailovertarget='cdb1,cdb1sb';
Property "faststartfailovertarget" updated
DGMGRL> enable fast_start failover;
Enabled.

On host uhesse4:

[oracle@uhesse4 ~]$ dgmgrl sys/oracle@cdb1
DGMGRL for Linux: Release 12.2.0.1.0 - Production on Fri Jan 13 17:20:52 2017

Copyright (c) 1982, 2016, Oracle and/or its affiliates.  All rights reserved.

Welcome to DGMGRL, type "help" for information.
Connected to "cdb1"
Connected as SYSDBA.
DGMGRL> start observer number_one;
[W000 01/13 17:21:04.85] FSFO target standby is cdb1sb
[W000 01/13 17:21:07.05] Observer trace level is set to USER
[W000 01/13 17:21:07.05] Try to connect to the primary.
[W000 01/13 17:21:07.05] Try to connect to the primary cdb1.
[W000 01/13 17:21:07.05] The standby cdb1sb is ready to be a FSFO target
[W000 01/13 17:21:09.06] Connection to the primary restored!
[W000 01/13 17:21:13.07] Disconnecting from database cdb1.

On host uhesse3:

[oracle@uhesse3 ~]$ dgmgrl sys/oracle@cdb1
DGMGRL for Linux: Release 12.2.0.1.0 - Production on Fri Jan 13 17:22:16 2017

Copyright (c) 1982, 2016, Oracle and/or its affiliates.  All rights reserved.

Welcome to DGMGRL, type "help" for information.
Connected to "cdb1"
Connected as SYSDBA.
DGMGRL> start observer number_two;
[W000 01/13 17:22:32.68] FSFO target standby is cdb1sb
[W000 01/13 17:22:34.85] Observer trace level is set to USER
[W000 01/13 17:22:34.85] Try to connect to the primary.
[W000 01/13 17:22:34.85] Try to connect to the primary cdb1.
[W000 01/13 17:22:34.85] The standby cdb1sb is ready to be a FSFO target
[W000 01/13 17:22:36.86] Connection to the primary restored!
[W000 01/13 17:22:40.86] Disconnecting from database cdb1.

This is now the state of the configuration:

DGMGRL> show configuration;

Configuration - myconf

  Protection Mode: MaxProtection
  Members:
  cdb1    - Primary database
    cdb1sb  - (*) Physical standby database 
    cdb1sb2 - Physical standby database 

Fast-Start Failover: ENABLED

Configuration Status:
SUCCESS   (status updated 33 seconds ago)

DGMGRL> show fast_start failover;

Fast-Start Failover: ENABLED

  Threshold:          15 seconds
  Target:             cdb1sb
  Candidate Targets:  cdb1sb,cdb1sb2
  Observers:      (*) number_two
                      number_one
  Lag Limit:          30 seconds (not in use)
  Shutdown Primary:   TRUE
  Auto-reinstate:     TRUE
  Observer Reconnect: (none)
  Observer Override:  FALSE

Configurable Failover Conditions
  Health Conditions:
    Corrupted Controlfile          YES
    Corrupted Dictionary           YES
    Inaccessible Logfile            NO
    Stuck Archiver                  NO
    Datafile Write Errors          YES

  Oracle Error Conditions:
    (none)

That protects against the failure of any two components in the configuration with automatic failover and zero data loss! For example, the first standby may fail and then the primary; we then fail over to the second standby, which becomes the new fast-start failover target:

[oracle@uhesse2 ~]$ ps -ef | grep smon
oracle   15087     1  0 17:40 ?        00:00:00 ora_smon_cdb1sb
oracle   15338  9765  0 17:49 pts/2    00:00:00 grep --color=auto smon
[oracle@uhesse2 ~]$ kill -9 15087

That crashed the first standby. This is what the Observers report:

[W000 01/13 17:49:34.24] Failed to ping the standby.
[W000 01/13 17:49:37.25] Failed to ping the standby.
[W000 01/13 17:49:40.25] Failed to ping the standby.
[W000 01/13 17:49:43.25] Failed to ping the standby.
[W000 01/13 17:49:46.26] Failed to ping the standby.
[W000 01/13 17:49:46.26] Standby database has changed to cdb1sb2.
[W000 01/13 17:49:47.26] Try to connect to the primary.
[W000 01/13 17:49:47.26] Try to connect to the primary cdb1.
[W000 01/13 17:49:48.34] The standby cdb1sb2 is ready to be a FSFO target
[W000 01/13 17:49:53.35] Connection to the primary restored!
[W000 01/13 17:49:57.35] Disconnecting from database cdb1.

This is the state of the configuration now:

DGMGRL> show configuration;

Configuration - myconf

  Protection Mode: MaxProtection
  Members:
  cdb1    - Primary database
    Error: ORA-16778: redo transport error for one or more members

    cdb1sb2 - (*) Physical standby database 
    cdb1sb  - Physical standby database 
      Error: ORA-1034: ORACLE not available

Fast-Start Failover: ENABLED

Configuration Status:
ERROR   (status updated 14 seconds ago)

Notice that the Fast-Start Failover indicator (*) now points to cdb1sb2. Now the primary fails:

[oracle@uhesse1 ~]$ ps -ef | grep smon
oracle   21334     1  0 17:41 ?        00:00:00 ora_smon_cdb1
oracle   22077  5043  0 17:52 pts/0    00:00:00 grep --color=auto smon
[oracle@uhesse1 ~]$ kill -9 21334

This is what the Observers report:

[W000 01/13 17:52:54.04] Primary database cannot be reached.
[W000 01/13 17:52:54.04] Fast-Start Failover threshold has not exceeded. Retry for the next 15 seconds
[W000 01/13 17:52:55.05] Try to connect to the primary.
[W000 01/13 17:52:57.13] Primary database cannot be reached.
[W000 01/13 17:52:58.13] Try to connect to the primary.
[W000 01/13 17:53:06.38] Primary database cannot be reached.
[W000 01/13 17:53:06.38] Fast-Start Failover threshold has not exceeded. Retry for the next 3 seconds
[W000 01/13 17:53:07.39] Try to connect to the primary.
[W000 01/13 17:53:09.46] Primary database cannot be reached.
[W000 01/13 17:53:09.46] Fast-Start Failover threshold has expired.
[W000 01/13 17:53:09.46] Try to connect to the standby.
[W000 01/13 17:53:09.46] Making a last connection attempt to primary database before proceeding with Fast-Start Failover.
[W000 01/13 17:53:09.46] Check if the standby is ready for failover.
[S019 01/13 17:53:09.47] Fast-Start Failover started...

17:53:09.47  Friday, January 13, 2017
Initiating Fast-Start Failover to database "cdb1sb2"...
[S019 01/13 17:53:09.47] Initiating Fast-start Failover.
Performing failover NOW, please wait...
Failover succeeded, new primary is "cdb1sb2"
17:53:23.68  Friday, January 13, 2017

After the two crashed databases have been restarted, they are automatically reinstated and the configuration then looks like this:

DGMGRL> show configuration;

Configuration - myconf

  Protection Mode: MaxProtection
  Members:
  cdb1sb2 - Primary database
    cdb1    - (*) Physical standby database 
    cdb1sb  - Physical standby database 

Fast-Start Failover: ENABLED

Configuration Status:
SUCCESS   (status updated 7 seconds ago)

DGMGRL> show fast_start failover;

Fast-Start Failover: ENABLED

  Threshold:          15 seconds
  Target:             cdb1
  Candidate Targets:  cdb1,cdb1sb
  Observers:      (*) number_two
                      number_one
  Lag Limit:          30 seconds (not in use)
  Shutdown Primary:   TRUE
  Auto-reinstate:     TRUE
  Observer Reconnect: (none)
  Observer Override:  FALSE

Configurable Failover Conditions
  Health Conditions:
    Corrupted Controlfile          YES
    Corrupted Dictionary           YES
    Inaccessible Logfile            NO
    Stuck Archiver                  NO
    Datafile Write Errors          YES

  Oracle Error Conditions:
    (none)

Switching back to make cdb1 primary – this is of course optional:

DGMGRL> switchover to cdb1;
Performing switchover NOW, please wait...
Operation requires a connection to database "cdb1"
Connecting ...
Connected to "cdb1"
Connected as SYSDBA.
New primary database "cdb1" is opening...
Operation requires start up of instance "cdb1sb2" on database "cdb1sb2"
Starting instance "cdb1sb2"...
ORACLE instance started.
Database mounted.
Connected to "cdb1sb2"
Switchover succeeded, new primary is "cdb1"
DGMGRL> show configuration;

Configuration - myconf

  Protection Mode: MaxProtection
  Members:
  cdb1    - Primary database
    cdb1sb  - (*) Physical standby database 
    cdb1sb2 - Physical standby database 

Fast-Start Failover: ENABLED

Configuration Status:
SUCCESS   (status updated 29 seconds ago)

I think this enhancement is really a big deal!


Tagged: 12c New Features, Data Guard
Categories: DBA Blogs

Ricardo Eletro Reduces Operational Cost by 40m Brazilian Reais per Month with Oracle Retail

Oracle Press Releases - Fri, 2017-01-13 11:00
Press Release
Ricardo Eletro Reduces Operational Cost by 40m Brazilian Reais per Month with Oracle Retail Brazilian Consumer Electronics Retailer Streamlines Operations with Oracle Retail Merchandising and Fiscal Management Solutions

Redwood Shores, Calif.—Jan 13, 2017

Today Oracle announced that Ricardo Eletro, the third largest retailer of consumer electronics in Brazil, has deployed Oracle Retail Merchandise Operations Management and Oracle Retail Fiscal Management to enable converged commerce, empower associates and increase operational efficiency. Ricardo Eletro faced the challenge of integrating more than 900 stores and 12 distribution centers in the midst of company transformation. 

The partnership between Oracle Retail and Ricardo Eletro began eight years ago. In January 2016, following several mergers and the costly support of disparate systems and varying IT infrastructure maturity levels, the decision was made to unify Ricardo Eletro under one brand and a unified Oracle IT infrastructure. This collaboration empowered the Ricardo Eletro commercial department to optimize the market structure to facilitate open-to-buy, centralize management, and streamline pricing, while enabling adherence to federal regulations.

"We needed to establish a foundation for the new organization. We evaluated the systems and existing investment of all our companies. Oracle emerged as the clear choice to establish a foundation for growth for Ricardo Eletro," said Edson Tavares, CIO of Ricardo Eletro.

“The ‘Unification’ project transformed the company with a simplified infrastructure, a unified workforce and a better appreciation of customers while reducing operation costs by 40M Brazilian Reais per month,” said Vinicius Hamzi, Planning and Management Vice President of Ricardo Eletro. “Our team empowered the holding and operations teams with a culture of transparency and efficiency. We launched a Zero-Based Budget (ZBB) and we adopted the best practices and retail rich functionality of Oracle Retail and Oracle E-Business Suite.”

Ricardo Eletro selected Oracle Retail implementation partner BRX Retail to support this project. BRX Retail provided local resources with product experience, local understanding of Brazilian tax regulation and the implementation knowledge to deliver and respond quickly.

“With the implementation of Oracle, Ricardo Eletro has proven that there is power in simplicity. Ricardo Eletro can now anticipate the needs of the stores and the online channels to better serve the consumers with a simplified and inspiring experience,” said Ray Carlin, Senior Vice President and General Manager, Oracle Retail. “Oracle has dedicated resources to localize the requirements for the Brazil market. Oracle Retail is thrilled to partner with Ricardo Eletro to extend its foundation for growth.”

Oracle Retail at NRF 2017

Oracle Retail will be showcasing the full suite of Oracle Retail solutions and cloud services at the National Retail Federation Big Show, Jan. 15-17, 2017, in New York City at the Jacob K. Javits Convention Center. Oracle Retail will be located at booth #2521. For more information check out: www.oracle.com/nrf2017

Contact Info
Matthew Torres
Oracle
+415.595.1584
matt.torres@oracle.com
About Oracle

Oracle offers a comprehensive and fully integrated stack of cloud applications and platform services. For more information about Oracle (NYSE:ORCL), visit www.oracle.com.

Trademarks

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.

Safe Harbor

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle Corporation. 

Talk to a Press Contact

Matthew Torres

  • +415.595.1584

2016 is over and it was great, 2017 will be even more exciting in the open source area

Yann Neuhaus - Fri, 2017-01-13 10:05

Disclaimer: This will not be a technical post at all. The goal is to provide a short overview of what we did in 2016 and what we plan to do in 2017 when it comes to our Open Infrastructure division. We’ll start with a few numbers then look at the events we did and some of the projects we completed in 2016. Here we go …

We had a total number of 90 blog posts in 2016 covering the following areas: Linux, MySQL/MariaDB, MongoDB, Cassandra, Mirantis, PostgreSQL, EnterpriseDB, Docker, Ansible, Amazon AWS. Here is the list for 2016:

installing-edb-postgres-advanced-server-9-5
the-postgres-plus-cloud-database
avoiding-access-to-the-public-schema-in-postgresql
using-the-official-postgresql-yum-repositories
external-tables-in-postgresql
postgresql-on-amazon-rds-loading-the-beast
postgresql-on-amazon-rds-adding-a-replica-to-the-beast
postgresql-on-amazon-rds-configuring-the-beast
postgresql-on-amazon-rds-securing-the-beast
postgresql-on-amazon-rds-setting-up-the-beast
the-dbi-services-postgresql-reference-architecture-2-the-community-approach
edb-postgres-advanced-server-9-5-new-features-profiles
install-pljava-in-postgresql-9-5-1
feeding-blogsrrs-items-directly-into-your-postgresql-database
pre-warming-the-buffer-cache-in-postgresql
transactional-ddl
install-pljava-in-postgres-plus-advanced-server-9-5
launching-a-vm-with-the-amazon-aws-command-line-tools
linux-how-to-check-the-exit-status-of-several-piped-commands
lvm-how-to-extend-a-volume-group
linux-quick-tip-what-is-the-local-time-in-kolkata
mongodb-installation
introduction-to-mongodb
getting-started-with-ansible-preparations
the-almost-same-sample-schema-for-all-major-relational-databases-4-mssql
connecting-your-postgresql-instance-to-a-mariadbmysql-instance
the-almost-same-sample-schema-for-all-major-relational-databases-3-mysqlmariadb
maintenance-scenarios-with-edb-failover-manager-3-witness-node
maintenance-scenarios-with-edb-failover-manager-2-primary-node
the-almost-same-sample-schema-for-all-major-relational-databases-2-oracle
the-almost-same-sample-schema-for-all-major-relational-databases-1-postgresql
stay-tuned-with-kernel-parameters
maintenance-scenarios-with-edb-failover-manager-1-standby-node
a-look-at-postgresql-9-6-psql-gexec-2
a-look-at-postgresql-9-6-killing-idle-transactions-automatically
a-look-at-postgresql-9-6-progress-reporting-for-vacuum-operations
a-look-at-postgresql-9-6-the-wait-interface
happy-birthday-postgresql
connecting-your-postgresql-instance-to-a-microsoft-sql-server-instance
interested-in-the-most-advanced-open-source-database-where-you-can-meet-us-in-june
postgresql-as-a-central-reporting-hub-yes-of-course
swiss-pgday-2016-slides-are-online
which-parameter-changes-do-require-a-restart-of-my-postgresql-instance
shrinking-oracle-vm-virtualbox-with-zerofree
elasticsearch-kibana-logstash-and-filebeat-centralize-all-your-database-logs-and-even-more
getting-started-with-ansible-creating-the-postgresql-instance
getting-started-with-ansible-download-the-postgresql-sources-compile-and-install
getting-started-with-ansible-installing-os-packages-creating-groups-and-users/
gettin-ansible-up-and-running-on-a-sles-12-sp1-host
how-to-do-a-filesystem-resize-ext3ext4-on-redhat-running-on-vmware
running-postgresql-on-zfs-on-linux-fun-with-snapshots-and-clones
running-postgresql-on-zfs-on-linux
connecting-your-postgresql-instance-to-an-oracle-database-debian-version
the-dbi-open-infrastructure-technology-division
sharding-with-postgresql
what-the-hell-are-these-template0-and-template1-databases-in-postgresql
securing-your-connections-to-postgresql-by-using-ssl
edb-failover-manager-2-1-two-new-features
edb-failover-manager-2-1-upgrading
disable-red-hat-7-transparent-hugepages
auditing-in-postgresql
understanding-row-level-security-on-postgresql
mariadb-audit-plugin
how-to-install-mirantis-openstack-9-0-using-virtualbox-part-1-3
how-to-install-mirantis-openstack-9-0-using-virtualbox-part-1-2
how-to-install-mirantis-openstack-9-0-using-virtualbox-part-1
how-to-patch-postgres-plus-advanced-server-in-a-standby-configuration
running-a-cassandra-cluster-in-a-single-server
how-to-patch-postgres-plus-advanced-server
apache-cassandra-overview
running-postgresql-on-zfs-on-linux-compression
can-i-do-it-with-postgresql-5-generating-ddl-commands
can-i-do-it-with-postgresql-4-external-tables
can-i-do-it-with-postgresql-3-tablespaces
can-i-do-it-with-postgresql-2-dual
can-i-do-it-with-postgresql-1-restore-points
mongodb-installation-on-windows
managing-my-amazon-web-services-redhat-instance
linux-instance-in-amazon-web-services-aws
edb-postgres-advanced-server-9-6-beta-released
can-i-do-it-with-postgresql-8-transportable-tablespaces
getting-started-with-docker-2-building-your-own-base-image
getting-started-with-docker-1-overview-and-installation
enterprisedb-backup-and-recovery-tool-bart
installing-postgresql-9-6-1-with-enterprisedb-installer
can-i-do-it-with-postgresql-7-partitioning
oel-7-project-quotas-on-oracle-homes-with-xfs-on-oracle-linux-7
oel-7-how-to-disable-ipv6-on-oracle-linux-7
from-mysql-oracle-to-postgres-using-the-edb-migration-toolkit
can-i-do-it-with-postgresql-6-server-programming

Quite a lot of information was shared. For 2017 you should see more posts in the Cloudera and OpenStack areas as well as the private DBaaS service we are currently working on.

But obviously we did not only write blogs :) We organized two PostgreSQL events, we sponsored and gave two talks at the Swiss PGDay 2016, and we gave a talk at the IT Tage 2016 in Frankfurt.
For 2017 we'll be organizing an Oracle · Open Infrastructure · Cloudera & Big Data Appliance event, we'll be at the Swiss PGDay 2017, and we'll probably be at the IT Tage in Frankfurt again. If nothing goes wrong we'll also be at M|17 in New York. This is what is currently scheduled; maybe there will be even more.

As for the projects we did in 2016, there was a lot going on: we did highly available EnterpriseDB PPAS projects, highly available community PostgreSQL projects, a MongoDB project, and even a PostgreSQL project on Windows. We implemented Ansible at a customer to automate their PostgreSQL deployments. There were several MariaDB and MySQL Galera cluster implementations as well as MySQL Cluster Control setups. Out of all these projects we got several SLAs signed for MySQL as well as for PostgreSQL, which is a great thing as it proves that our customers trust us. On top of that we certified 7 of our service desk people on EnterpriseDB Postgres Plus (EDB Postgres Advanced Server 9.5 Professional). Nobody else in the D-A-CH region has achieved that so far.
For 2017 there are already some projects in the pipeline: one huge migration from another vendor to PostgreSQL/EDB, and a migration from another vendor to community PostgreSQL. We will assist a customer in building a highly available, extremely flexible PostgreSQL deployment for their web infrastructure. For two other projects we'll migrate existing PostgreSQL deployments to PostgreSQL 9.5, and there will be one huge 6TB migration from PostgreSQL 9.2 to a recent release. And this is just what is known today.

Workshops: In 2016 we released our PostgreSQL DBA Essentials Workshop, which has been given four times since then. The MySQL DBA Essentials workshop was also given several times last year.
For 2017 the MySQL workshop will be updated and we plan to release a brand new MongoDB workshop.

In 2016 we released DMK for PostgreSQL and NoSQL.
For 2017, all DMK releases will be updated with the latest and greatest things we have learned at our customers' sites. One big announcement will be an appliance based on open source products; I cannot tell you much right now, but there will be some news in the coming weeks.

You can see that we invested a lot in these areas in 2016, and 2017 will be no exception. If anything, the trend is still pointing up.

Finally, my personal wish list for 2017: dare to be more open, more flexible, and to have fun when you work in your IT landscape. Great things can be built on open source tools.

PS: To my colleagues: I am sure I forgot one or the other thing we did in 2016 that needs to be mentioned here. Just leave a comment to complete the list :)

Happy new 2017
Daniel

 

The article "2016 is over and it was great, 2017 will be even more exciting in the open source area" appeared first on Blog dbi services.

Data Processing and Enrichment in Spark Streaming with Python and Kafka

Rittman Mead Consulting - Fri, 2017-01-13 10:00

In my previous blog post I introduced Spark Streaming and how it can be used to process 'unbounded' datasets. The example I did was a very basic one - simple counts of inbound tweets and grouping by user. All very good for understanding the framework and not getting bogged down in detail, but ultimately not so useful.

We're going to stay with Twitter as our data source in this post, but we're going to consider a real-world requirement for processing Twitter data with low-latency. Spark Streaming will again be our processing engine, with future posts looking at other possibilities in this area.

Twitter has come a long way from its early days as an SMS-driven "microblogging" site. Nowadays it's used by millions of people to discuss technology, share food tips, and, of course, track the progress of tea-making. But it's also used for more nefarious purposes, including spam and the sharing of links to pirated material. The requirement we had for this proof of concept was to filter tweets for suspected copyright-infringing links so that further action could be taken.

The environment I'm using is the same as before - Spark 2.0.2 running on Docker with Jupyter Notebooks to develop the code (and this article!). You can download the full notebook here.

The inbound tweets are coming from an Apache Kafka topic. Any matched tweets will be sent to another Kafka topic. The match criteria are:

  • Not a retweet
  • Contains at least one URL
  • URL(s) are not on a whitelist (for example, we're not interested in links to spotify.com, or back to twitter.com)
  • The Tweet text must match at least two from a predefined list of artists, albums, and tracks. This is necessary to avoid lots of false positives - think of how many music tracks there are out there, with names that are common in English usage ("yesterday" for example). So we must match at least two ("Yesterday" and "Beatles", or "Yesterday" and "Help!").
    • Match terms will take into account common misspellings (Little Mix -> Litle Mix), hashtags (Little Mix -> #LittleMix), etc

We'll also use a separate Kafka topic for audit/debug purposes to inspect any non-matched tweets.

As well as matching the tweet against the above conditions, we will enrich the tweet message body to store the identified artist/album/track to support subsequent downstream processing.
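
To make that target structure concrete, here is a sketch of the enriched element as it is built by the processing code later in this article. The field names are the ones used in that code; the values are invented purely for illustration.

# Illustrative only - the shape of tweet['enriched'] after processing
enriched_example = {
    'media_details': {'artist': 'Little Mix', 'track': 'Shout Out To My Ex'},
    'url_details':   {'domain': ['www36.foobar.com'],
                      'primary_domain': ['foobar.com'],
                      'full_url': ['http://www36.foobar.com/some-download-page']},
    'match_count': 2
}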

The final part of the requirement is to keep track of the number of inbound tweets, the number of matched vs unmatched, and for those matched, which artists they were for. These counts need to be per batch and over a window of time too.
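
Per-batch versus windowed counting is something Spark Streaming gives us more or less for free. As a minimal, self-contained sketch (separate from the full job later in this article; the Zookeeper address, topic name and intervals below are placeholders):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc  = SparkContext(appName="count_sketch")
ssc = StreamingContext(sc, 30)              # 30-second batches
ssc.checkpoint('/tmp/count_sketch')         # windowed counts need checkpointing

# Placeholder Zookeeper quorum, consumer group and topic map
stream = KafkaUtils.createStream(ssc, 'localhost:2181', 'count_sketch', {'twitter': 1})

stream.count().pprint()                     # tweets per batch
stream.countByWindow(21600, 30).pprint()    # tweets over the last six hours

ssc.start()
ssc.awaitTermination()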

Getting Started - Prototyping the Processing Code

Before we get into the meat of the streaming code, let's take a step back and look at what we want the code to achieve. From the previous examples we know we can connect to a Kafka topic, pull in tweets, parse them for given fields, and do windowed counts. So far, so easy (or at least, already figured out!). Let's take a look at the nub of the requirement here - the text matching.

If we peruse the BBC Radio 1 Charts we can see the popular albums and artists of the moment (Grant me a little nostalgia here; in my day people 'pirated' music from the Radio 1 chart show onto C90 cassettes, trying to get it without the DJ talking over the start and end. Nowadays it's done on a somewhat more technologically advanced basis). Currently it's "Little Mix" with the album "Glory Days". A quick Wikipedia or Amazon search gives us the track listing too:

  1. Shout Out to My Ex
  2. Touch
  3. F.U.
  4. Oops - Little Mix feat. Charlie Puth
  5. You Gotta Not
  6. Down & Dirty
  7. Power
  8. Your Love
  9. Nobody Like You
  10. No More Sad Songs
  11. Private Show
  12. Nothing Else Matters
  13. Beep Beep
  14. Freak
  15. Touch

A quick Twitter search for the first track title gives us this tweet - I have no idea if it's legit or not, but it serves as an example for our matching code requirements:

DOWNLOAD MP3: Little Mix – Shout Out To My Ex (CDQ) Track https://t.co/C30c4Fel4u pic.twitter.com/wJjyG4cdjE

— Ngvibes Media (@ngvibes_com) November 3, 2016

Using the Twitter developer API I can retrieve the JSON for this tweet directly. I'm using the excellent Paw tool to do this.
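
If you'd rather do the same from Python than from a GUI client, a hedged sketch against the (then-current) REST API 1.1 statuses/show endpoint would look something like the following; the OAuth credentials and tweet id are placeholders, not values from this article:

import requests
from requests_oauthlib import OAuth1

# Placeholder credentials - substitute the keys of your own Twitter app
auth = OAuth1('CONSUMER_KEY', 'CONSUMER_SECRET',
              'ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')

r = requests.get('https://api.twitter.com/1.1/statuses/show.json',
                 params={'id': 'TWEET_ID'},     # placeholder tweet id
                 auth=auth)
tweet = r.json()
print(tweet.get('text'))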

From this we can get the text element:

"text": "DOWNLOAD MP3: Little Mix \u2013 Shout Out To My Ex (CDQ) Track https:\/\/t.co\/C30c4Fel4u https:\/\/t.co\/wJjyG4cdjE",

The obvious approach would be to have a list of match terms, something like:

match_text=("Little Mix","Glory Days","Shout Out to My Ex","Touch","F.U.")

But we need to make sure we've matched at least two of the three types of metadata (artist/album/track), so we need to know which ones we've matched in the text. We also need to handle variations in the text for a given match (such as misspellings).

What I came up with was this:

filters=[]  
filters.append({"tag":"album","value": "Glory Days","match":["Glory Days","GloryDays"]})  
filters.append({"tag":"artist","value": "Little Mix","match":["Little Mix","LittleMix","Litel Mixx"]})  
filters.append({"tag":"track","value": "Shout Out To My Ex","match":["Shout Out To My Ex","Shout Out To My X","ShoutOutToMyEx","ShoutOutToMyX"]})  
filters.append({"tag":"track","value": "F.U.","match":["F.U","F.U.","FU","F U"]})  
filters.append({"tag":"track","value": "Touch","match":["Touch"]})  
filters.append({"tag":"track","value": "Oops","match":["Oops"]})

def test_matching(test_string):  
    print 'Input: %s' % test_string
    for f in filters:
        for a in f['match']:
            if a.lower() in test_string.lower():
                print '\tTag: %s / Value: %s\n\t\t(Match string %s)' % (f['tag'],f['value'],a)

We could then take the test string from above and test it:

test_matching('DOWNLOAD MP3: Little Mix \u2013 Shout Out To My Ex (CDQ) Track https:\/\/t.co\/C30c4Fel4u https:\/\/t.co\/wJjyG4cdjE')  
Input: DOWNLOAD MP3: Little Mix \u2013 Shout Out To My Ex (CDQ) Track https:\/\/t.co\/C30c4Fel4u https:\/\/t.co\/wJjyG4cdjE
    Tag: artist / Value: Little Mix
        (Match string Little Mix)
    Tag: track / Value: Shout Out To My Ex
        (Match string Shout Out To My Ex)

as well as making sure that variations in naming were also correctly picked up and tagged:

test_matching('DOWNLOAD MP3: Litel Mixx #GloryDays https:\/\/t.co\/wJjyG4cdjE')  
Input: DOWNLOAD MP3: Litel Mixx #GloryDays https:\/\/t.co\/wJjyG4cdjE
    Tag: album / Value: Glory Days
        (Match string GloryDays)
    Tag: artist / Value: Little Mix
        (Match string Litel Mixx)
Additional Processing

With the text matching figured out, we also needed to address the other requirements:

  • Not a retweet
  • Contains at least one URL
    • URL(s) are not on a whitelist (for example, we're not interested in links to spotify.com, or back to twitter.com)
Not a Retweet

In the old days retweets were simply reposts of the same tweet with an RT prefix; now retweeting is part of the Twitter model and Twitter clients display the original tweet with the retweeter shown. In the background, though, the JSON is different from that of an original tweet (i.e. not a retweet).

Original tweet:

Because we all know how "careful" Trump is about not being recorded when he's not aware of it. #BillyBush #GoldenGate

— George Takei (@GeorgeTakei) January 12, 2017
{
  "created_at": "Thu Jan 12 00:36:22 +0000 2017",
  "id": 819342218611728384,
  "id_str": "819342218611728384",
  "text": "Because we all know how \"careful\" Trump is about not being recorded when he's not aware of it. #BillyBush #GoldenGate",

[...]

Retweet:

{
  "created_at": "Thu Jan 12 14:40:44 +0000 2017",
  "id": 819554713083461632,
  "id_str": "819554713083461632",
  "text": "RT @GeorgeTakei: Because we all know how \"careful\" Trump is about not being recorded when he's not aware of it. #BillyBush #GoldenGate",

[...]

  "retweeted_status": {
    "created_at": "Thu Jan 12 00:36:22 +0000 2017",
    "id": 819342218611728384,
    "id_str": "819342218611728384",
    "text": "Because we all know how \"careful\" Trump is about not being recorded when he's not aware of it. #BillyBush #GoldenGate",
[...]

So retweets have an additional set of elements in the JSON body, under the retweeted_status element. We can pick this out using the get method as seen in this code snippet, where tweet is a Python object created from a json.loads from the JSON of the tweet:

if tweet.get('retweeted_status'):  
    print 'Tweet is a retweet'
else:  
    print 'Tweet is original'
Contains a URL, and URL is not on Whitelist

Twitter are very good to us in the JSON they supply for each tweet. Every possible attribute of the tweet is encoded as elements in the JSON, meaning that we don't have to do any nasty parsing of the tweet text itself. To find out if there are URLs in the tweet, we just check the entities.urls element, and iterate through the array if present.

if not tweet.get('entities'):  
    print 'no entities element'
else:  
    if not tweet.get('entities').get('urls'):
        print 'no entities.urls element'

The URL itself is again provided to us by Twitter as the expanded_url within the urls array, and using the urlsplit library as I did previously we can extract the domain:

for url in tweet['entities']['urls']:

    expanded_url = url['expanded_url']
    domain = urlsplit(expanded_url).netloc

With the domain extracted, we can then compare it to a predefined whitelist so that we don't pick up tweets that are just linking back to sites such as Spotify, iTunes, etc. Here I'm using the Python set type and issubset method to compare the list of domain(s) that I've extracted from the tweet into the url_info list, against the whitelist:

if set(url_info['domain']).issubset(domain_whitelist):  
    print 'All domains whitelisted'
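
To make the semantics of that check concrete - a tweet is only discarded when every linked domain is whitelisted - here is a tiny self-contained illustration (the domains are invented):

domain_whitelist = set(['spotify.com', 'twitter.com', 'itun.es'])

print(set(['twitter.com']).issubset(domain_whitelist))                         # True  -> ignore the tweet
print(set(['twitter.com', 'dodgy-mp3s.example']).issubset(domain_whitelist))   # False -> keep processing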
The Stream Processing Bit

With me so far? We've looked at the requirements for what our stream processing needs to do, and worked out the prototype code that will do this. Now we can jump into the actual streaming code. You can see the actual notebook here if you want to try this yourself.

Job control variables
batchIntervalSec=30  
windowIntervalSec=21600 # Six hours  
app_name = 'spark_twitter_enrich_and_count_rm_01e'  
Make Kafka available to Jupyter
import os  
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 pyspark-shell'  
os.environ['PYSPARK_PYTHON'] = '/opt/conda/envs/python2/bin/python'  
Import dependencies

As well as Spark libraries, we're also bringing in the KafkaProducer library which will enable us to send messages to Kafka. This is in the kafka-python package. You can install this standalone on your system, or inline as done below.

# Necessary to make Kafka library available to pyspark
os.system("pip install kafka-python")

#    Spark
from pyspark import SparkContext  
#    Spark Streaming
from pyspark.streaming import StreamingContext  
from pyspark.streaming.kafka import KafkaUtils  
#    Kafka
from kafka import SimpleProducer, KafkaClient  
from kafka import KafkaProducer

#    json parsing
import json  
#    url deconstruction
from urlparse import urlsplit  
#    regex domain handling
import re  
Define values to match
filters=[]  
filters.append({"tag":"album","value": "Glory Days","match":["Glory Days","GloryDays"]})  
filters.append({"tag":"artist","value": "Little Mix","match":["Little Mix","LittleMix","Litel Mixx"]})  
filters.append({"tag":"track","value": "Shout Out To My Ex","match":["Shout Out To My Ex","Shout Out To My X","ShoutOutToMyEx","ShoutOutToMyX"]})  
filters.append({"tag":"track","value": "F.U.","match":["F.U","F.U.","FU","F U"]})  
filters.append({"tag":"track","value": "Touch","match":["Touch"]})  
filters.append({"tag":"track","value": "Oops","match":["Oops"]})  
Define whitelisted domains
domain_whitelist=[]  
domain_whitelist.append("itun.es")  
domain_whitelist.append("wikipedia.org")  
domain_whitelist.append("twitter.com")  
domain_whitelist.append("instagram.com")  
domain_whitelist.append("medium.com")  
domain_whitelist.append("spotify.com")  
Function: Unshorten shortened URLs (bit.ly etc)
# Source: http://stackoverflow.com/a/4201180/350613
import httplib  
import urlparse

def unshorten_url(url):  
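    # Issue a HEAD request and, if the response is a 3xx redirect with a
    # Location header, return that header; otherwise return the URL unchanged.
    # Note: only a single redirect hop is followed, and the query string of
    # the original URL is not sent (only parsed.path is requested).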
    parsed = urlparse.urlparse(url)
    h = httplib.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path)
    response = h.getresponse()
    if response.status/100 == 3 and response.getheader('Location'):
        return response.getheader('Location')
    else:
        return url
Function: Send messages to Kafka

To inspect the Kafka topics as messages are sent use:

kafka-console-consumer --zookeeper cdh57-01-node-01.moffatt.me:2181 --topic twitter_matched2  

N.B. following the Design Patterns for using foreachRDD guide here.

# http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd

def send_to_kafka_matched(partition):  
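    # The producer (and its imports) are set up inside this function so that
    # they are created on the executors, once per partition - producer objects
    # cannot be serialised and shipped from the driver (see the foreachRDD
    # design pattern referenced above).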
    from kafka import SimpleProducer, KafkaClient
    from kafka import KafkaProducer

    kafka_prod = KafkaProducer(bootstrap_servers='cdh57-01-node-01.moffatt.me:9092')
    for record in partition:
        kafka_prod.send('twitter_matched2', str(json.dumps(record)))

def send_to_kafka_notmatched(partition):  
    from kafka import SimpleProducer, KafkaClient
    from kafka import KafkaProducer

    kafka_prod = KafkaProducer(bootstrap_servers='cdh57-01-node-01.moffatt.me:9092')
    for record in partition:
        kafka_prod.send('twitter_notmatched2', str(record))

def send_to_kafka_err(partition):  
    from kafka import SimpleProducer, KafkaClient
    from kafka import KafkaProducer

    kafka_prod = KafkaProducer(bootstrap_servers='cdh57-01-node-01.moffatt.me:9092')
    for record in partition:
        kafka_prod.send('twitter_err2', str(record))
Function: Process each tweet

This is the main processing code. It implements all of the logic described in the requirements above. If a processing condition is not met, the function returns a negative code and description of the condition that was not met. Errors are also caught and returned.

You can see a syntax-highlighted version of the code in the notebook here.

def process_tweet(tweet):  
    # Check that there's a URLs in the tweet before going any further
    if tweet.get('retweeted_status'):
        return (-1,'retweet - ignored',tweet)

    if not tweet.get('entities'):
        return (-2,'no entities element',tweet)

    if not tweet.get('entities').get('urls'):
        return (-3,'no entities.urls element',tweet)

    # Collect all the domains linked to in the tweet
    url_info={}
    url_info['domain']=[]
    url_info['primary_domain']=[]
    url_info['full_url']=[]
    try:
        for url in tweet['entities']['urls']:
            try:
                expanded_url = url['expanded_url']
            except Exception, err:
                return (-104,err,tweet)

            # Try to resolve the URL (assumes it's shortened - bit.ly etc)
            try:
                expanded_url = unshorten_url(expanded_url)
            except Exception, err:
                return (-108,err,tweet)

            # Determine the domain
            try:
                domain = urlsplit(expanded_url).netloc
            except Exception, err:
                return (-107,err,tweet)
            try:
                # Extract the 'primary' domain, e.g. www36.foobar.com -> foobar.com
                #
                # This logic is dodgy for UK domains (foo.co.uk, foo.org.uk, etc) 
                # since it truncates to the last two parts of the domain only (co.uk)
                #
                re_result = re.search('(\w+\.\w+)$',domain)
                if re_result:
                    primary_domain = re_result.group(0)
                else:
                    primary_domain = domain
            except Exception, err:
                return (-105,err,tweet)

            try:
                url_info['domain'].append(domain)
                url_info['primary_domain'].append(primary_domain)
                url_info['full_url'].append(expanded_url)
            except Exception, err:
                return (-106,err,tweet)


        # Check domains against the whitelist
        # If every domain found is in the whitelist, we can ignore them
        try:
            if set(url_info['primary_domain']).issubset(domain_whitelist):
                return (-8,'All domains whitelisted',tweet)
        except Exception, err:
            return (-103,err,tweet)

        # Check domains against the blacklist
        # Unless a domain is found, we ignore it
        #Only use this if you have first defined the blacklist!
        #if not set(domain_blacklist).intersection(url_info['primary_domain']):
        #    return (-9,'No blacklisted domains found',tweet)


    except Exception, err:
        return (-102,err,tweet)

    # Parse the tweet text against list of trigger terms
    # --------------------
    # This is rather messy iterative code that maybe can be optimised
    #
    # Because match terms are not just words, it's not enough to break
    #  up the tweet text into words and match against the filter list.
    #  Instead we have to take each filter term and see if it exists
    #  within the tweet text as a whole
    #
    # Using a set instead of list so that duplicates aren't added
    #
    matched=set()
    try:
        for f in filters:
            for a in f['match']:
                tweet_text = tweet['text']
                match_text = a.decode('utf-8')
                if match_text in tweet_text:
                    matched.add((f['tag'],f['value']))
    except Exception, err:
        return (-101,err,tweet)

    #-----
    # Add the discovered metadata into the tweet object that this function will return
    try:
        tweet['enriched']={}
        tweet['enriched']['media_details']={}
        tweet['enriched']['url_details']=url_info
        tweet['enriched']['match_count']=len(matched)
        for match in matched:
            tweet['enriched']['media_details'][match[0]]=match[1]

    except Exception, err:
        return (-100,err,tweet)

    return (len(matched),tweet)
Function: Streaming context definition

This is the function that defines the streaming context. It needs to be a function because we're using windowing and so the streaming context needs to be configured to checkpoint.

As well as processing inbound tweets, it performs counts of:

  • Inbound
  • Outbound, by type (match/no match/error)
  • For matched tweets, top domains and artists

The code is commented inline to explain how it works. You can see a syntax-highlighted version of the code in the notebook here.

def createContext():  
    sc = SparkContext(appName="spark_twitter_enrich_and_count_01")
    sc.setLogLevel("WARN")
    ssc = StreamingContext(sc, batchIntervalSec)

    # Define Kafka Consumer and Producer
    kafkaStream = KafkaUtils.createStream(ssc, 'cdh57-01-node-01.moffatt.me:2181', app_name, {'twitter':1})

    ## Get the JSON tweets payload
    ## >>TODO<<  This is very brittle - if the Kafka message retrieved is not valid JSON the whole thing falls over
    tweets_dstream = kafkaStream.map(lambda v: json.loads(v[1]))

    ## -- Inbound Tweet counts
    inbound_batch_cnt = tweets_dstream.count()
    inbound_window_cnt = tweets_dstream.countByWindow(windowIntervalSec,batchIntervalSec)

    ## -- Process
    ## Match tweet to trigger criteria
    processed_tweets = tweets_dstream.\
        map(lambda tweet:process_tweet(tweet))

    ## Send the matched data to Kafka topic
    ## Only treat it as a match if we hit at least two of the three possible matches (artist/track/album)
    ## 
    ## _The first element of the returned object is a count of the number of matches, or a negative 
    ##  to indicate an error or no URL content in the tweet._
    ## 
    ## _We only want to send the actual JSON as the output, so use a `map` to get just this element_

    matched_tweets = processed_tweets.\
                        filter(lambda processed_tweet:processed_tweet[0]>1).\
                        map(lambda processed_tweet:processed_tweet[1])

    matched_tweets.foreachRDD(lambda rdd: rdd.foreachPartition(send_to_kafka_matched))
    matched_batch_cnt = matched_tweets.count()
    matched_window_cnt = matched_tweets.countByWindow(windowIntervalSec,batchIntervalSec)

    ## Set up counts for matched metadata
    ##-- Artists
    matched_artists = matched_tweets.map(lambda tweet:(tweet['enriched']['media_details']['artist']))

    matched_artists_batch_cnt = matched_artists.countByValue()\
                                    .transform((lambda foo:foo.sortBy(lambda x:-x[1])))\
                                    .map(lambda x:"Batch/Artist: %s\tCount: %s" % (x[0],x[1]))

    matched_artists_window_cnt = matched_artists.countByValueAndWindow(windowIntervalSec,batchIntervalSec)\
                                    .transform((lambda foo:foo.sortBy(lambda x:-x[1])))\
                                    .map(lambda x:"Window/Artist: %s\tCount: %s" % (x[0],x[1]))

    ##-- Domains
    ## Since url_details.primary_domain is an array, need to flatMap here
    matched_domains = matched_tweets.flatMap(lambda tweet:(tweet['enriched']['url_details']['primary_domain']))

    matched_domains_batch_cnt = matched_domains.countByValue()\
                                    .transform((lambda foo:foo.sortBy(lambda x:-x[1])))\
                                    .map(lambda x:"Batch/Domain: %s\tCount: %s" % (x[0],x[1]))
    matched_domains_window_cnt = matched_domains.countByValueAndWindow(windowIntervalSec,batchIntervalSec)\
                                    .transform((lambda foo:foo.sortBy(lambda x:-x[1])))\
                                    .map(lambda x:"Window/Domain: %s\tCount: %s" % (x[0],x[1]))

    ## Display non-matches for inspection
    ## 
    ## Codes less than zero but greater than -100 indicate a non-match (e.g. whitelist hit), but not an error

    nonmatched_tweets = processed_tweets.\
        filter(lambda processed_tweet:(-99<=processed_tweet[0]<=1))

    nonmatched_tweets.foreachRDD(lambda rdd: rdd.foreachPartition(send_to_kafka_notmatched))

    nonmatched_batch_cnt = nonmatched_tweets.count()
    nonmatched_window_cnt = nonmatched_tweets.countByWindow(windowIntervalSec,batchIntervalSec)

    ##  Print any erroring tweets
    ## 
    ## Codes less than -100 indicate an error (try...except caught)

    errored_tweets = processed_tweets.\
                        filter(lambda processed_tweet:(processed_tweet[0]<=-100))
    errored_tweets.foreachRDD(lambda rdd: rdd.foreachPartition(send_to_kafka_err))

    errored_batch_cnt = errored_tweets.count()
    errored_window_cnt = errored_tweets.countByWindow(windowIntervalSec,batchIntervalSec)

    ## Print counts
    inbound_batch_cnt.map(lambda x:('Batch/Inbound: %s' % x))\
        .union(matched_batch_cnt.map(lambda x:('Batch/Matched: %s' % x))\
            .union(nonmatched_batch_cnt.map(lambda x:('Batch/Non-Matched: %s' % x))\
                .union(errored_batch_cnt.map(lambda x:('Batch/Errored: %s' % x)))))\
        .pprint()

    inbound_window_cnt.map(lambda x:('Window/Inbound: %s' % x))\
        .union(matched_window_cnt.map(lambda x:('Window/Matched: %s' % x))\
            .union(nonmatched_window_cnt.map(lambda x:('Window/Non-Matched: %s' % x))\
                .union(errored_window_cnt.map(lambda x:('Window/Errored: %s' % x)))))\
        .pprint()

    matched_artists_batch_cnt.pprint()
    matched_artists_window_cnt.pprint()
    matched_domains_batch_cnt.pprint()
    matched_domains_window_cnt.pprint()

    return ssc
Start the streaming context
ssc = StreamingContext.getOrCreate('/tmp/%s' % app_name,lambda: createContext())  
ssc.start()  
ssc.awaitTermination()  
Stream Output Counters

From the stdout of the job we can see the simple counts of inbound and output splits, both per batch and accumulating window:

-------------------------------------------
Time: 2017-01-13 11:50:30
-------------------------------------------
Batch/Inbound: 9
Batch/Matched: 0
Batch/Non-Matched: 9
Batch/Errored: 0

-------------------------------------------
Time: 2017-01-13 11:50:30
-------------------------------------------
Window/Inbound: 9
Window/Non-Matched: 9

-------------------------------------------
Time: 2017-01-13 11:51:00
-------------------------------------------
Batch/Inbound: 21
Batch/Matched: 0
Batch/Non-Matched: 21
Batch/Errored: 0

-------------------------------------------
Time: 2017-01-13 11:51:00
-------------------------------------------
Window/Inbound: 30
Window/Non-Matched: 30

The details of identified artists within tweets are also tracked, per batch and accumulated over the window period (six hours, in this example):

-------------------------------------------
Time: 2017-01-12 12:45:30
-------------------------------------------
Batch/Artist: Major Lazer       Count: 4
Batch/Artist: David Bowie       Count: 1

-------------------------------------------
Time: 2017-01-12 12:45:30
-------------------------------------------
Window/Artist: Major Lazer      Count: 1320
Window/Artist: Drake    Count: 379
Window/Artist: Taylor Swift     Count: 160
Window/Artist: Metallica        Count: 94
Window/Artist: David Bowie      Count: 84
Window/Artist: Lady Gaga        Count: 37
Window/Artist: Pink Floyd       Count: 11
Window/Artist: Kate Bush        Count: 10
Window/Artist: Justice  Count: 9
Window/Artist: The Weeknd       Count: 8
Kafka Output Matched

The matched Kafka topic holds a stream of tweets in JSON format, with the discovered metadata (artist/album/track) added. I'm using the Kafka console consumer to view the contents, parsed through jq to show just the tweet text and metadata that has been added.

kafka-console-consumer --zookeeper cdh57-01-node-01.moffatt.me:2181 \  
--topic twitter_matched1 \
--from-beginning|jq -r "[.text,.enriched.url_details.primary_domain[0],.enriched.media_details.artist,.enriched.media_details.album,.enriched.media_details.track,.enriched.match_count] "
[
  "Million Reasons by Lady Gaga - this is horrendous sorry @ladygaga https://t.co/rEtePIy3OT",
  "youtube.com",
  "https://www.youtube.com/watch?v=NvMoctjjdhA&feature=youtu.be",
  "Lady Gaga",
  null,
  "Million Reasons",
  2
]
Non-Matched

On our outbound Kafka topics we can see the non-matched messages. You'd probably disable this stream once the processing logic was finalised, but it's useful to be able to audit and validate the reasons for non-matches. Here a retweet is ignored, and we can tell it's a retweet from the RT prefix of the text field. The -1 is the return code from the process_tweet function denoting a non-match:

(-1, 'retweet - ignored', {u'contributors': None, u'truncated': False, u'text': u'RT @ChartLittleMix: Little Mix adicionou datas para a Summer Shout Out 2017 no Reino Unido https://t.co/G4H6hPwkFm', u'@timestamp': u'2017-01-13T11:45:41.000Z', u'is_quote_status': False, u'in_reply_to_status_id': None, u'id': 819873048132288513, u'favorite_count': 0, u'source':
Summary

In this article we've built on the foundations of the initial exploration of Spark Streaming on Python, expanding it to address a real-world processing requirement. Processing unbounded streams of data this way is not as complex as you might think, particularly given the benefits it can yield in reducing the latency between an event occurring and action being taken on it.

We've not touched on some of the more complex areas, such as scaling this up to multiple Spark nodes, partitioned Kafka topics, and so on - that's another [kafka] topic (sorry...) for another day. The code itself is also rudimentary - before moving it into Production there'd be some serious refactoring and optimisation review to be performed on it.

You can find the notebook for this article here, and the previous article's here.

If you'd like more information on how Rittman Mead can help your business get the most of out its data, please do get in touch!

Categories: BI & Warehousing

Oracle Sort Order Issue

Tom Kyte - Fri, 2017-01-13 09:46
Hi Tom, According to Oracle document, the sort order depends on database option NLS_SORT, and there are only 2 kinds of sort: Binary Sort and Linguistic Sort. Is there any other options to change the sort order? Currently, the following sql wi...
Categories: DBA Blogs

Converting mixed Latin + Japanese text to Shift-JIS

Tom Kyte - Fri, 2017-01-13 09:46
Hi Tom, I have a string of mixed Latin and Japanese characters that must be converted to Shift-JIS when output to a particular report (not always). The DB character set is AL32UTF8. When I execute this statement: convert('H. pylori?????...
Categories: DBA Blogs

Insert with select subquery hangs for a long time

Tom Kyte - Fri, 2017-01-13 09:46
Oracle database version: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production PL/SQL Release 11.2.0.4.0 - Production "CORE 11.2.0.4.0 Production" TNS for 64-bit Windows: Version 11.2.0.4.0 - Production NLSRTL Versi...
Categories: DBA Blogs

Calculating last 5 business working days

Tom Kyte - Fri, 2017-01-13 09:46
Hi Tom, I want to get last 5 business days. If its friday today, I need from passed monday to this friday and if its monday today...last tuesday to this monday is required. Saturday and sunday need to skipped in calculating last 5 business days. ...
Categories: DBA Blogs

Frename function not working consistently in PL/SQL

Tom Kyte - Fri, 2017-01-13 09:46
Hi, I have a PL/SQL proc to rename files under directory. I am using the below function to rename the files :- UTL_FILE.FRENAME('ATTDIR', FILE_NAME2, 'ATTDIR', V_ID1, TRUE); The files are existing on the directory even though the P...
Categories: DBA Blogs

Imported Wrong data

Tom Kyte - Fri, 2017-01-13 09:46
Hi Team, Today by mistake I imported Wrong data from production to test server Db. I want to restore test tables to their original state .I don't have backup of test server data.I tried flash but it is not helpful here. What should I do to get ...
Categories: DBA Blogs

Is it possible to install 11G and 12C in Windows 7 64 bit.

Tom Kyte - Fri, 2017-01-13 09:46
Hi, Can you help me to resolve the issue. Scenario 1: I have 11G 64 bits running fine in Windows 64 bits but When I was trying to install 12C 64 bits in same system, Its throwing an error while creating database through DBCA. Error is: Error e...
Categories: DBA Blogs

Performance - Number vs. VARCHAR2 datatypes

Tom Kyte - Fri, 2017-01-13 09:46
Hello, I am designing a system that will store and query billions of IP addresses, both IPv4 and IPv6. I was wondering - (a) Will storing the IP as an integer (e.g Number datatype) instead of a string (e.g. VARCAHR2) give better performance in ...
Categories: DBA Blogs

Multiple DB triggers on a table for a given ledger or OU or Inventory Org

Tom Kyte - Fri, 2017-01-13 09:46
Hi, We have a requirement where the plan is to track any inserts/updates/deletes on some EBS tables. A company has operations across countries & continents. Can database triggers be created on the same table to track changes based on the Ledger (S...
Categories: DBA Blogs

Lost all controlfiles and not having database tablespace and datafiles information to create a new control file

Tom Kyte - Fri, 2017-01-13 09:46
Hello Tom/Maria/Team We ran a bad task in a cron in linux which remove all files modified 1 day ago, the database was up&running but idle=>without use bc it is test db, then the task removed control files and spfile bc it was down but I started it...
Categories: DBA Blogs

OPN Oracle Cloud Subscription, Pricing and Metering

Exclusive Live Webcast ...

We share our skills to maximize your revenue!
Categories: DBA Blogs
