Feed aggregator

Bulk collect into an existing collection that already has data

Tom Kyte - Sat, 2017-08-26 14:26
Can I bulk collect into an existing collection that already has data and preserve the original data? In other words append the new data to the existing data? The output I'm looking for would be (order not important): QWERTY ASDF Haikus are e...
Categories: DBA Blogs

Postgres vs. Oracle access paths XI – Sample Scan

Yann Neuhaus - Sat, 2017-08-26 09:33

I was going to end this series with the previous post because the last access path available in Postgres is a bit special: a Seq Scan that returns only a sample of the rows, at random. However, it is the occasion to come back to the difference between random and sequential reads.

I’m still working on the same table as in the previous posts, with 10000 rows in 1429 pages. 5% of rows is 500 rows and 5% of blocks is about 72 pages.

Rows

Sometimes you can answer your business question on a sample of rows, when you need an approximate result, trend or pattern. Let's say that you want to sum() on only 5 percent of the rows:
explain (analyze,verbose,costs,buffers) select sum(a) from demo1 tablesample bernoulli(5) ;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1435.25..1435.26 rows=1 width=8) (actual time=1.940..1.940 rows=1 loops=1)
Output: sum(a)
Buffers: shared hit=1429
-> Sample Scan on public.demo1 (cost=0.00..1434.00 rows=500 width=4) (actual time=0.007..1.890 rows=509 loops=1)
Output: n, a, x
Sampling: bernoulli ('5'::real)
Buffers: shared hit=1429
Planning time: 0.373 ms
Execution time: 1.956 ms

This row sampling reads all rows and picks a sample of them at random. Unfortunately, it has to read all blocks, because you cannot get a good sample if you don't know how many rows there are in each block. Working on a sample can still make sense if you want to apply complex operations on the result. Here the cost in the database is similar to a Seq Scan: 1429 blocks read at seq_page_cost=1, plus the sum() applied to 500 rows (cpu_operator_cost=0.0025), and 500 tuples from the scan plus 1 tuple for the result at cpu_tuple_cost=0.01.
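
To check the arithmetic, here is a minimal sketch of the cost breakdown, assuming the default planner parameters (the settings below are the standard PostgreSQL ones):

show seq_page_cost;     -- 1
show cpu_tuple_cost;    -- 0.01
show cpu_operator_cost; -- 0.0025
-- Sample Scan: 1429 pages * 1 + 500 tuples * 0.01                = 1434.00
-- Aggregate:   1434.00 + 500 rows * 0.0025 + 1 result row * 0.01 = 1435.26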

From the execution statistics, you can see that the estimate is exactly what we asked for (500 rows), and the actual sample is close to it: 509 rows returned.

Oracle has a different syntax and different algorithm:
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID 1tsadjdd9ddam, child number 0
-------------------------------------
select /*+ */ sum(a) from demo1 sample(5)
-----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers |
-----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 397 (100)| 1 |00:00:00.01 | 581 |
| 1 | SORT AGGREGATE | | 1 | 1 | | 1 |00:00:00.01 | 581 |
| 2 | TABLE ACCESS SAMPLE| DEMO1 | 1 | 500 | 397 (0)| 478 |00:00:00.01 | 581 |
-----------------------------------------------------------------------------------------------------
Column Projection Information (identified by operation id):
-----------------------------------------------------------
1 - (#keys=0) SUM("A")[22]
2 - (rowset=256) "A"[NUMBER,22]

Here we have not read all the blocks. Only 40% of them. This is faster than the Postgres approach, but the drawback is that the result is not exact: 478 rows were returned here.

Blocks

When we can afford an approximate sampling, we can sample on blocks rather than on rows:
explain (analyze,verbose,costs,buffers) select sum(a) from demo1 tablesample system(5) ;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Aggregate (cost=290.25..290.26 rows=1 width=8) (actual time=0.479..0.480 rows=1 loops=1)
Output: sum(a)
Buffers: shared hit=73
-> Sample Scan on public.demo1 (cost=0.00..289.00 rows=500 width=4) (actual time=0.016..0.377 rows=511 loops=1)
Output: n, a, x
Sampling: system ('5'::real)
Buffers: shared hit=73
Planning time: 0.698 ms
Execution time: 0.509 ms

The number of rows is still close to the target here, but the result may depend on which blocks are sampled. Only 73 blocks were read, which is about 5%, and of course the rows may be distributed differently within the blocks. The advantage is that it is faster because it reads fewer blocks. But since those blocks are picked at random, they are by definition random reads: 71 pages read at random_page_cost=4 and, as in the previous case, 501 tuples at cpu_tuple_cost=0.01 and 500 operations at cpu_operator_cost=0.0025.
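
The same arithmetic sketch for the block sample, again assuming the default planner parameters:

show random_page_cost; -- 4
-- Sample Scan: 71 pages * 4 + 500 tuples * 0.01                 = 289.00
-- Aggregate:   289.00 + 500 rows * 0.0025 + 1 result row * 0.01 = 290.26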

With block sampling, Oracle reads a smaller number of blocks than with row sampling, but still more than 5%, and the number of rows is not exact: 798 rows here:
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID fqgbwqfavgdrn, child number 0
-------------------------------------
select /*+ */ sum(a) from demo1 sample block(5)
-----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers |
-----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 22 (100)| 1 |00:00:00.01 | 134 |
| 1 | SORT AGGREGATE | | 1 | 1 | | 1 |00:00:00.01 | 134 |
| 2 | TABLE ACCESS SAMPLE| DEMO1 | 1 | 500 | 22 (0)| 798 |00:00:00.01 | 134 |
-----------------------------------------------------------------------------------------------------
Column Projection Information (identified by operation id):
-----------------------------------------------------------
1 - (#keys=0) SUM("A")[22]
2 - (rowset=256) "A"[NUMBER,22]

Again, as for the previous access paths: same concepts but a different implementation between Postgres and Oracle. Everything looks similar and easily portable from a high-level overview, but going into the details you see all those little differences which make it not so easy to be database agnostic or easily portable.

Summary

This is the end of this series comparing Postgres access paths with Oracle ones. The goal is not to tell you that one is better than the other. They have different approaches, different targets, different prices, different histories and probably different futures. But understanding how they work and how they estimate the cost is a good way to learn. I'm learning a lot about Postgres myself while writing those posts, matching things I discover in Postgres with those I have known for a while in Oracle.

Here is the list of posts on Access Path:

  1. Postgres vs. Oracle access paths – intro
  2. Postgres vs. Oracle access paths I – Seq Scan
  3. Postgres vs. Oracle access paths II – Index Only Scan
  4. Postgres vs. Oracle access paths III – Partial Index
  5. Postgres vs. Oracle access paths IV – Order By and Index
  6. Postgres vs. Oracle access paths V – FIRST ROWS and MIN/MAX
  7. Postgres vs. Oracle access paths VI – Index Scan
  8. Postgres vs. Oracle access paths VII – Bitmap Index Scan
  9. Postgres vs. Oracle access paths VIII – Index Scan and Filter
  10. Postgres vs. Oracle access paths IX – Tid Scan
  11. Postgres vs. Oracle access paths X – Update
  12. Postgres vs. Oracle access paths XI – Sample Scan

I think my next series will be on Join methods.

 

The post Postgres vs. Oracle access paths XI – Sample Scan appeared first on Blog dbi services.

#GoldenGate Classic vs Microservices Architecture

DBASolved - Fri, 2017-08-25 23:30

With the release of Oracle GoldenGate 12c (12.3.0.1.0) came the introduction of a new architecture that can be used to replicate your business information. This architecture had many different names over the years as it was being developed; we (Oracle) finally settled on the name "Microservices Architecture" as we got closer to general release. There are many benefits to the Microservices Architecture, and these benefits should give you reason to look closely at this new feature of Oracle GoldenGate 12c.

Before we get into the Microservices Architecture, let’s review the Classic Architecture. In the below image, you see a pretty standard Oracle GoldenGate implementation.

In this architecture, the primary access into the Oracle GoldenGate environment is through the GoldenGate Software Command Interface (GGSCI). After logging into GGSCI, you can interact with and administer the associated processes, i.e. Manager, Extract (Capture), Data Pump, and Replicat (Apply). The Collectors are pretty much hidden on the target systems, but they are there; they are just not seen through GGSCI. In this architecture, your data is replicated over TCP/IP between the Data Pump processes and the Collectors using local and remote trail files.

The downside to this architecture is that, in order to administer the environment, you have to physically log in to the server where Oracle GoldenGate is running. Because of this requirement, many organizations restricted who had access to the server, which often caused a debate over who truly owned Oracle GoldenGate.

Although the Classic Architecture of Oracle GoldenGate has been a bedrock of replication for nearly 20 years, we (Oracle) wanted to leverage that bedrock to transform the way we (industries) replicate data today. This led to the more flexible and highly scalable Microservices Architecture. As you may have guessed, "Microservices" is the mechanism that we are using to provide access to the RESTful API end points. By using RESTful APIs, we (Oracle) have taken a huge leap forward in the replication space. We have broken down the limitations we had around administration and access, while at the same time remaining true to the bedrock that is the foundation of Oracle GoldenGate.

The below image is a view of a simple Oracle GoldenGate Microservices Architecture, for you to review.

As you will notice, some traditional components of Oracle GoldenGate are missing, while there are still extracts (capture), trail files, and replicats (apply). This is because it is a completely new replication architecture, which brings many benefits. A few of these benefits are:

  • Remote Administration
  • SSL Support
  • HTML 5 webpages for each service/server
  • Additional replication protocols (WSS, WS, UDT, OGG)
  • Real-time Performance Metrics

In order to understand the Microservices Architecture, you have to understand what each of the servers (or services) provides within the architecture. So, let's take a moment and talk about these items, starting with the ServiceManager.

ServiceManager:
The ServiceManager is the watchdog process for the architecture on each server in the replication environment. Ideally, you should have only one of these processes running. This process can be configured to run in one of three ways:

  • Manually
  • As a daemon
  • Integrated with XAG

While the ServiceManager is running, this process is the main entry point into the Oracle GoldenGate environment. During the configuration process, you will be asked to assign ports for each of the servers to run on. The ServiceManager will be the first port you assign. From the HTML5 page of the ServiceManager, you will be able to see all of your deployment homes and associated servers.

AdminServer:
The AdminServer is the service that takes the place of GGSCI (don't worry, we still have a command line in this architecture) and the Manager in the Classic Architecture. From here, you will be able to set up your credential store, extracts and replicats. Most Oracle GoldenGate administrators will spend their time here. Additionally, from this service you can drill into a running process and review its current status, statistics, parameter file, and report file, making your administration in general much simpler.

DistributionServer:
The DistributionServer is the replacement for the Data Pump Extract. This service performs all the same functionality as the Data Pump Extract with the exception of transformations. Besides providing the same functionality, you also get a visual representation of where your trail file is being read from and shipped to, which is very clear to see from the overview page of this service. As you dig into the details of the DistributionServer, you can see statistics on what is being read from and written to trail files and adjust TCP/IP settings within the distribution path.

ReceiverServer:
The ReceiverServer is the replacement for the Collectors. The whole job of the ReceiverServer is to accept transmissions from the DistributionServer and write out the remote trail files. From the overview page of this service, you can clearly see where the information is coming from and what trail it is writing to. Just like the DistributionServer, if you look at the details of this service you can see a lot of useful information.

Lastly, there is the most interesting of the services within the Microservices Architecture:

Performance Metrics Server:
Finally, we (Oracle) have provided a real-time performance monitoring service with Oracle GoldenGate. Before you get too excited about having a new way to monitor performance, note that you must have a license for the Oracle Management Pack for GoldenGate before you can use the GUI or the associated metric APIs. If you have that in place, there is a wealth of performance metric information you can retrieve and use in both the GUI and the APIs. I would encourage you to take a look.

With that my friends, hopefully you are a bit excited about using the new Microservices Architecture. There is so much you can do with this architecture, and it is going to change how we replicate data on-premises, in the cloud, and in hybrid environments.

Enjoy!!!


Filed under: Golden Gate
Categories: DBA Blogs

Top N salaries but department with highest overall salary should be top/first

Tom Kyte - Fri, 2017-08-25 20:06
Top N salaries, but the department with the highest overall salary should come first, then the dept with the 2nd highest (with rank 1) should come next. How do I achieve this in a SQL query? Here is sample data: create table emp_sal_test ( dept# nu...
Categories: DBA Blogs

partition_extension_clause in where clause

Tom Kyte - Fri, 2017-08-25 20:06
In the Oracle 12c documentation SQL Language Reference, for the SQL SELECT statement syntax, under the partition_extension_clause section, there is this statement: For PARTITION or SUBPARTITION, specify the name or key value of the partition or...
Categories: DBA Blogs

Using Dynamic Table Name in Select Statement To Prepare Extract File

Tom Kyte - Fri, 2017-08-25 20:06
Hi. We prepare extract files from Oracle using an extract script. There are a few tables whose names change every month. We need to handle these table names dynamically. SELECT TICKET_ID ,SOURCE_ID ,SERV_ID ,...
Categories: DBA Blogs

Pete Finnigan is now an Oracle ACE

Pete Finnigan - Fri, 2017-08-25 20:06
I just got an email from the Oracle ACE program to tell me that I had been accepted onto the ACE program and was awarded the Oracle ACE status by Oracle. I have been active on the internet around Oracle....[Read More]

Posted by Pete On 25/08/17 At 07:28 PM

Categories: Security Blogs

Oracle Security at UKOUG December 2017

Pete Finnigan - Fri, 2017-08-25 20:06
I have just had an email from the UKOUG to say that three of my presentations have been accepted for the upcoming conference on December 4th to 6th at the ICC in Birmingham. I will have one talk on the....[Read More]

Posted by Pete On 25/08/17 At 04:16 PM

Categories: Security Blogs

The Week After: Game of Thrones S07 E06 Tweets and Press Reviews Analysis

Rittman Mead Consulting - Fri, 2017-08-25 09:56

Another week is gone, another "Game of Thrones" episode watched, only one left until the end of the 7th series.
The "incident" in Spain, with the episode released for a few hours on Wednesday, wrecked all my plans to do a time-wise comparison between episodes across several countries.


I was then forced to think of a new action plan in order to avoid disappointing all the fans who enjoyed my previous blog post about episode 5. What you'll read in today's analysis is based on the same technology as before: a Kafka Connect source from Twitter and a sink to BigQuery, with Tableau analysis on top.


What I changed in the meantime is the data structure setup: in the previous post there was a BigQuery table rm_got containing #GoT tweets and an Excel table containing keywords for each character together with the Name and the Family (or House). Finally, there was a view on top of the BigQuery rm_got table extracting all the words of each tweet in order to analyse their sentiment.
For this week's analysis I tried to optimise the dataflow, mainly by pushing data into BigQuery, and I added a new part to it: online press review analysis!

Optimization

As mentioned in my previous post, the setup described before mimicked an analyst workflow without write access to the data source. However, it was far from optimal performance-wise, since there was a cartesian join between two data sources, meaning that for every query the whole dataset was extracted from BigQuery and then joined in memory in Tableau, even if filters for a specific character were included.

The first change was pushing the characters' Excel data into BigQuery, so at least we could use proper data source joins instead of relying on Tableau's data blending. This has the immediate benefit of running joins and filters in the data source rather than retrieving all the data and filtering locally in memory.
Pushing Excel data into BigQuery is really easy and can be done directly in the web GUI; we just need to transform the data into CSV, which is one of the allowed input data formats.


Still, this modification alone doesn't resolve the problem of the cartesian join between the characters (stored in rm_characters) and the main rm_got table, since BigQuery's native join conditions don't allow the CONTAINS function we would need to verify that the character Key is contained in the Tweet's Text.
Luckily I already had the rm_words view, used in the previous post, splitting the words contained in the Tweet Text into multiple rows. The view contained the Tweet's Id and could be joined with the characters data using an equality condition.

However, my over-simplistic first implementation of the view removed only the # and @ characters from the Tweet text, leaving all the other punctuation signs in the words, as you can see in the image below.


I replaced the old rm_words view code with the following

SELECT  id, TEXT, SPLIT(REGEXP_REPLACE(REPLACE(UPPER(TEXT), 'NIGHT KING', 'NIGHTKING'),'[^a-zA-Z]',' '),' ')  f0__group.word FROM [big-query-ftisiot:BigQueryFtisiotDataset.rm_got]  

Which has two benefits:

  • REPLACE(UPPER(TEXT), 'NIGHT KING', 'NIGHTKING'): since I'm splitting into words, I don't want to miss references to the Night King, which is composed of two words that, even when written separately, point to the same character.
  • REGEXP_REPLACE(..,'[^a-zA-Z]',' '): replaces using a regular expression, removing any character other than the letters A-Z (in lower and upper case) from the Tweet Text. A quick check of both points is shown right after this list.
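
Here is the quick check, running the same expression on a made-up literal (the tweet text is invented, only to illustrate the cleansing):

SELECT SPLIT(REGEXP_REPLACE(REPLACE(UPPER('Loved the Night King, wow! #GoT'), 'NIGHT KING', 'NIGHTKING'),'[^a-zA-Z]',' '),' ') word
-- expected output, one word per row: LOVED, THE, NIGHTKING, WOW, GOT
-- (consecutive separators may also produce empty tokens, depending on SPLIT's behaviour)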

The new view definition provides a clean set of words that can finally be joined with the list of character keys. The last step I did to prepare the data was to create a single view containing all the fields I was interested in for my analysis, with the following code:

SELECT  
  rm_got.Id,
  rm_got.Text,
  rm_got.CreatedAt,
  [...]
  characters.Key,
  characters.Name,
  characters.Family
FROM  
  [DataSet.rm_got] AS rm_got JOIN
  [DataSet.rm_words] AS rm_words ON rm_got.id=rm_words.id JOIN 
  (SELECT * FROM FLATTEN([DataSet.rm_words],f0__group.word)) AS rm_words_char ON rm_got.id=rm_words_char.id JOIN 
  [DataSet.rm_charachters] AS characters ON rm_words_char.f0__group.word = characters.Key

Two things to notice:

  • The view rm_words is used two times: once, as mentioned before, to join the Tweet with the character data and once to list all the words contained in a tweet.
  • The (SELECT * FROM FLATTEN([DataSet.rm_words],f0__group.word)) subselect is required since the word column contained in rm_words is a repeated field, which can't be used in a join condition unless flattened.


Please note that the SQL above will still duplicate the Tweet rows: in reality we'll have a row for each word and each character Key contained in the Text itself. Still, this is a big improvement over the cartesian join we used in our first attempt.

One last mention of optimizations: currently the sentence and word sentiment is calculated on the fly in Tableau using the SCRIPT_INT function. This means that data is extracted from BigQuery into Tableau, then passed to R (running locally on my PC), which computes the score and returns it to Tableau. In order to optimize Tableau performance I could pre-compute the scores in R and push them into a BigQuery table, but this would mean a pre-processing step that I wanted to avoid, since real-time analysis was one of my goals.


Tweet Analysis

With my tidy dataset in place, I can now start the analysis and, as in the previous week, track various KPIs like the mentions by character Family and Name. To filter only the current week's data I created two parameters, Start Date of Analysis and End Date of Analysis.


Using those parameters I can filter which days I want to include in my analysis. To apply the filter in the Workbook/Dashboard I also created a column Is Date of Analysis with the following formula:

IIF(DATE([CreatedAt]) >= [Start Date of Analysis]  
AND DATE([CreatedAt]) <= [End Date of Analysis]  
,'Yes','No')

I can now use the Is Date of Analysis column in my Workbooks and filter the Yes value to retain only the selected dates.

I built a dashboard containing a few of the analyses mentioned in my previous blog post, in which I can see the overall scatterplot of characters by # of Tweets and Sentence Sentiment and click on one of them to check the details of the most common words used and the sentence sentiment.


From the scatterplot on top we can see a change of leadership in the # of Tweets, with Daenerys overtaking Jon by a good margin; saving him while losing one of the three dragons was a touching moment in the episode. When clicking on Daenerys we can see that the word WHITE is also driving the positive sentiment.


The Night King keeps his leadership on the positive sentiment side; also in this case WHITE is the most used word with positive sentiment. On the other side, Arya overtook Sansa as the character with the most negative mentions. When going into detail on the positive/negative words, we can clearly see that STARK (mentioned in the previous episode), KILL, WRONG and DEATH are driving the negative sentiment. Also interesting is the word WEAR with negative sentiment (from the Google dictionary: "damage, erode, or destroy by friction or use.").



A cut-down version of the workbook, with a limited dataset and visible in the image below, is available in Tableau Public.


Game of Couples


This comparison is all that I promised towards the end of my first post, so I could easily stop here. However, as a curious person and a #GoT fan myself, I wanted to know more about the dataset and in particular analyse how character interactions affect sentiment. To do so I somehow had to join characters together if they were mentioned in the same tweet. Luckily my dataset contained the character mentioned and the list of words of each Tweet, so I can reuse the list of words in a left join with the list of character keys. In this way I have a record for each couple of characters mentioned in a Tweet.
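
For reference, here is a minimal sketch of that pairing join, in the same legacy BigQuery SQL style and with the same dataset and column names as the views above (a plain inner join is shown for brevity):

SELECT
  w1.id AS Id,
  c1.Name AS Character1,
  c2.Name AS Character2
FROM
  (SELECT * FROM FLATTEN([DataSet.rm_words],f0__group.word)) AS w1 JOIN
  [DataSet.rm_charachters] AS c1 ON w1.f0__group.word = c1.Key JOIN
  (SELECT * FROM FLATTEN([DataSet.rm_words],f0__group.word)) AS w2 ON w1.id = w2.id JOIN
  [DataSet.rm_charachters] AS c2 ON w2.f0__group.word = c2.Key
WHERE c1.Name != c2.Name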


I can then start analysing the Tweets mentioning any couple of characters, with the # of Tweets driving the colour gradient. As you can see, I removed the values where the column and row are equal (e.g. Arya and Arya). The result, as expected, is a symmetric matrix, since the # of Tweets mentioning Arya and Sansa is the same as the # mentioning Sansa and Arya.


We can clearly see that Jon and Daenerys are the most mentioned couple, with Sansa and Arya following and, in third place, the Whitewalkers and Bran. This view and the insights we take from it could be problematic in cases where the reader is colour blind or has trouble judging intensity. For those cases a view like the one below provides the same information (by simply switching the # of Tweets column from Color to Size); however, it has the drawback that small squares are hard to see.


The next step in my "couple analysis" is to understand sentiment, and how a second character mentioned in the same tweet affects the positive/negative score of a character. The first step was to show the same scatterplot as before, but filtered for a single character, in this case Arya.


The graph shows Arya's original position, and how the Sentiment and the # of Tweets change that position when another character is included in the Tweet. We can see that, when she is mentioned with Daenerys, the sentiment is much more positive, while when she is mentioned with Bran or Littlefinger the sentiment remains almost the same.

This graph is very easy to read; however, it has the limitation of displaying only one character's behaviour at a time (in this case Arya's). What I wanted was to show the same pattern across all characters, in a similar way as when analysing the # of Tweets per couple. To do so I went back to a matrix style of visualization, setting the colour based on positive (green) or negative (red) sentiment.


As before the matrix is symmetric, and it provides us with a new set of insights. For example, when analysing Jorah Mormont, we can see that a mention together with Cersei is negative, which we can somewhat expect due to the nature of the queen. What's strange is that when Jorah is mentioned with Samwell Tarly there is also a negative feeling. Looking more deeply into the data we can see that it's due to a single tweet containing both names with a negative sentiment score.


What's missing in the above visualization is an indication of how "strong" the relationship between two characters is, based on the # of Tweets where they are mentioned together. We can add this by using the # of Tweets to set the position of the sentiment square: the more the square is moved towards the right, the higher the # of Tweets mentioning the two characters together.


We can see as before that Jorah and Sam have a negative feeling when mentioned together, but it's not statistically significant because the # of Tweets is very limited (square position completely on the left). Another example is Daenerys and Jon, which have a lot of mentions together with a neutral sentiment. As we saw before, the couple Arya and Bran also have a negative feeling when mentioned together, with a limited number of Tweets mentioning them together. However, Bran mentioned with the Whitewalkers has a strong positive sentiment.

It's worth mentioning that the positioning of the dot is based on a uniform scale across the whole matrix. This means that if, as in our case, there is a dominant couple (Daenerys and Jon) mentioned by an order of magnitude more Tweets than all the other couples, the difference in positioning of all the other dots will be minimal. This could however be solved using a logarithmic scale.

Web Scraping

Warning: all the analysis done in this article, including this chapter, is performed with automated tools. Due to the nature of the subject (a TV series full of deaths, battles and thrilling scenes), the words used to describe a scene could be automatically classified as positive/negative. This doesn't automatically mean that the opinion of the writer is either positive or negative about the scene/episode/series.

The last part of the analysis I had in mind was comparing the Tweet sentiment with the sentiment of the episode reviews that I could find online. This latter part relies a lot on R to scrape the relevant bits from the web pages; the whole process was:

  • Search on Google for Beyond the Wall Reviews
  • Take the top N results
  • Scrape the review from the webpage
  • Tokenize the review in sentences
  • Assign the sentence score using the same method as in Tableau
  • Tokenize the sentence in words
  • Upload the data into BigQuery for further analysis

A few bits on the solution I used to accomplish this: since the reviews come from different websites with different tags, classes and Ids, I wasn't able to write a general scraper for all websites. However, each review webpage I found had the main text divided into multiple <p> tags under a main <div> tag which had a unique Id or class. The R code simply lists the <div> elements, finds the one with the correct Id or class and takes all the data contained inside the <p> elements. A single function is called with three parameters: the website, the SourceName (e.g. Telegraph) and the Id or class to look for. The call to the function looks like:

sentence_df <- scrapedata("http://www.ign.com/articles/2017/08/21/game-of-thrones-beyond-the-wall-review",'Ign',"article-content")  

It will return a dataframe containing one row per <p> tag, together with a mention of the source (Ign in this case).

The rest of the R code tokenizes the text into sentences and words using the tokenizers package and assigns the related sentiment score with the syuzhet package used in my previous blog post. Finally it creates a newline-delimited JSON file, which is one of the input formats accepted by BigQuery.
When the data is in BigQuery, the analysis follows the same approach as before, with Tableau connecting directly to BigQuery and again using R for word sentiment scoring.

The overall result in Tableau includes a global episode sentiment score by Source, the usual scatterplot by character, and the same by Source. Each of the visualizations can act as a filter for the others.


We can clearly see that AVClub and Indiewire had opposite feelings about the episode. Jon Snow is the most mentioned character with Arya and Sansa overtaking Daenerys.

The AVClub vs Indiewire scoring can be explained by the sentence sentiment categorization: Indiewire had mostly negative sentences (negative evaluations), while the distribution for AVClub has its peak on the 1 (positive) value.


Checking the words used by the two Sources, we notice as expected a majority of positive words for AVClub, while Indiewire's overall counts are almost equal.


Going into detail on the words, we can see the positive sentiment of AVClub being driven by ACTION, SENSE and REUNION, while Indiewire's negative one is due to ENEMY, BATTLE and HORROR.


This is the automated overall sentiment analysis; if we read the two articles from Indiewire and AVClub in detail, we can see that the overall opinion is not far from the automated score:

From AVClub

On the level of spectacle, “Beyond The Wall” is another series high point, with stellar work ....

From Indiewire

Add to the list “Beyond the Wall,” an episode that didn’t have quite the notable body count that some of those other installments did

To be fair, we also need to say that the Indiewire article focuses on the war and the thrilling scene with the Whitewalkers, where words like ENEMY, COLD, BATTLE and DEATH, which have a negative sentiment, are actually only used to describe the scene and not the feelings related to it.

Character and Review Source Analysis

The last piece of analysis is related to single characters. As mentioned before, part of the dashboard built in Tableau included the Character scatterplot and the Source scatterplot. By clicking on a single character I can easily filter the Source scatterplot, in this case for Daenerys.


We can see how different Sources have a different average sentiment score for the same character, in this case with Mashable being positive while Pastemagazine is negative.


Checking the words mentioned, we can clearly see a positive sentiment related to PRESENT, AGREED and RIDER for Mashable, while the negative sentiment of Pastemagazine is driven by FIGHT, DANGER and LOOSING. As said before, just a few words of difference describing the same scene can make the difference.

Finally, one last sentence on the very positive sentiment score for Gregor Clegane: it is partially due to the reference to his nickname, the Mountain, which is used as a Key to find references. The word mountain is contained in a series of sentences referring to the place where the group led by Jon Snow is heading in order to find the Whitewalkers. We could easily remove MOUNTAIN from the keywords to eliminate the mismatch.


We are at the end of the second post about Game of Thrones analysis with Tableau, BigQuery and Kafka. Hope you didn't get bored...see you next week for the final episode of the series! And please avoid waking up with blue eyes!


Categories: BI & Warehousing

Point the Camera at Your Face: Where is Your 2 Minute Tech Tip?

OTN TechBlog - Fri, 2017-08-25 06:00

The very first video in the community-driven 2 Minute Tech Tip video series was published on August 29, 2014. Since then 230 more 2MTT videos have been recorded and published, featuring insight and expertise from community members around the world, covering a wide variety of topics and technologies.

And now it's your turn.

Sharing your technical skill in a 2 Minute Tech Tip is easy. Got a webcam on your laptop? Got a smart phone? Either one will do the job. Just point the camera at your face (or vice-versa) and start talking. But instead of sharing the details of your vacation trip ("Hey! Does this look infected to you?"), you're going to share some of the technical skill you've accumulated as a software developer, solution architect, DBA, SysAdmin, or whatever.

Once you've recorded your video, send it to me. I take care of the rest.

As an alternative to recording your own tip, I can record you remotely via Skype.

Even better, if you're going to be at Oracle OpenWorld or JavaOne this year, I can record your tip in person. Just contact me to get on my schedule.

I'm also happy to share technical guidelines and other tips to help make your tip a good one.

So let's do this!


DBMS_FILE_TRANSFER unix permissions settings

Tom Kyte - Fri, 2017-08-25 01:46
Hi Tom, We have different unix development environments. When I'm using DBMS_FILE_TRANSFER.FCOPY in one environment, it creates the file with rw permission for the oracle user only. In another environment it creates the file rw to user orac...
Categories: DBA Blogs

Passing encoded url string containing %3A (colon) truncates everything after the last %3A

Tom Kyte - Fri, 2017-08-25 01:46
Hi, I am a relative newbie to APEX from Sweden (not so good at English). I am passing a url string to an apex application where I have encoded all special characters; if there is a %3A (colon) in the url, APEX seems to truncate the value I am passin...
Categories: DBA Blogs

RAC Vs Master/Slave configuration

Tom Kyte - Fri, 2017-08-25 01:46
Hi Tom, I am a little bit confused about master/slave... could you please explain the differences between Oracle RAC and master/slave, and why we would use master/slave instead of RAC or vice versa?
Categories: DBA Blogs

Oracle 12c DB creation

Tom Kyte - Fri, 2017-08-25 01:46
Hi Tom, I have 2 source systems (A & B), each running 10 Oracle 11gR1 databases. Now I need to migrate these 20 databases from the 2 sources into 1 target system (C), which is running a 3-node Oracle 12c RAC cluster. I proposed 1 conta...
Categories: DBA Blogs

Exporting a table that contains a CLOB column

Tom Kyte - Fri, 2017-08-25 01:46
Hi Ask-Tom-team, What is the recommended way to export a table that contains a CLOB column to a SQL script? The standard feature of SQL Developer only exports the columns that are not CLOBs. I want to export the table to a SQL script for i...
Categories: DBA Blogs

Assigning Batch Numbers to Query Rows

Tom Kyte - Fri, 2017-08-25 01:46
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production create table orders ( order_number number, order_line number) order_number line_number 50000 1 50000 2 50000 3 50010 1 50010 2 50300 3 50301 1 5030...
Categories: DBA Blogs

Partner Webcast – Oracle IaaS and Database Cloud: Sales Tools and Programs for Partners

Businesses are no longer asking if they should move to the cloud—but when and how. Most Oracle partners have a good understanding of the overall benefits of the cloud and the specific...

We share our skills to maximize your revenue!
Categories: DBA Blogs

Postgres vs. Oracle access paths X – Update

Yann Neuhaus - Thu, 2017-08-24 15:03

In the previous post we saw the cheapest way to get one row, reading only one block from its physical location. But that's the optimal case, where the row has not moved. I'll (nearly) conclude this series about access paths with an update.

ROWID in Oracle

Here is the ROWID of one row in Oracle:

select rowid from demo1 where n=1000;
ROWID
------------------
AAAR4WAAMAAAAEaAAF

There’s enough information here to get directly to the block with file_name and offset:
select file_name,dbms_rowid.rowid_block_number('AAAR4WAAMAAAAEaAAF')*block_size offset
from dba_data_files join dba_tablespaces using(tablespace_name)
where file_id=dbms_rowid.rowid_to_absolute_fno('AAAR4WAAMAAAAEaAAF','DEMO','DEMO1');
 
FILE_NAME OFFSET
---------------------------------------- ----------
/u01/oradata/CDB1A/PDB/users01.dbf 2310144

The ROWID also contains the index of the row within the block’s row directory:

select dbms_rowid.rowid_row_number('AAAR4WAAMAAAAEaAAF') from dual;
 
DBMS_ROWID.ROWID_ROW_NUMBER('AAAR4WAAMAAAAEAAAF')
-------------------------------------------------
5

TID in Postgres

And the TID of similar row in Postgres:

select ctid from demo1 where n=1000;
ctid
---------
(142,6)

The file is known from the table, as there is only one file per table:

show data_directory;
data_directory
----------------------------
/usr/share/postgresql/data
 
select pg_relation_filepath('demo1');
pg_relation_filepath
----------------------
base/16437/125852

The blocksize is common for the whole database:

show block_size;
block_size
------------
8192

Then the block is at offset 142*8192=1163264 in that file.
Within the block, the row is at index 6.
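
If you want to look inside that block, the pageinspect extension shows the details of line pointer 6 in block 142 (a sketch, assuming the extension is installed and you are connected as a superuser):

create extension if not exists pageinspect;
select lp, lp_off, lp_len, t_ctid
from heap_page_items(get_raw_page('demo1',142))
where lp=6;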

SELECT

We have seen in the previous post that we can select using the ROWID/TID, and that Oracle and Postgres behave the same way: only one block to read, with the cost estimation based on one random read:

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID 8mj3ms08x0sfh, child number 0
-------------------------------------
select /*+ */ a from demo1 where rowid='AAAR4WAAMAAAAEaAAF'
-----------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers |
-----------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 (100)| 1 |00:00:00.01 | 1 |
| 1 | TABLE ACCESS BY USER ROWID| DEMO1 | 1 | 1 | 1 (0)| 1 |00:00:00.01 | 1 |
-----------------------------------------------------------------------------------------------------------
Column Projection Information (identified by operation id):
-----------------------------------------------------------
1 - "A"[NUMBER,22]

Different units but the same meaning: cost=1 for Oracle is one random read; for Postgres, cost=1 is one sequential read and random reads are estimated at cost=4:

explain (analyze,verbose,costs,buffers) select a from demo1 where ctid='(142,6)' ;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Tid Scan on public.demo1 (cost=0.00..4.01 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=1)
Output: a
TID Cond: (demo1.ctid = '(142,6)'::tid)
Buffers: shared hit=1
Planning time: 0.358 ms
Execution time: 0.016 ms
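
The Postgres cost is easy to decompose, assuming the default planner parameters:

show random_page_cost; -- 4
show cpu_tuple_cost;   -- 0.01
-- Tid Scan: 1 random page * 4 + 1 tuple * 0.01 = 4.01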

Oracle UPDATE

Now I’m updating this row, changing the column X which contains 1000 ‘x’ characters to 1000 ‘y’ characters:

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID gpqv4k6m1q20y, child number 0
-------------------------------------
update /*+ */ demo1 set x=lpad('y',1000,'y') where rowid='AAAR4WAAMAAAAEaAAF'
------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers |
------------------------------------------------------------------------------------------------------------
| 0 | UPDATE STATEMENT | | 1 | | 1 (100)| 0 |00:00:00.01 | 4 |
| 1 | UPDATE | DEMO1 | 1 | | | 0 |00:00:00.01 | 4 |
| 2 | TABLE ACCESS BY USER ROWID| DEMO1 | 1 | 1 | 1 (0)| 1 |00:00:00.01 | 1 |
------------------------------------------------------------------------------------------------------------
Column Projection Information (identified by operation id):
-----------------------------------------------------------
2 - (upd=2) ROWID[ROWID,10], "X"[VARCHAR2,1000]

In addition to the access to the block (1 buffer), the update had to read 3 additional buffers. There are no indexes on this updated column, so Oracle has no additional maintenance to do. One buffer is the table block to update (the TABLE ACCESS BY USER ROWID was a consistent get; the update needs the current version of the block).

The additional buffers are from the UNDO tablespace, for MVCC (Multi-Version Concurrency Control). This is the first modification in my transaction, which therefore has to update the transaction table and the undo segment, and this is why we see 2 additional buffers. Another update within the same transaction reads only two buffers in total:

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID gpqv4k6m1q20y, child number 0
-------------------------------------
update /*+ */ demo1 set x=lpad('z',1000,'z') where rowid='AAAR4WAAMAAAAEaAAF'
------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers |
------------------------------------------------------------------------------------------------------------
| 0 | UPDATE STATEMENT | | 1 | | 1 (100)| 0 |00:00:00.01 | 2 |
| 1 | UPDATE | DEMO1 | 1 | | | 0 |00:00:00.01 | 2 |
| 2 | TABLE ACCESS BY USER ROWID| DEMO1 | 1 | 1 | 1 (0)| 1 |00:00:00.01 | 1 |
------------------------------------------------------------------------------------------------------------
Column Projection Information (identified by operation id):
-----------------------------------------------------------
2 - (upd=2) ROWID[ROWID,10], "X"[VARCHAR2,1000]

Only the table blocks are read: one consistent read as of the beginning of the query (or the transaction if in serializable isolation level) and one for the current block. Oracle has an optimization called In-Memory UNDO to avoid frequent access to undo blocks.

There are no further re-visits needed. Oracle may choose to come back at commit if it can be done quickly (a few blocks still in the buffer cache), but that's not required. The block can stay like this for years without the need to read it again for cleanup. If another session has to read it, then the cleanup may be done by that session.

Postgres UPDATE

Here is the same update in Postgres:

explain (analyze,verbose,costs,buffers) update demo1 set x=lpad('y',1000,'y') where ctid='(142,6)' ;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Update on public.demo1 (cost=0.00..4.01 rows=1 width=46) (actual time=0.214..0.214 rows=0 loops=1)
Buffers: shared hit=6 dirtied=3
-> Tid Scan on public.demo1 (cost=0.00..4.01 rows=1 width=46) (actual time=0.009..0.009 rows=1 loops=1)
Output: n, a, 'yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy'::text, ctid
TID Cond: (demo1.ctid = '(142,6)'::tid)
Buffers: shared hit=1
Planning time: 0.405 ms
Execution time: 0.232 ms

The Tid Scan is the same as for the select. Then the update has read 5 blocks and modified 3 of them. The update in Postgres is processed as a delete+insert. Here is my guess about those numbers. The new version is inserted, in a new block if there is no free space in the same block. The old version is updated. And the index must be maintained. Those are 3 blocks to modify. Here, the row was accessed directly through its TID, but we still have to find the index entry to maintain. The row contains the indexed value, so an index scan is possible: two block reads for this small index, which has only one branch level.
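
When the new version does not fit in the same block, the update cannot be a HOT (Heap-Only Tuple) update, which is why the index needs a new entry even though the indexed column did not change. A quick way to check is the standard statistics view (a sketch; the counters are cumulative since the last stats reset):

select n_tup_upd, n_tup_hot_upd
from pg_stat_user_tables
where relname='demo1';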

SELECT again

I said that with Oracle the row is updated in-place and doesn’t need further cleanup. If I run the same SELECT as the one I did before the UPDATE, I still have only one block to read:

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID 8mj3ms08x0sfh, child number 0
-------------------------------------
select /*+ */ a from demo1 where rowid='AAAR4WAAMAAAAEaAAF'
-----------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers |
-----------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 (100)| 1 |00:00:00.01 | 1 |
| 1 | TABLE ACCESS BY USER ROWID| DEMO1 | 1 | 1 | 1 (0)| 1 |00:00:00.01 | 1 |
-----------------------------------------------------------------------------------------------------------
Column Projection Information (identified by operation id):
-----------------------------------------------------------
1 - "A"[NUMBER,22]

In Postgres, because the update was processed as delete+insert, running the same query also reads only one block, but it returns no rows:

explain (analyze,verbose,costs,buffers) select a from demo1 where ctid='(142,6)' ;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Tid Scan on public.demo1 (cost=0.00..4.01 rows=1 width=4) (actual time=0.014..0.014 rows=0 loops=1)
Output: a
TID Cond: (demo1.ctid = '(142,6)'::tid)
Buffers: shared hit=1
Planning time: 0.442 ms
Execution time: 0.028 ms

The new version is in another block, so the TID to find it is different:

select ctid from demo1 where n=1000;
ctid
----------
(1428,5)
(1 row)

There was not enough space for another version of the whole row within the same block, so free space was found in the last block (1428). Of course, this is why the index was updated even though the indexed column did not change: it has to address a different block.

Let’s query with the new TID:

explain (analyze,verbose,costs,buffers) select a from demo1 where ctid='(1428,5)' ;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Tid Scan on public.demo1 (cost=0.00..4.01 rows=1 width=4) (actual time=0.008..0.008 rows=1 loops=1)
Output: a
TID Cond: (demo1.ctid = '(1428,5)'::tid)
Buffers: shared hit=1
Planning time: 0.449 ms
Execution time: 0.023 ms

Only one buffer is read. However, as we have seen with the Index Only Scan, there is a need for cleanup to avoid Heap Fetches. There are also the old tuples that should be removed later, or the updated tables and indexes will grow forever.
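
Those dead tuples are visible in the statistics, and VACUUM is what reclaims that space later (a minimal sketch):

select n_dead_tup, last_vacuum, last_autovacuum from pg_stat_user_tables where relname='demo1';
vacuum verbose demo1;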

There’s only one Postgres access path remaining. That’s for teh last post of this series, which will include the table of content.

 

The post Postgres vs. Oracle access paths X – Update appeared first on Blog dbi services.

dbms_metadata - getting inconsistent SXML (ordering)

Tom Kyte - Thu, 2017-08-24 07:26
Hi Tom, I am trying to use dbms_metadata to retrieve an SXML document per object in my production database, and compare it to the same document from my test database, in order to compare the two environments. Initially, I calculate a HASH value f...
Categories: DBA Blogs

Cloning the ORACLE_HOME

Tom Kyte - Thu, 2017-08-24 07:26
Please advise if the below steps are correct while copying the ORACLE_HOME from one location to another. Also advise if anything needs to be added to the below before we can start using the ORACLE_HOME. Copy ORACLE_HOME from old location to new...
Categories: DBA Blogs
