RE: System stats

From: Mark W. Farnham <mwf_at_rsiz.com>
Date: Sun, 14 Apr 2019 08:58:03 -0400
Message-ID: <082101d4f2c1$b3439e30$19cada90$_at_rsiz.com>



This is an excellent summary of the problems with silos of "that's my turf, no, you don't get to look because then I'll have to answer your stupid beginner questions, and besides our service levels are being met" work minimization in place of cooperation.  

Plan A is to not have silos, but that is often organizationally difficult, justified either by the size of spans of employee reports or by segmentation for security (or other things that don't occur to me at the moment).  

Plan B, if you have silos, is to have overall management that gives more "points" to solving problems than to never having problems. Then the silos are forthcoming and say to the group: "Hey, our service levels look good, but if you're having a throughput problem, how can we reduce any component of the latency or lack of throughput that is coming from our resource?" (Don't even bother trying to suppress that initial "Hey, our service levels look good." Even if you know damn well they are the problem, saying they are not meeting service levels or that they are the problem is contrary to getting the "how can we reduce any component" part that is needed.)  

The classic "Sane SAN" paper from James Morle, and his update for the age of SSDs and faster networks, may help you understand this at a broader level than the very fine posts from dimensional and Mladen.

Remember SLOB, if they will let you run it. Search for "Kevin Closson SLOB" and a deployment guide should pop up. There is a very good chance you can sell it as the industry-standard tool for benchmarking storage services for Oracle.  

Plan C, if overall management believes in negative points and in finding the root cause of technical problems to punish the "guilty" more than in having an efficient team, is to have a set of metrics, as outlined below, to "prove" it is a resource problem.  

By the way, I don't think anyone mentioned virtual LANs. One more nefarious (and possibly unintentional) network silo reporting trick is to show just the load on your virtual channel, usually averaged over time periods that show they are meeting or exceeding service levels 95% of the time.  

Plan C happens far too often. So being ready with a fair, objective process is prudent.  

Good luck.  

mwf  

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of dimensional.dba_at_comcast.net
Sent: Saturday, April 13, 2019 11:41 AM
To: gogala.mladen_at_gmail.com; oracle-l_at_freelists.org
Subject: RE: System stats  

Pretty much every SAN vendor has monitoring tools at the SAN level. The real question is whether or not you will get access to them in your company.

There is a variety of information you can collect from the OS side that will tell you you are having an I/O problem. However, whether storage is actually the problem is a longer discussion.

For example, at one client I reported to the VP of Storage while doing database performance tuning. We had Mladen's NetApps, and the VP refused to give us access to the storage monitoring tools. As is normally the case, the storage guys tell you there is no problem on their side, so you then have to go through a lot of alternate testing and collection of information to prove it is the storage, such as:

  1. Check the settings on the iSCSI adapters and test variations, not only on the server you are running on but also on other servers whose storage does not point to the specific NetApps you are having trouble with, and test across all time periods of the day to show patterns and wide variations in storage performance.
  2. Write code to measure and record I/O performance from inside the database and outside the database (because they will blame it on the database software). Test outside the database with a variety of test tools using shell, Java and C++, as at every level they will try to blame it on some performance tuning not performed at the app level. In this way you can normally show there is minimal or no deviation in performance between the different pathways, and definitely nothing that matches the deviations in storage performance you are seeing (see the first sketch after this list).
  3. Try to obtain information from the network team on the pathways and switches that the storage traffic follows (this includes fiber pathways if you are using them for storage). Network personnel are normally worse than storage personnel, as they normally just say the path shows no errors, or that you are only at 35%-50% utilization so you are good.
  4. Write code to test the network along the same pathways, or the response times to the switches along the route (see the second sketch after this list). Again, you have to use multifaceted code (shell, Java, C++) to eliminate the challenge that it must be something in the app.
  5. At least get a dump from the storage team of every server that is using the same SAN/NetApp as your database server, so you can run tests and collect OS metrics from each of those servers across time.
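
As a first sketch for item 2 above, something like the following can be scheduled from cron on several servers, some attached to the suspect NetApp and some not. It is only a rough outline, not the code I actually used: the test directory is a placeholder you would point at the storage in question, and it assumes dd and fio are installed.

#!/bin/bash
# io_probe.sh - crude outside-the-database I/O probe, logged with timestamps.
TESTDIR=/u01/ioprobe            # placeholder path on the suspect storage
LOG=/var/tmp/io_probe.log
mkdir -p "$TESTDIR"
TS=$(date '+%Y-%m-%d %H:%M:%S')

# Sequential write throughput, bypassing the page cache.
DD_OUT=$(dd if=/dev/zero of="$TESTDIR/ddtest" bs=1M count=1024 oflag=direct 2>&1 | tail -1)

# 60 seconds of 8k random reads, roughly what an OLTP database does.
FIO_OUT=$(fio --name=randread --directory="$TESTDIR" --rw=randread \
              --bs=8k --size=1g --runtime=60 --time_based \
              --direct=1 --group_reporting --minimal 2>/dev/null)

echo "$TS | dd:  $DD_OUT"  >> "$LOG"
echo "$TS | fio: $FIO_OUT" >> "$LOG"
rm -f "$TESTDIR/ddtest" "$TESTDIR"/randread*

The timestamped log makes the time-of-day patterns from item 1 easy to chart and to compare between servers.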
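
And as a second sketch for item 4, a loop that records round-trip times to the hops along the storage path; the hostnames are placeholders for whatever switches or filer heads your route actually uses.

#!/bin/bash
# net_probe.sh - record round-trip latency to hops along the storage path.
HOPS="switch-a switch-b netapp-head1"   # placeholders for your route
LOG=/var/tmp/net_probe.log
TS=$(date '+%Y-%m-%d %H:%M:%S')

for h in $HOPS; do
    # 20 pings per hop, quiet mode; keep only the min/avg/max/mdev summary line.
    RTT=$(ping -c 20 -q "$h" 2>/dev/null | tail -1)
    echo "$TS | $h | $RTT" >> "$LOG"
done

Run alongside the I/O probe, it helps separate a network component of the latency from the storage itself.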

After going through all that, the VP then requested that each of the vendors come in (blade server, virtualization, network, database, storage) to examine the setup for best practices. After 2 months of vendor swirl, making best-practice changes that had nothing to do with storage, the storage vendor decided that we of course needed to go to their all-flash storage. We of course purchased a new unit, and the only thing I pressed for in that scenario was that, if the problem really wasn't the storage, then all the servers that were on the original storage needed to move to the new storage for an apples-to-apples vindication.  

The storage of course didn't get its vindication, as the performance even with SSDs was within 5% of what it had been before, i.e. just as bad.

In the end the "bad server," as the storage guys put it, was moved to its own SAN, and everything else was happy at that point.  

If each side would simply provide all the information relevant to monitoring at a detailed level, not averaged over hourly increments, then you could quickly resolve issues. With SANs you always have to look at all connections to the storage when performing your analysis.
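
As a sketch of what "detailed level" can mean on the host side, the following captures extended per-device I/O statistics every few seconds instead of hourly averages; iostat comes from the sysstat package, and the interval, sample count and log location are arbitrary choices.

#!/bin/bash
# iostat_collect.sh - fine-grained per-device I/O stats (requires sysstat).
LOG=/var/tmp/iostat_$(hostname)_$(date +%Y%m%d).log
# Extended device stats with timestamps, every 5 seconds, 720 samples = 1 hour.
iostat -dxt 5 720 >> "$LOG"

Collected on every server attached to the same SAN (item 5 above), these logs let you line up await and %util spikes across hosts.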

From: oracle-l-bounce_at_freelists.org <oracle-l-bounce_at_freelists.org> On Behalf Of Mladen Gogala
Sent: Friday, April 12, 2019 4:18 PM
To: oracle-l_at_freelists.org
Subject: Re: System stats  

You can monitor your SAN with the Nagios agent for NetApp. If your SAN is not NetApp, throw it away and purchase NetApp. Not knowing what SAN equipment you have, that's the best advice I can offer. It is general advice: if you are not using a NetApp SAN, throw it away and buy NetApp. NetApp has your data on tap, like beer.
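
Short of SAN-level tools, a quick database-side check - not the Nagios agent, just a minimal sketch assuming a local sqlplus "/ as sysdba" connection - is to sample the cumulative read waits on a schedule:

#!/bin/bash
# dbio_sample.sh - log cumulative read waits; differencing consecutive samples
# shows when average read latency jumps (e.g. another cluster's batch window).
LOG=/var/tmp/dbio_sample.log
sqlplus -s / as sysdba <<'EOF' >> "$LOG"
set pagesize 0 feedback off
select to_char(sysdate,'YYYY-MM-DD HH24:MI:SS') || ' | ' || event ||
       ' | waits=' || total_waits ||
       ' | avg_ms=' || round(time_waited_micro/greatest(total_waits,1)/1000, 2)
from   v$system_event
where  event in ('db file sequential read','db file scattered read');
EOF

Run from cron every few minutes, this kind of log makes a recurring latency window like the 5-10 minute one described below easy to see.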

Regards

Disclosure:

I don't work for NetApp, but I really, really like their equipment.  

On 4/12/19 10:23 AM, (Redacted sender Jay.Miller for DMARC) wrote:

Can you recommend any sort of monitoring to identify when a SAN is getting overloaded? In our case it only became apparent when an app started experiencing latency at the same time for 5-10 minutes every day and we tracked it down to a batch job which was running on an entirely different cluster but which shared the same storage unit. Storage denied it was their problem right up until the point we proved it was.  

It would have been nice to have known that before the problems started showing up. Getting a new storage unit is a slow process.      

Jay Miller

Sr. Oracle DBA

201.369.8355  

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Neil Chandler
Sent: Tuesday, March 26, 2019 10:35 AM
To: Chris Taylor
Cc: gogala.mladen_at_gmail.com; ORACLE-L
Subject: Re: System stats  

In the majority of places I have worked - 5 clients last year - the SANs were overloaded in 4 of them. They are too frequently sized for capacity and not for throughput/response time. The response time was inevitably variable, and system stats would not have been helpful on the systems they have. In one of the clients, some of the critical DBs have dedicated storage, but changing the system stats would have had little to no effect on those systems due to other measures having been put in place (including using a low optimizer_index_cost_adj on one system, meaning lots of index use. Just not necessarily the right indexes.)  

The optimizer tries to be all things to all people, and there are lots of parameters to try to twist it into the shape that you want. The problem is frequently the abuse of those parameters - especially the global ones - via googling a problem, believing a silver-bullet blog, and the lack of time to prove the solution, so we just throw the fix into the system. It can be enlightening to strip the more extreme parameters back to their defaults and see how the system copes.  
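
As a rough starting point for that exercise (assuming sqlplus access on the system in question), something like this lists the optimizer-related parameters that have been explicitly changed from their defaults:

#!/bin/bash
# nondefault_optimizer_params.sh - list optimizer parameters set away from default.
sqlplus -s / as sysdba <<'EOF'
set pagesize 100 linesize 200
col name  format a40
col value format a40
select name, value
from   v$parameter
where  isdefault = 'FALSE'
and    name like '%optimizer%'
order  by name;
EOF

Whatever comes back is the candidate list for stripping back, one parameter at a time, with testing in between.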

As an aside, did you run your systems with the default parameters, discover notable problems and then use the 2 sets of system stats to correct those problems, or did you put them in from the start and everything was good?  

There's a case to be made for using system stats, but I just don't think that is something that should be used frequently.    

Neil.  


From: Chris Taylor <christopherdtaylor1994_at_gmail.com>
Sent: 26 March 2019 12:59
To: Neil Chandler
Cc: gogala.mladen_at_gmail.com; ORACLE-L
Subject: Re: System stats  

As far as the workload goes, I used 2 sets of workload stats and swapped between them - one for business hours and one for off-business hours, since each period had its own personality (for lack of a better word).  
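
For anyone wanting to script the same kind of swap, a minimal sketch - the statistics table name and the DAY/NIGHT stat IDs below are illustrative placeholders, not details from the original setup - assuming the two sets were gathered and exported beforehand:

#!/bin/bash
# swap_system_stats.sh DAY|NIGHT
# Imports one of two previously saved sets of workload system statistics.
# The sets are assumed to have been captured earlier, e.g.:
#   exec dbms_stats.create_stat_table(user, 'SYSTEM_STATS_TAB');
#   exec dbms_stats.gather_system_stats('INTERVAL', interval => 60);
#   exec dbms_stats.export_system_stats('SYSTEM_STATS_TAB', 'DAY');
STATID=${1:?usage: $0 DAY|NIGHT}
sqlplus -s / as sysdba <<EOF
exec dbms_stats.import_system_stats(stattab => 'SYSTEM_STATS_TAB', statid => '$STATID');
exit
EOF

Scheduled from cron at the boundaries of the business day, this flips the optimizer's workload statistics between the two personalities.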

As far as the SAN goes, if enough systems are hitting the SAN to cause the I/O rate/throughput to become affected, then it's *probably* time for a new SAN.      

Chris    

-- 
Mladen Gogala
Database Consultant
Tel: (347) 321-1217



--
http://www.freelists.org/webpage/oracle-l
Received on Sun Apr 14 2019 - 14:58:03 CEST
