Friday, January 21, 2011

Troubleshooting an EtherNet/IP System

When troubleshooting any EtherNet/IP system, you must work in a logical order. The order for each issue depends on the details of that issue. This TechTip lists and details, in order of priority, the troubleshooting steps for EtherNet/IP systems.

When troubleshooting EtherNet/IP systems, there are many possible scenarios. In general, there are three types of problems:
  • It does not work at all
    Examples: an I/O node is not connected to a switch (missing cable), a node cannot be pinged, all MSG instructions to a specific Allen-Bradley® 1756-ENBT ControlLogix® EtherNet/IP Module fail.
  • It works but is too slow
    Example: A resource (PC, controller, 1756-ENBT) in the system is overloaded.
  • It works but fails intermittently
    Examples: The ControlLogix controller outgoing unconnected message buffer is being exceeded, noise is causing an I/O connection to be lost.
Resolving the Problem
To resolve any of the above problems, you need to know where to look and what to examine. Check all of the following carefully as possible sources of the problem:
  • slow PC or slow application running on the PC
  • node configuration (IP address, etc.)
  • congested network (lots of traffic such as broadcast)
  • slow network (satellite or frame relay)
  • misconfigured switch or router
  • Logix controller resources
    - controller processing capability (5550, 5555, 5563)
    - timeslice for communications
    - cached message queue (32 max)
    - unconnected outgoing buffers (40 max)
  • insufficient processing capability in an ENBT module
  • duplicate IP addresses
  • defective Ethernet network hardware (e.g., cable, switch port, or ENBT module)
  • web server diagnostics or RSLinx® diagnostics
If you have addressed all the above issues and are still experiencing problems, noise could be the cause.

The steps below will provide general information to resolve any of the above problems. They do not detail individual troubleshooting possibilities.

The steps can be categorised as follows:
  • It does not work at all
    See Intermittent/No Response, Physical Layer
  • It works but is too slow
    See Logix Controller System Overhead, Module Device Capacity, I/O or Produce/Consume Tags, Rockwell Automation Ethernet NIC, Logix Controller outgoing unconnected message buffer, etc.
  • It works but fails intermittently
    See Switch configuration, I/O or Produce/Consume Tags, Logix Controller unconnected message buffer, etc.
The order of troubleshooting steps is important. Start with Step 1 and work your way down. Skip any steps that you know are not necessary.

Step 1: Intermittent or No Response

You may see the following when there is intermittent or no response:
  • "Request timed out" can result from numerous issues, including a target that is powered down.
  • "Unknown host" means the specified IP address is bad, e.g., 255.255.255.255.
  • "Destination host is not reachable" can result from numerous issues, including a bad cable.
When any of the above occur, check for the following:
  • AC power not applied
  • A missing or defective cable (a clue would be that the Link light is off or intermittent)
  • You did not configure the module
  • You did not completely configure the target node
    - including subnet mask and gateway
    Example: attempting to ping a module on a different subnet, and the subnet mask is set incorrectly or the gateway address is incorrect.
  • On some switches (e.g., Cisco 3550), port mirroring disables pinging (on the "mirror-to" port)
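The subnet mask and gateway check in the list above can be reasoned through with a short sketch. It uses Python's standard ipaddress module to test whether a target lies inside the local subnet; if it does not, a ping can only succeed when the gateway address is also correct. The function name and the PC address 130.130.130.50 are hypothetical, chosen only for illustration.

```python
import ipaddress


def same_subnet(local_ip, subnet_mask, target_ip):
    """Return True if target_ip lies in the subnet defined by local_ip and subnet_mask."""
    # strict=False allows a host address to be given instead of the network address.
    network = ipaddress.ip_network(f"{local_ip}/{subnet_mask}", strict=False)
    return ipaddress.ip_address(target_ip) in network


# Hypothetical PC at 130.130.130.50 pinging a module at 130.130.130.1.
print(same_subnet("130.130.130.50", "255.255.255.0", "130.130.130.1"))  # True
# A target on another subnet is reachable only through a correctly set gateway.
print(same_subnet("130.130.130.50", "255.255.255.0", "130.130.131.1"))  # False
```

If the function returns False for a node you expected to reach directly, the subnet mask on one of the two nodes is a good first suspect.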
If replies are intermittent, ping continuously and record the deviation. If the jitter is more than 10ms or a reply is skipped, one of the following is likely:
  • Something is busy (network or NIC). However, a busy 1756-ENBT probably won’t be the problem. From measurements, a 1756-ENBT running at 100% CPU utilization replies in the range 10-16ms. If you find a heavily loaded interface, reduce the load to 90% or less to allow for some margin.
  • The network has long delays (satellite or frame relay)
  • Noise is corrupting packets, and they are being dropped
    Example: ping -t 130.130.130.1
    This will ping continuously
If you can ping successfully but the problem is not solved, continue with the next steps. For help with the ping command, enter ping with no arguments at a command prompt (DOS window). You could also use RSWho to test connectivity; however, ping is simpler to use and faster.
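Recording the deviation by hand is tedious, so the check described in this step can be sketched in a few lines of Python. The sketch assumes Windows-style ping output lines (for example "time=12ms" or "time<1ms") and assumes that "jitter" means the spread between the fastest and slowest reply; both are assumptions, and the function names are hypothetical.

```python
import re

# Matches "time=12ms" and "time<1ms" in a reply line.
REPLY_RE = re.compile(r"time[=<](\d+)\s*ms", re.IGNORECASE)


def parse_ping_lines(lines):
    """Return one entry per ping attempt: round-trip time in ms, or None if lost."""
    results = []
    for line in lines:
        match = REPLY_RE.search(line)
        if match:
            # "time<1ms" is recorded as 1 ms.
            results.append(int(match.group(1)))
        elif "timed out" in line.lower() or "reachable" in line.lower():
            results.append(None)
    return results


def assess(rtts_ms, max_jitter_ms=10):
    """Flag a capture as suspect if a reply was lost or the spread exceeds the limit."""
    replies = [r for r in rtts_ms if r is not None]
    lost = len(rtts_ms) - len(replies)
    jitter = max(replies) - min(replies) if replies else 0
    return {"lost": lost, "jitter_ms": jitter,
            "suspect": lost > 0 or jitter > max_jitter_ms}


capture = [
    "Reply from 130.130.130.1: bytes=32 time=2ms TTL=64",
    "Request timed out.",
    "Reply from 130.130.130.1: bytes=32 time=15ms TTL=64",
]
print(assess(parse_ping_lines(capture)))
```

Save the output of a continuous ping to a file and feed its lines to parse_ping_lines to apply the same 10ms rule to a longer capture.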

Step 2: Bad Hardware

If communications are consistently bad, replace suspect hardware to isolate the trouble area. Suspects include cables, the Rockwell Automation Ethernet interface (e.g., 1756-ENBT), and switch ports.

The problem may also be old firmware or hardware. Record hardware and firmware versions and contact the appropriate vendor for update information.

Step 3: Switch Configuration, Autonegotiation or Hard-configuration

The autonegotiation specification (in the 802.3 standard) allows for interpretation by developers. As a result, every vendor’s autonegotiation firmware has similar, but not identical, functionality. If one node is configured for half-duplex and the other for full-duplex, communications will be lost at random and possibly frequently.

To see the Rockwell Automation duplex/speed status, see Rockwell Automation web server diagnostics, Class 1 Packet Statistics. Verify that the status reported matches the switch configuration.

Example: If your switch is configured for Autonegotiation, the Rockwell Automation web server page should indicate Autonegotiated speed and duplex.
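When many ports have to be checked, it can help to write the comparison down explicitly. The following is a minimal sketch, assuming you have copied the switch port configuration and the status reported by the module's web page into two records by hand; the LinkSettings type and link_mismatches function are hypothetical names, not part of any Rockwell Automation tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkSettings:
    autonegotiate: bool   # True if speed/duplex are autonegotiated
    speed_mbps: int       # e.g., 10 or 100
    duplex: str           # "full" or "half"


def link_mismatches(switch_port, module):
    """List the ways the switch port and the module disagree."""
    problems = []
    if switch_port.autonegotiate != module.autonegotiate:
        problems.append("negotiation method differs")
    if switch_port.speed_mbps != module.speed_mbps:
        problems.append("speed differs")
    if switch_port.duplex != module.duplex:
        problems.append("duplex differs")
    return problems


# Hypothetical case: switch hard-configured, module autonegotiated to half duplex.
switch = LinkSettings(autonegotiate=False, speed_mbps=100, duplex="full")
module = LinkSettings(autonegotiate=True, speed_mbps=100, duplex="half")
print(link_mismatches(switch, module))
```

An empty list means both ends agree; any entry is a reason to reconfigure one side before looking further.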

If you are running out of troubleshooting ideas, hard configure the speed and duplex on the switch ports and also on all Rockwell Automation nodes. This will eliminate one more variable.

With RSLogix version 12 software, you can hard configure speed and duplex. RSLinx version 2.41 software (build 10) does not yet support this feature.

Step 4: I/O or Produce/Consume Tags (class 1 messaging)

Look at Missed Frames in the web browser diagnostics (see detailed web server description in Step 12). This parameter is only for I/O or produce tag messaging.

Although some applications may still run when losing frames, you should strive for a system with zero (0) dropped frames.

Furthermore, if you are dropping at least four consecutive frames, you might be dropping a CIP connection. If you are dropping connections, the Missed Frames counter will definitely be incrementing. If you are not dropping connections, it may still be incrementing if your system is not as stable as possible.

Viewing Missed Frames will help quantify a problem. The yellow triangles in the RSLogix 5000 software I/O Configuration tree will not be seen if a connection is lost and recovered quickly. However, the Missed Frames counter will see everything - even one missed frame. This counter is excellent for diagnostics because of its high resolution.
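The difference between an occasional missed frame and a lost connection comes down to how many frames are missed in a row. As a minimal sketch of that reasoning, assume you have a record of which expected frames arrived (for example, from a network trace); the threshold of four consecutive frames is the figure given above, and the function names are hypothetical.

```python
def longest_miss_run(received):
    """Return the longest run of consecutive missed frames.

    received is a sequence of booleans, one per expected frame.
    """
    longest = current = 0
    for ok in received:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest


def connection_at_risk(received, threshold=4):
    """True if enough consecutive frames were missed to risk dropping the connection."""
    return longest_miss_run(received) >= threshold


# Isolated misses: the counter increments, but the connection should survive.
print(connection_at_risk([True, False, True, False, False, True]))   # False
# Four misses in a row: the connection may be dropped.
print(connection_at_risk([True, False, False, False, False, True]))  # True
```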

Step 5: EtherNet/IP Module Device Capacity

Use the web server to verify that CPU utilization on the Ethernet NIC is less than 100%. If utilization is at 100%, this may be the problem. To reduce the utilization:
  • Make I/O RPI values larger (slower)
  • Reduce the number of I/O connections
  • Make non-critical traffic less frequent (e.g., MSGs and HMI)
  • Add another EtherNet/IP module and divide the traffic load
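To see how much the first two measures help, you can estimate the class 1 load before and after a change. This sketch assumes each I/O or produce/consume connection carries one packet in each direction per RPI, which matches the figure of 200 packets per second for a 10ms RPI quoted in the web server description later in this TechTip. The function name and the connection counts are hypothetical.

```python
def class1_packets_per_second(rpis_ms):
    """Estimate class 1 load from a list of RPIs, one entry per connection.

    Assumed model: two packets (one each way) per connection per RPI.
    """
    return sum(2 * 1000.0 / rpi for rpi in rpis_ms)


# Hypothetical system: 20 connections, all at a 10 ms RPI.
before = class1_packets_per_second([10] * 20)
# Same system after doubling every RPI to 20 ms.
after = class1_packets_per_second([20] * 20)
print(before, after)  # 4000.0 2000.0
```

Compare the estimate with the actual class 1 packets per second shown by the module; doubling an RPI halves that connection's contribution.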
Step 6: Logix Controller Outgoing Unconnected Message Buffer

ControlLogix controllers have a limit of 10 outgoing unconnected buffers. As of version 8, this can be increased to 40. These buffers are required for all messaging, both explicit and implicit, in order to establish a connection.

If the controller tries to exceed this limit, it will fail. For example, if you try to initiate 50 MSG instructions simultaneously, those in excess of the buffer size will fail. See the Rockwell Automation Knowledgebase document G20181 for information on reading unconnected outgoing buffers.
- attribute 17 is reserve (unused)
- attribute 18 is high-water mark
- attribute 19 is buffers currently in use

Use RSLogix 5000 version 12 software to read the above values reliably.
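The arithmetic behind the 50 MSG example is simple, and writing it down makes it easy to test other cases. The sketch below uses the limits given in this step (10 by default, 40 at most). Splitting the triggers into groups is shown only as one possible way to stay under the limit; it is not a method prescribed by this TechTip, and all names are hypothetical.

```python
def msgs_over_limit(simultaneous_msgs, buffer_limit=10):
    """Number of simultaneous MSG instructions beyond the unconnected buffer limit."""
    return max(0, simultaneous_msgs - buffer_limit)


def stagger(msg_names, buffer_limit=10):
    """Split MSG triggers into groups no larger than the buffer limit."""
    names = list(msg_names)
    return [names[i:i + buffer_limit] for i in range(0, len(names), buffer_limit)]


# 50 MSGs triggered at once with the limit raised to 40: 10 are over the limit.
print(msgs_over_limit(50, buffer_limit=40))  # 10

# Hypothetical MSG names split into groups that fit within the limit.
groups = stagger([f"MSG_{n}" for n in range(50)], buffer_limit=40)
print([len(g) for g in groups])  # [40, 10]
```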

Step 7: Logix Controller System Overhead

Add more time for communications by increasing the continuous task timeslice, or run the higher-priority tasks (e.g., Periodic) less frequently or at a lower priority. The default timeslice is 10%. Try increasing it to 30-50%.
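To get a feel for what a timeslice change does, the following sketch works through the arithmetic. It assumes the commonly documented model in which the controller services communications for about 1ms at a time and lets the continuous task run in between; that model is an assumption here and is not stated in this TechTip, so verify it against the documentation for your controller. The function name is hypothetical.

```python
def continuous_task_run_ms(timeslice_percent, overhead_slice_ms=1.0):
    """Time the continuous task runs between overhead slices (assumed model).

    Assumes overhead is serviced in fixed slices of overhead_slice_ms and the
    timeslice percentage is overhead / (overhead + continuous task time).
    """
    if not 0 < timeslice_percent < 100:
        raise ValueError("timeslice_percent must be between 0 and 100")
    return overhead_slice_ms * (100 - timeslice_percent) / timeslice_percent


for percent in (10, 20, 50):
    print(percent, continuous_task_run_ms(percent))
```

Under this model, moving from 10% to 50% means communications are serviced after every 1ms of continuous task execution instead of every 9ms, at the cost of a slower continuous task scan.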

Step 8: Slow PC Application

If your application is running slowly, there are two possible reasons:
  • The PC is underpowered
  • The application runs slowly
    (or accesses controller data inefficiently)
In either case, look at the CPU utilization in the Windows® Task Manager to see how close it is to 100%.

You can also stop the application and use OPC test client (included with RSLinx software) to access all the data you need. Configure the topic poll rate for 1ms to operate it at the same speed as the Rockwell Automation controller(s). If you can achieve sufficient throughput using this approach, the problem is likely the application itself or an underpowered PC.

Step 9: Duplicate IP Address

If two Rockwell Automation nodes are configured with the same IP address, the last one to be configured will "steal" the IP address. When this happens, detection can be simple or difficult:
  • Simple Detection
    In the I/O tree, a 1794-AENT adapter is configured and operating well. However, a 1756-ENBT module is then accidentally configured for the same address. When this situation occurs, the Logix controller declares that the connection to the AENT adapter is lost.
  • Difficult Detection
    Messages (MSG instructions) from one ControlLogix controller to another are occurring. Then, after a third device is configured, the MSGs start failing. If you ping the IP address, it will ping OK. If the third device is of the same type (e.g., 1756-ENBT) but does not have the desired tag, even RSWho will show good connectivity but the MSG will fail.
Work is in progress within ODVA EtherNet/IP to examine a standard mechanism to detect and defend against duplicate addresses.
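Because ping succeeds no matter which node answers, one way to expose the difficult case is to compare the hardware (MAC) addresses seen for each IP address. The sketch below assumes you have collected (IP, MAC) pairs yourself, for example from ARP tables or a network trace taken at different times; this technique is offered as an illustration and is not described in this TechTip. The MAC addresses and the function name are hypothetical.

```python
from collections import defaultdict


def find_duplicate_ips(observations):
    """Return IP addresses that were seen with more than one MAC address.

    observations is an iterable of (ip, mac) pairs.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical observations: 10.88.76.96 answers from two different devices.
seen = [
    ("10.88.76.96", "02:00:00:00:00:01"),
    ("10.88.76.97", "02:00:00:00:00:02"),
    ("10.88.76.96", "02:00:00:00:00:03"),
]
print(find_duplicate_ips(seen))
```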

Step 10: Network Trace

If you have yet to solve the problem, you need to examine the network. Take a trace of the network and analyze it for problems. If you are unable to do this, Rockwell Automation can provide assistance through our Network Services and Remote Support (see the Products, Services and Support section in the back of this publication for information).

While waiting for an analysis of the trace, you can look at the physical layer (see below).

Step 11: Noise or Intermittent Defective Hardware

If the preceding steps do not solve the problem, noise or bad hardware is probably the problem. Intermittent communication is most likely caused by one of the following:
  • Ethernet cable placement (visually inspect for cable placement next to 480VAC).
  • Noise/grounding (physically detach an intermittent chassis from the enclosure and see how it operates).
  • Intermittent hardware (focus on a communications problem between two nodes and try the following: replace a Rockwell Automation Ethernet interface, move the Cat5 cable (from a Rockwell Automation node) to a different switch port, or replace an Ethernet cable).

Step 12: Web Server Description

From the Rockwell Automation web server home page, the following parameters have proven useful when troubleshooting a system on one of the following modules: 1756-ENBT, 1788-ENBT, 1794-AENT, 1769-L35E. (Other Rockwell Automation EtherNet/IP products currently do not use them but may in the future.)

In the Address field of your Internet browser, enter the IP address of an Ethernet interface module (e.g., 10.88.76.96). You will see something similar to Figure 1.

Since it is probably the busiest, the Ethernet interface(s) within the controller chassis is where you should begin troubleshooting (as opposed to your other Rockwell Automation Ethernet modules such as ControlLogix, Flex I/O, etc.).

How many errors are too many? The answer to this question is application dependent. For example, if you have a single bad UDP checksum (caused by electrical noise) every 100 packets, that packet will be discarded. Some may say this is not a problem because the production line is running fine. However, to others this is unacceptable.
Figure 1
Figure 2
Up to this time, most requests for troubleshooting have involved I/O and produce tags. The diagnostics most useful for I/O and produce tags are marked with an asterisk (*) below.

Backplane Statistics - Identifies backplane errors.

Connection Manager Statistics - Identifies if any Rejects or Timeouts are incrementing. Note: you can get the same info from RSLinx by right clicking on the Ethernet module and selecting Module Statistics and selecting Connection Manager.

Ethernet Statistics - Identifies Input/Output errors

TCP Statistics - Displays connection requests (outgoing from the controller through an ENBT), connection accepts (incoming from the wire through an ENBT to a controller; these will increment while you are online with a web browser), and discards (bad packets that have been discarded).

UDP Statistics - This screen will increment only if other devices are sending non-CIP UDP packets to this module. At this time, no devices send non-CIP UDP packets to this module.

From testing with a produced tag (RPI=10ms), the total UDP packets and input UDP packets do increment (on the company network) but they increment at a rate of only 1-3 every 10-30 seconds. With an RPI of 10ms, the produce tag rate is 200 packets per second. The conclusion is that there is no relationship between CIP packets and UDP statistics. Without connecting Sniffer to investigate, the assumption is that someone in the building is sending multicast to all stations, including my ENBT module.

Also, the addition of CIP UDP checksum errors has formally been requested.

Encapsulation Statistics - Shows cumulative and active in/out TCP connections used for encapsulation (CIP) sessions.

The TCP statistics shown are for all TCP connections (e.g., CIP, HTTP, telnet).

Enet/IP (CIP) Statistics - Active Class 1 Transports provides the number of transports. In general, two (2) class 1 transports equate to a connection. Use this number to verify against your calculated class 1 total.

Class 3 transport information is supplied including client (outgoing) and server (incoming) details.

Unconnected message information is also provided. The UCMM Worst Backlog (Client) can be used to see the unconnected message high-water mark for messages to legacy PLCs. If this value is 10 and you have the Logix processor configured for a maximum of 10, you may be trying to exceed the controller’s limit.

Class 1 (CIP) Packet Statistics
  • Link Status* (including negotiation description)
  • Speed*
  • Duplex*
  • Method for selecting duplex and speed*
    (e.g., Autonegotiation)
  • CPU Utilization Percentage*
    (includes processing for everything on the module)
  • Current TCP connections (for all connections, class 1 and class 3, includes actual connections and ones being built but not yet complete)
  • Current incoming TCP connections (these are for all connections, class 1 and class 3)
  • Current outgoing TCP connections (for all connections, class 1 and class 3, includes actual connections and ones being built but not yet complete)
  • Actual class 1 packets per second* (for I/O and produce tag only, compare your calculated value to this number)
  • Reserve Class 1 capacity (displays how much is unused)
  • Total Missed Class 1 Packets* (for I/O and produce tag only)
Class 1 (CIP) Active Transports* - You should see only the RPIs you configured (e.g., if all your configured RPIs are 50ms, you should see only 50ms APIs).
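On a module with many transports, scanning the list by eye is error prone. A minimal sketch of the same check, assuming you have typed the configured RPIs and the API values shown on the page into two lists (the function name is hypothetical):

```python
def unexpected_apis(configured_rpis_ms, observed_apis_ms):
    """Return observed API values that do not match any configured RPI."""
    return sorted(set(observed_apis_ms) - set(configured_rpis_ms))


# All connections configured for 50 ms, but one transport is running at 20 ms.
print(unexpected_apis([50], [50, 50, 20, 50]))  # [20]
```

Any value returned points to a connection that was configured differently from what you intended, possibly by another controller.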

Class 3 (CIP) Active Transports - For explicit messaging, transports are the same as connections. Examine the remote addresses. Verify that these are correct for your system.

Examine the number of Class 3 transports. The number of transports expected depends on what you are doing. Examples include:
  • RSLogix 5000 opens one CIP connection.
  • A PanelView Plus can use one or more depending on the volume of tags on scan. With 488 tags on scan (120 integers, 120 dints, 128 reals, 128 bools), a PanelView Plus (actually RSLinx Enterprise) opened three transports.
Step 13: RSLinx Diagnostics

From RSLinx, in RSWho, you can right click, select Module Statistics and select the tabs/links listed below.
  • Link name: General (this tab is self-descriptive)
  • Link name: Port Diagnostics
    Most of this information can also be found in the web server in the following places: Diagnostics - Ethernet Statistics, Diagnostics - TCP Statistics, Diagnostics - IP Statistics. There is often more information in the web server but you must look in three different places to see everything. Additionally, RSLinx Port Diagnostics shows some values (e.g., alignment errors) that are not seen in the web server.
    It is recommended that you look at RSLinx Port Diagnostics and note any errors.
  • Link name: Connection Manager
    (Same as Connection Manager in web server)
  • Link name: Backplane
    (Same as Backplane stats in web server)
References/Additional Resources
  1. Noise
    • EtherNet/IP Media Planning and Installation Manual (Publication ENET-IN001A-EN-P)
    • Industrial Automation Wiring and Grounding Guidelines, 1770-4.1
    • GMC-RM001: www.ab.com/manuals/gmc/GMC-RM001A-EN-P-JUL01.pdf
  2. System Planning and Module Capacities
    • EtherNet/IP Performance and Application (Publication ENET-AP001C-EN-P)
