BGP deep dive

BGP Basic Nomenclature

BGP speaker – A router that implements BGP

BGP identifier – Unique identifier of BGP speaker (IP address)

Autonomous System (AS) – The classic definition of an Autonomous System is a set of routers under a single technical administration, using an interior gateway protocol (IGP) and common metrics to determine how to route packets within the AS, and using an inter-AS routing protocol to determine how to route packets to other ASes. Since this classic definition was developed, it has become common for a single AS to use several IGPs and, sometimes, several sets of metrics within an AS. The use of the term Autonomous System stresses the fact that, even when multiple IGPs and metrics are used, the administration of an AS appears to other ASes to have a single coherent interior routing plan, and presents a consistent picture of the destinations that are reachable through it.

External peer – Peer that is in a different AS than the local system

Internal peer – Peer that is in the same AS as the local system

IGP – Interior Gateway Protocol, a protocol used to exchange routing information within one AS (for example OSPF, EIGRP, IS-IS, RIP).

EBGP – External BGP (connection between external peers)

IBGP – Internal BGP (connection between internal peers)

NLRI – Network Layer Reachability Information. A fancy term for route prefixes (IP prefix and prefix length; the next hop is not part of the NLRI but is carried separately in the NEXT_HOP path attribute). NLRI is carried in BGP UPDATE messages and contains one or more prefixes. It is the standard name (RFC 4271) and was used in the past by the Cisco CLI when configuring BGP. Check the pictures below:

Route – Unit of information that contains a path to certain destination.

RIB – Routing Information Base. BGP has three routing information bases: Adj-RIB-In, Adj-RIB-Out, and Loc-RIB.

Adj-RIB-In – contains unprocessed routing information that has been advertised to the local BGP speaker by its peers.

Loc-RIB – contains the routes that have been selected by the local BGP speaker’s Decision Process.

Adj-RIB-Out – contains the routes for advertisement to specific peers by means of the local speaker’s UPDATE messages.

BGP General Information

The Border Gateway Protocol (BGP) is an inter-Autonomous System routing protocol. The primary function of a BGP speaking system is to exchange network reachability information with other BGP systems. Routing information exchanged via BGP supports only the destination-based forwarding paradigm, which assumes that a router forwards a packet based solely on the destination address carried in the IP header of the packet.

BGP uses TCP [RFC793] as its transport protocol. This eliminates the need to implement explicit update fragmentation, retransmission, acknowledgement, and sequencing. BGP listens on TCP port 179.

Incremental updates are sent as the routing tables change. BGP does not require a periodic refresh of the routing table.

To allow local policy changes to have the correct effect without resetting any BGP connections, a BGP speaker SHOULD either (a) retain the current version of the routes advertised to it by all of its peers for the duration of the connection (soft-reconfiguration inbound), or (b) make use of the Route Refresh extension [RFC2918].

BGP does not enable one AS to send traffic to a neighboring AS for forwarding to some destination (reachable through but) beyond that neighboring AS, intending that the traffic take a different route to that taken by the traffic originating in the neighboring AS (for that same destination). In plain English: „One AS cannot force another AS to change how it routes traffic.“ We can only try to influence the remote AS's routing through path attributes; the remote AS has the final word on how it routes the traffic.

BGP provides mechanisms by which a BGP speaker can inform its peers that a previously advertised route is no longer available for use. There are three methods by which a given BGP speaker can indicate that a route has been withdrawn from service:

a) the IP prefix that expresses the destination for a previously advertised route can be advertised in the WITHDRAWN ROUTES field in the UPDATE message, thus marking the associated route as being no longer available for use,

b) a replacement route with the same NLRI can be advertised, or

c) the BGP speaker connection can be closed, which implicitly removes all routes the pair of speakers had advertised to each other from service.

Changing the attribute(s) of a route is accomplished by advertising a replacement route.
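The three withdrawal methods can be pictured as operations on a per-peer table. What follows is only a minimal Python sketch under that assumption; the class AdjRibIn, its method names, and the attribute dictionaries are hypothetical and do not reflect how any real implementation stores routes.

```python
class AdjRibIn:
    """Toy per-peer Adj-RIB-In: maps prefix -> path attributes (hypothetical model)."""

    def __init__(self):
        self.routes = {}

    def apply_update(self, withdrawn, attrs, nlri):
        # (a) explicit withdrawal through the WITHDRAWN ROUTES field
        for prefix in withdrawn:
            self.routes.pop(prefix, None)
        # (b) a route with the same NLRI replaces the previously advertised one
        for prefix in nlri:
            self.routes[prefix] = dict(attrs)

    def session_closed(self):
        # (c) closing the connection removes every route learned from this peer
        self.routes.clear()


rib = AdjRibIn()
rib.apply_update([], {"as_path": [65001]}, ["10.0.0.0/8", "192.0.2.0/24"])
# One UPDATE withdraws 10.0.0.0/8 and replaces the attributes of 192.0.2.0/24.
rib.apply_update(["10.0.0.0/8"], {"as_path": [65001, 65002]}, ["192.0.2.0/24"])
```

Note that changing attributes and replacing a route are the same operation here, which matches the statement that attributes are changed by advertising a replacement route.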

BGP Routing Information Base

The Routing Information Base (RIB) within a BGP speaker consists of three distinct parts:

a) Adj-RIBs-In: The Adj-RIBs-In stores routing information learned from inbound UPDATE messages that were received from other BGP speakers. Their contents represent routes that are available as input to the Decision Process. You can display the Adj-RIB-In with „show ip bgp neighbors neighbor-address received-routes“, but only if soft-reconfiguration inbound is enabled.

b) Loc-RIB: The Loc-RIB contains the local routing information the BGP speaker selected by applying its local policies to the routing information contained in its Adj-RIBs-In. These are the routes that will be used by the local BGP speaker. The BGP Decision Process is then applied and the best route is selected. The next hop for each of these routes MUST be resolvable via the Routing Table. You can display the Loc-RIB with the „show ip bgp“ command. This command also marks the selected best BGP paths, which are installed in the global RIB (routing table).

c) Adj-RIBs-Out: The Adj-RIBs-Out stores information the local BGP speaker selected for advertisement to its peers. The routing information stored in the Adj-RIBs-Out will be carried in the local BGP speaker’s UPDATE messages and advertised to its peers. You can display Adj-RIBs-Out with „show ip bgp neighbors neighbor-address advertised-routes“

See pictures below:

The correct process is as follows:

Step 1. Store the route in the Adj-RIB-In and process inbound route policies. The route is stored in the Adj-RIB-In table in its original state. The inbound route policy is applied based on the neighbor from which the route was received.
Step 2. Update the Loc-RIB. The BGP Loc-RIB database is updated with the latest entry. The Adj-RIB-In is cleared to save memory.
Step 3. Pass a validity check. Verify that the route is valid and that the next-hop address is resolvable in the global RIB. If the check fails, the route remains in the Loc-RIB table but is not processed further.
Step 4. Compute the BGP best path. Identify the BGP best path and pass only the best path and its path attributes to Step 5.
Step 5. Install the BGP best path into the global RIB and advertise it to peers. Install the route into the global RIB, process the outbound route policy, store the non-discarded routes in the Adj-RIB-Out, and advertise them to BGP peers.
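
The five steps can be sketched as a small pipeline. This is an illustrative Python model only: the Path class, the process function, and the policy callables are hypothetical names, and the best-path step models just two tie-breakers (highest LOCAL_PREF, then shortest AS_PATH) instead of the full decision process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    prefix: str
    next_hop: str
    local_pref: int = 100
    as_path: tuple = ()


def process(received, inbound_policy, resolvable_next_hops, outbound_policy):
    """Run the five steps for a batch of received paths (simplified)."""
    # Step 1: apply the inbound policy; it returns a (possibly modified) path or None.
    accepted = [p for p in map(inbound_policy, received) if p is not None]
    # Step 2: update the Loc-RIB, grouped by prefix.
    loc_rib = {}
    for path in accepted:
        loc_rib.setdefault(path.prefix, []).append(path)
    global_rib, adj_rib_out = {}, {}
    for prefix, paths in loc_rib.items():
        # Step 3: validity check; failing paths stay in the Loc-RIB but go no further.
        valid = [p for p in paths if p.next_hop in resolvable_next_hops]
        if not valid:
            continue
        # Step 4: best path (only LOCAL_PREF and AS_PATH length are modelled).
        best = max(valid, key=lambda p: (p.local_pref, -len(p.as_path)))
        # Step 5: install, then advertise what the outbound policy lets through.
        global_rib[prefix] = best
        out = outbound_policy(best)
        if out is not None:
            adj_rib_out[prefix] = out
    return loc_rib, global_rib, adj_rib_out


paths = [
    Path("203.0.113.0/24", "10.1.1.1", as_path=(65001, 65003)),
    Path("203.0.113.0/24", "10.2.2.2", as_path=(65002,)),
    Path("198.51.100.0/24", "10.9.9.9", as_path=(65009,)),  # next hop not resolvable
]
loc_rib, global_rib, adj_rib_out = process(
    paths,
    inbound_policy=lambda p: p,
    resolvable_next_hops={"10.1.1.1", "10.2.2.2"},
    outbound_policy=lambda p: p,
)
```

In this example 198.51.100.0/24 stays in the Loc-RIB but is neither installed nor advertised, because its next hop fails the validity check.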

BGP Message format

Fixed header

                          1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                                                               |
      +                                                               +
      |                                                               |
      +                                                               +
      |                           Marker                              |
      +                                                               +
      |                                                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |          Length               |      Type     |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Marker is a 16-octet field set to all ones. In conjunction with Length, it allows one to find the next BGP message in the TCP stream.

Length is a 2-octet field that gives the total length of the BGP message in octets, including the header.

Type is a one-octet field that gives the type of the BGP message. The message types are:

OPEN
UPDATE
NOTIFICATION
KEEPALIVE
Route Refresh (through separate RFC2918)
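
To make the role of Marker and Length concrete, here is a minimal Python sketch that splits a TCP byte stream into BGP messages using only the 19-octet fixed header. The function name split_messages is hypothetical. The type codes 1 to 5 are those assigned by RFC 4271 and RFC 2918; the 4096-octet upper bound is the RFC 4271 maximum and assumes no extended-message support.

```python
import struct

HEADER_LEN = 19  # 16 (Marker) + 2 (Length) + 1 (Type)
MESSAGE_TYPES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION", 4: "KEEPALIVE", 5: "ROUTE-REFRESH"}


def split_messages(stream: bytes):
    """Yield (type_name, body) for each complete BGP message in a byte stream."""
    offset = 0
    while offset + HEADER_LEN <= len(stream):
        marker, length, msg_type = struct.unpack_from("!16sHB", stream, offset)
        if marker != b"\xff" * 16:
            raise ValueError("marker is not all ones")
        if not HEADER_LEN <= length <= 4096:
            raise ValueError(f"bad message length {length}")
        if offset + length > len(stream):
            break  # the rest of this message has not arrived yet
        body = stream[offset + HEADER_LEN : offset + length]
        yield MESSAGE_TYPES.get(msg_type, f"UNKNOWN({msg_type})"), body
        offset += length  # Length covers the header, so this lands on the next Marker


# A KEEPALIVE is just the fixed header: Length 19, Type 4.
keepalive = b"\xff" * 16 + struct.pack("!HB", 19, 4)
messages = list(split_messages(keepalive + keepalive))
```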

OPEN Message Format

After a TCP connection is established, the first message sent by each side is an OPEN message. If the OPEN message is acceptable, a KEEPALIVE message confirming the OPEN is sent back. (Violent Angry Hooligans Insult Other Ones)

                           1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+
       |    Version    |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |     My Autonomous System      |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |           Hold Time           |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                         BGP Identifier                        |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       | Opt Parm Len  |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                                                               |
       |             Optional Parameters (variable)                    |
       |                                                               |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Version is a one-octet field that gives the BGP protocol version. The current version is 4.

My Autonomous System is a two-octet field that gives the AS number of the local system.

Hold Time is a 2-octet unsigned integer that indicates the number of seconds the sender proposes for the value of the Hold Timer. Upon receipt of an OPEN message, a BGP speaker MUST calculate the value of the Hold Timer by using the smaller of its configured Hold Time and the Hold Time received in the OPEN message. On Cisco IOS the default hold time is 180 seconds, and the keepalive interval is then 60 seconds.
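
The negotiation is simple enough to write down. A minimal Python sketch, assuming the keepalive interval is one third of the negotiated hold time (the ratio RFC 4271 suggests as reasonable); the function name negotiate_timers is hypothetical, and real implementations may also take a separately configured keepalive into account.

```python
def negotiate_timers(configured_hold: int, received_hold: int):
    """Return (hold_time, keepalive_interval) in seconds."""
    hold = min(configured_hold, received_hold)
    # RFC 4271: the Hold Time must be either zero or at least three seconds.
    if hold in (1, 2):
        raise ValueError("unacceptable hold time")
    # A hold time of zero means no keepalives are sent at all.
    keepalive = hold // 3
    return hold, keepalive
```

With 180 seconds configured locally and 90 seconds received from the peer, negotiate_timers(180, 90) gives a hold time of 90 seconds and a keepalive interval of 30 seconds.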

BGP Identifier is a 4-octet unsigned integer that indicates the BGP Identifier of the sender. On Cisco IOS it is chosen in the following order of preference:

Use the router-ID that was configured manually with the bgp router-id command.
Use the highest IP address on a loopback interface.
Use the highest IP address on a physical interface.

Optional Parameter Length is a one-octet field that gives the total length of the Optional Parameters field in octets.

Optional Parameters contains optional parameters encoded in TLV „triplet“ format (Type, Length, Value). The TLV format is well known for its extensibility, which is why it is used so widely in protocol design; IS-IS, for example, also uses TLV encoding. The Optional Parameters field was therefore meant as a way to extend BGP functionality.

The best-known optional parameter is the Capabilities parameter defined in RFC 5492. It has type code 2 and is currently (2016) the only optional parameter in use. Type code 1 is the Authentication optional parameter, but that is deprecated. The Capabilities parameter is also encoded in TLV triplet format, except that Type is renamed to Code, so it is Code, Length, Value.

Capabilities is thus itself a parameter that extends BGP functionality. So why is there an Optional Parameter with TLV encoding that carries a Capabilities parameter, also with TLV encoding, when both exist to extend BGP functionality? The reason is that RFC 4271 (BGP) specifies that a BGP speaker that receives an optional parameter it does not recognize should tear down the whole BGP session. This behavior is unfortunate and can cause a lot of problems, but it was implemented in all network devices and would be very hard to change. That is why capabilities were introduced: if a BGP speaker does not support one of the advertised capabilities, it does not tear down the BGP session. A BGP speaker learns which capabilities its peer supports by examining the Capabilities optional parameter in the OPEN message it receives from that peer. Capabilities that are not supported are quietly ignored. The exact process is defined in RFC 5492:

A BGP speaker determines the capabilities supported by its peer by examining the list of 
capabilities present in the Capabilities Optional Parameter carried by the OPEN message
that the speaker receives from the peer. If a BGP speaker receives from its peer a capability
that it does not itself support or recognize, it MUST ignore that capability. In particular, the
Unsupported Capability NOTIFICATION message MUST NOT be generated and the BGP session MUST NOT be
terminated in response to reception of a capability that is not supported by the local speaker.
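
The difference between the two layers shows up clearly in a parser. The following Python sketch decodes an OPEN message body (the part after the fixed header) and treats an unknown optional parameter as fatal but an unknown capability as ignorable. The function name parse_open and the short KNOWN_CAPABILITIES table are illustrative assumptions; only three well-known capability codes are listed, and length validation is omitted for brevity.

```python
import ipaddress
import struct

# Capability codes: 1 = multiprotocol extensions, 2 = route refresh, 65 = 4-octet AS numbers
KNOWN_CAPABILITIES = {1: "multiprotocol", 2: "route-refresh", 65: "4-octet-as"}


def parse_open(body: bytes):
    version, my_as, hold_time, bgp_id, opt_len = struct.unpack_from("!BHHIB", body)
    params = body[10 : 10 + opt_len]
    capabilities, i = [], 0
    while i < len(params):
        p_type, p_len = params[i], params[i + 1]
        value = params[i + 2 : i + 2 + p_len]
        i += 2 + p_len
        if p_type != 2:
            # RFC 4271: an unrecognized optional parameter is a session-level error.
            raise ValueError(f"unsupported optional parameter {p_type}")
        j = 0
        while j < len(value):
            code, c_len = value[j], value[j + 1]
            # RFC 5492: an unknown capability is skipped, never fatal.
            if code in KNOWN_CAPABILITIES:
                capabilities.append((KNOWN_CAPABILITIES[code], value[j + 2 : j + 2 + c_len]))
            j += 2 + c_len
    return {
        "version": version,
        "as": my_as,
        "hold_time": hold_time,
        "bgp_id": str(ipaddress.IPv4Address(bgp_id)),
        "capabilities": capabilities,
    }


# Route refresh (code 2, empty value) plus a made-up capability with code 200.
caps = bytes([2, 0]) + bytes([200, 1, 0xAA])
open_body = (
    struct.pack("!BHHIB", 4, 65001, 180, int(ipaddress.IPv4Address("192.0.2.1")), 2 + len(caps))
    + bytes([2, len(caps)])
    + caps
)
parsed = parse_open(open_body)
```

The made-up capability 200 is dropped silently, while the session parameters and the route-refresh capability are kept.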

UPDATE

UPDATE messages are used to transfer routing information (NLRI) between BGP peers. The information in the UPDATE message can be used to construct a graph that describes the relationships of the various Autonomous Systems. An UPDATE message MAY simultaneously advertise a feasible route and withdraw multiple unfeasible routes from service.

(Wet Women Pussy Punish Nudist)

      +-----------------------------------------------------+
      |   Withdrawn Routes Length (2 octets)                |
      +-----------------------------------------------------+
      |   Withdrawn Routes (variable)                       |
      +-----------------------------------------------------+
      |   Total Path Attribute Length (2 octets)            |
      +-----------------------------------------------------+
      |   Path Attributes (variable)                        |
      +-----------------------------------------------------+
      |   Network Layer Reachability Information (variable) |
      +-----------------------------------------------------+

Withdrawn Routes Length – indicates the total length of the Withdrawn Routes field in octets. A value of 0 indicates that no routes are being withdrawn from service and that the Withdrawn Routes field is not present in this UPDATE message.

Withdrawn Routes – this field contains a list of IP address prefixes that are being withdrawn from service.

Total Path Attribute Length – indicates the total length of the Path Attributes field in octets. A value of 0 indicates that neither the Network Layer Reachability Information field nor the Path Attributes field is present in this UPDATE message.

Path Attributes – the BGP attributes that apply to the NLRI carried in this message. It is a variable-length sequence of path attributes that is present in every UPDATE message, except for an UPDATE message that carries only withdrawn routes. Each path attribute is a triple <attribute type, attribute length, attribute value> of variable length.
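
The two length fields are all that is needed to cut an UPDATE body into its three variable parts. Here is a minimal Python sketch for IPv4; split_update and decode_prefixes are hypothetical helper names, the path attributes are returned undecoded, and no error checking is done. Each prefix is encoded as in RFC 4271: one octet with the prefix length in bits, followed by just enough octets to hold that many bits.

```python
import ipaddress
import struct


def decode_prefixes(data: bytes):
    """Decode a run of <length-in-bits, prefix> entries (IPv4 only)."""
    prefixes, i = [], 0
    while i < len(data):
        bits = data[i]
        nbytes = (bits + 7) // 8  # the prefix is padded to a whole number of octets
        raw = data[i + 1 : i + 1 + nbytes].ljust(4, b"\x00")
        prefixes.append(f"{ipaddress.IPv4Address(raw)}/{bits}")
        i += 1 + nbytes
    return prefixes


def split_update(body: bytes):
    """Return (withdrawn prefixes, raw path attributes, NLRI prefixes)."""
    (wd_len,) = struct.unpack_from("!H", body, 0)
    withdrawn = body[2 : 2 + wd_len]
    (pa_len,) = struct.unpack_from("!H", body, 2 + wd_len)
    pa_start = 4 + wd_len
    attrs = body[pa_start : pa_start + pa_len]
    nlri = body[pa_start + pa_len :]  # the NLRI has no length field; it is whatever remains
    return decode_prefixes(withdrawn), attrs, decode_prefixes(nlri)


# An UPDATE that only withdraws 10.0.0.0/8: no path attributes and no NLRI.
body = struct.pack("!H", 2) + bytes([8, 10]) + struct.pack("!H", 0)
result = split_update(body)
```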

The attribute type is a two-octet field that consists of the Attribute Flags octet followed by the Attribute Type Code octet.

Attribute Flags (Oil Towns Pollute Earth)

The first high-order bit (the leftmost bit, that is, the most significant one; the bits are counted in descending order of significance) is the Optional bit. Setting this bit to 1 means the attribute is optional; 0 means the attribute is well-known.

The second high-order bit is the Transitive bit. It defines whether the attribute is transitive (value = 1) or non-transitive (value = 0). Well-known attributes are always transitive, and therefore their Transitive bit is always set to 1.

The third bit is the Partial bit. It defines whether the information in an optional transitive attribute is partial (value = 1) or complete (value = 0). For well-known and optional non-transitive attributes the bit is always 0 (complete). The Partial bit is set to 1 in the following cases:

An unrecognized optional transitive attribute is passed on to peers; the speaker passing it on sets the Partial bit.
An optional transitive attribute is attached by some router other than the originator of the route.

The fourth bit is the Extended Length bit. It defines whether the attribute length field is one octet (value = 0) or two octets (value = 1). The last four bits are unused; they must be zero when sent and are ignored when received.
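
The flag bits map directly onto bit masks. The Python sketch below decodes the start of one path attribute; the function name parse_attribute_header and the dictionary keys are hypothetical, while the masks follow from the bit positions described above.

```python
def parse_attribute_header(data: bytes):
    """Return (flags dict, type code, value length, header size) for one path attribute."""
    flags, type_code = data[0], data[1]
    decoded = {
        "optional": bool(flags & 0x80),         # first high-order bit
        "transitive": bool(flags & 0x40),       # second bit
        "partial": bool(flags & 0x20),          # third bit
        "extended_length": bool(flags & 0x10),  # fourth bit
    }
    if decoded["extended_length"]:
        length = int.from_bytes(data[2:4], "big")  # two-octet length
        header_size = 4
    else:
        length = data[2]                           # one-octet length
        header_size = 3
    return decoded, type_code, length, header_size


# ORIGIN (type code 1): well-known, transitive, one-octet value 0 (IGP).
origin = bytes([0x40, 1, 1, 0])
flags, code, length, hdr = parse_attribute_header(origin)
```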

The following descriptions elaborate on the significance of each attribute category:

Well-known mandatory – An attribute that has to exist in the BGP UPDATE packet. It must be recognized by all BGP implementations. If a well-known mandatory attribute is missing, a NOTIFICATION error is generated, and the session is closed. This is to make sure that all BGP implementations agree on a standard set of attributes. An example of a well-known mandatory attribute is the AS_PATH attribute.
Well-known discretionary – An attribute that is recognized by all BGP implementations but that might or might not be sent in the BGP UPDATE message. An example of a well-known discretionary attribute is LOCAL_PREF.

In addition to the well-known attributes, a path can contain one or more optional attributes. Optional attributes are not required to be supported by all BGP implementations. Optional attributes can be transitive or nontransitive:

Optional transitive – If an optional attribute is not recognized by the BGP implementation, that implementation looks for a transitive flag to see whether it is set for that particular attribute. If the flag is set, which indicates that the attribute is transitive, the BGP implementation should accept the attribute and pass it along to other BGP speakers. Note that the Partial bit in the Attribute Flags octet is then set to 1.
Optional nontransitive – When an optional attribute is not recognized and the transitive flag is not set, which means that the attribute is nontransitive, the attribute should be quietly ignored and not passed along to other BGP peers.

Transitivity is defined by RFC 4271 (BGP) with respect to peers, not with respect to ASes. It tells a peer what to do when it does not recognize an optional attribute. If the attribute is transitive, the peer should pass it on to its other peers and set the Partial bit in the Attribute Flags octet to „1“. If the attribute is non-transitive, the peer should quietly ignore it and not pass it on. This is the transitivity rule. Be aware, however, that individual attributes can have their own specific rules (exceptions), MED being one example. Going by the transitivity rule alone, the MED attribute should always be sent to peers if it is recognized. In practice, MED received from an eBGP peer is not sent on to other eBGP peers (ASes), because RFC 4271 states this explicitly in the MED attribute section (section 5.1.4):

„If received over EBGP, the MULTI_EXIT_DISC attribute MAY be propagated over IBGP to other BGP speakers within the same AS. The MULTI_EXIT_DISC attribute received from a neighboring AS MUST NOT be propagated to other neighboring ASes.“
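
The general transitivity rule (leaving aside per-attribute exceptions such as MED) can be sketched in a few lines of Python. The function name handle_unrecognized is hypothetical; it takes the Attribute Flags octet of an attribute the local speaker does not recognize and returns the flags to forward it with, or None if the attribute is dropped.

```python
def handle_unrecognized(flags: int):
    """Apply the transitivity rule to an unrecognized attribute's flags octet."""
    optional = bool(flags & 0x80)
    transitive = bool(flags & 0x40)
    if not optional:
        # Well-known attributes must be recognized by every implementation.
        raise ValueError("unrecognized well-known attribute")
    if not transitive:
        return None          # optional non-transitive: quietly ignore, do not pass on
    return flags | 0x20      # optional transitive: pass on with the Partial bit set
```

For example, flags 0xC0 (optional, transitive) come back as 0xE0, while flags 0x80 (optional, non-transitive) yield None.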

Origin Attribute

The ORIGIN attribute is a well-known mandatory attribute (Type Code 1) that indicates the origin of the routing update with respect to the autonomous system that originated it. BGP considers three types of origins:

IGP – network, aggregate-address (in some cases) and neighbor default-originate commands.
EGP – old EGP protocol (not supported in IOS anymore)
Incomplete – redistribute, aggregate-address (in some cases), and default-information originate command

Depending on the method used to inject a route into a local BGP table, BGP assigns one of three BGP ORIGIN PA codes: IGP, EGP, or incomplete. The ORIGIN PA provides a general descriptor as to how a particular NLRI was first injected into a router’s BGP table. The show ip bgp command lists the actual ORIGIN code for each BGP route at the far right of each output line.

The rules regarding the ORIGIN codes used for summary routes created with the aggregate-address command can also be a bit surprising. The rules are summarized as follows:

If the as-set option is not used, the aggregate route uses ORIGIN code „i“.
If the as-set option is used, and all component subnets being summarized use ORIGIN code „i“, the aggregate has ORIGIN code „i“.
If the as-set option is used, and at least one of the component subnets has ORIGIN code „?“, the aggregate has ORIGIN code „?“.
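
The three rules translate into a short function. This Python sketch uses the hypothetical name aggregate_origin and the one-letter codes shown by show ip bgp. The rules above do not say what happens with as-set when a component has code „e“ and none has „?“; the last branch is an assumption that follows the general aggregation rule of RFC 4271 (INCOMPLETE wins over EGP, EGP wins over IGP).

```python
def aggregate_origin(component_origins, as_set: bool) -> str:
    """Return the ORIGIN code ('i', 'e' or '?') of an aggregate route."""
    if not as_set:
        return "i"                       # rule 1: without as-set, always 'i'
    if all(o == "i" for o in component_origins):
        return "i"                       # rule 2: as-set, every component is 'i'
    if "?" in component_origins:
        return "?"                       # rule 3: as-set, at least one '?'
    return "e"                           # assumption: not covered by the rules above
```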

Interestingly, when you redistribute a route into BGP with the redistribute command and also inject the same route into BGP with the network command, the network command takes precedence and the ORIGIN code will be „i“.

BGP FSM for neighbor session establishment

The BGP FSM is a long and rather dry topic, with many possible events and transitions between states. To add to the confusion, the BGP FSM in Cisco IOS and IOS XR is not RFC compliant.

This is how it should work according to the RFC (very simplified):

Idle – In this state the router neither listens for a TCP connection nor initiates one.
- When a start event occurs (the administrator has configured BGP), the router initializes all BGP resources for the peer, sets the ConnectRetryTimer to its initial value, initiates a TCP connection, listens for incoming connections, and changes its state to Connect.
- If the BGP peer is configured with transport connection-mode passive (the router should not initiate the TCP connection), the router initializes all BGP resources for the peer, sets the ConnectRetryTimer to its initial value, only listens for incoming connections, and changes its state to Active.
Connect – In this state the BGP FSM is waiting for the TCP connection to be completed, nothing more (the TCP connection was initiated by the event in the previous state).
- On a TCP connection succeeds event, it completes the BGP initialization, sends an OPEN message, and changes its state to OpenSent.
- On a TCP connection failure event, it continues to listen for a connection and changes its state to Active.
- On a ConnectRetryTimer Expired event, it drops the TCP connection, initiates a new TCP connection, continues to listen for incoming connections, and stays in the Connect state.
Active – In this state the BGP FSM is trying to acquire a peer by listening for, and accepting, a TCP connection.
- On a TCP connection succeeds event, it completes the BGP initialization, sends an OPEN message to the peer, and changes its state to OpenSent.
- On a TCP connection failure event, it releases all BGP resources and changes its state to Idle.
- On a ConnectRetryTimer Expired event, it initiates a new TCP connection, continues to listen for incoming connections, and changes its state to Connect.
OpenSent – In this state the BGP FSM waits for an OPEN message from its peer.
- When an OPEN message is received, all fields are checked for correctness. If there are no errors in the OPEN message, the local system sends a KEEPALIVE message, sets the KeepaliveTimer and the HoldTimer, and changes its state to OpenConfirm.
- If there are errors in the OPEN message, the speaker sends a NOTIFICATION message, releases all BGP resources, drops the TCP connection, and changes its state to Idle. Exactly the same happens when a connection collision is detected in this state.
OpenConfirm – In this state BGP waits for a KEEPALIVE or NOTIFICATION message.
- If the local system receives a KEEPALIVE message, it restarts the HoldTimer and changes its state to Established.
- If the local system receives a NOTIFICATION message, it releases all BGP resources, drops the TCP connection, and changes its state to Idle.
- Note that the exchange of KEEPALIVE messages starts in this state. On a KeepaliveTimer Expired event, the local system sends a KEEPALIVE message, restarts the KeepaliveTimer, and remains in the OpenConfirm state.
Established – In the Established state, the BGP FSM can exchange UPDATE, NOTIFICATION, and KEEPALIVE messages with its peer. Each time the local system sends a KEEPALIVE or UPDATE message, it restarts its KeepaliveTimer.
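
The simplified transitions listed above fit into a lookup table. This Python sketch encodes only those transitions; the event names (start, tcp_success, open_ok and so on) are made-up labels for this example, not the event numbers used in RFC 4271, and timers, messages, and error handling are left out.

```python
TRANSITIONS = {
    ("Idle", "start"): "Connect",
    ("Idle", "start_passive"): "Active",
    ("Connect", "tcp_success"): "OpenSent",
    ("Connect", "tcp_fail"): "Active",
    ("Connect", "connect_retry_expired"): "Connect",
    ("Active", "tcp_success"): "OpenSent",
    ("Active", "tcp_fail"): "Idle",
    ("Active", "connect_retry_expired"): "Connect",
    ("OpenSent", "open_ok"): "OpenConfirm",
    ("OpenSent", "open_error"): "Idle",
    ("OpenConfirm", "keepalive_received"): "Established",
    ("OpenConfirm", "notification_received"): "Idle",
    ("OpenConfirm", "keepalive_timer_expired"): "OpenConfirm",
}


def run(events, state="Idle"):
    """Feed a sequence of events through the table and return the final state."""
    for event in events:
        state = TRANSITIONS[(state, event)]
    return state


# The happy path of an actively opening speaker.
final = run(["start", "tcp_success", "open_ok", "keepalive_received"])
```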

That is the RFC theory. Reality in Cisco IOS and IOS XR looks different:

When you enable BGP on routers, they go directly from Idle to Active, even without transport connection-mode passive configured.
The router does not try to initiate a TCP session immediately when going from Idle to Active; instead, TCP initialization is delayed (the „TCP OpenActive delayed“ message in debug).
When the TCP initialization delay ends, the router tries to initiate a TCP connection (the „tcp open active“ message in debug). If successful, it goes to the OpenSent state. Naturally, if this speaker does the active TCP open, the peer does the passive TCP open. Strangely, after the passive TCP open the peer goes from Active to Idle to Connect and then continues to OpenSent.
The rest is probably the same as in the RFC. I did not check the other states closely.