3 December 2012, 18:28

This post describes how multicast next-hop resolution and caching are done on Junos MX routers with Trio-based cards. For more information on multicast replication, see my previous post: Multicast Replication


1/ Introduction:

 

PIM join/prune messages sent by downstream routers add or remove nodes/leaves of the multicast tree. When an MX router receives a PIM join for a given (S;G), the kernel first allocates to this new mcast entry (in inet.1) a multicast next-hop that refers to a list of outgoing interfaces (called the OIL). Each combination of outgoing interfaces has its own multicast next-hop. A given OIL, or mcast NH, can be shared by several streams; see the example below:

 

[Figure sample4: example of an OIL, and its multicast NH, shared by several multicast streams]

 

Kernel Multicast NH Allocation - there are 2 cases: 

 

[1] If the received PIM Join (S;G) refers to a combination of output interfaces already known by the kernel, the kernel attaches the existing multicast NH (linked to the known OIL) to the multicast route. The kernel then pushes the new multicast route to the PFEs.

 

            Mcast Route (S;G) > OILx (NHx)

 

OILx and NHx are already known by the kernel (because they are used by other mcast streams)

 

[2] If the received PIM Join (S;G) results in a new combination of output interfaces (unknown to the kernel), the kernel generates a new multicast NH and attaches it to the multicast route. The kernel then pushes to the PFEs the new multicast route, as well as the new multicast NH and its OIL.

 

            Mcast Route (S;G) > OILy (NHy)

 

OILy and NHy are created by the kernel (because they are not used by any other mcast stream)
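
To make the two cases more concrete, here is a minimal Python sketch of the OIL-to-NH mapping logic described above. This is only an illustration of the concept, not Junos code: the NH IDs, class and function names are all made up.

# Illustrative sketch only: models the "known OIL" vs "new OIL" cases.
class McastNhTable:
    def __init__(self, first_nh_id=1048576):
        self.oil_to_nh = {}          # frozenset(OIL) -> multicast NH id
        self.next_nh_id = first_nh_id

    def resolve_oil(self, oil):
        """Return the multicast NH for an OIL, allocating one if unknown."""
        key = frozenset(oil)
        if key in self.oil_to_nh:                 # case [1]: OIL already known
            return self.oil_to_nh[key], False
        nh_id = self.next_nh_id                   # case [2]: new OIL -> new NH
        self.next_nh_id += 1
        self.oil_to_nh[key] = nh_id
        return nh_id, True

def add_mcast_route(table, source, group, oil):
    nh_id, is_new = table.resolve_oil(oil)
    if is_new:
        print(f"push new NH {nh_id} + OIL {sorted(oil)} to PFEs")
    print(f"push route ({source};{group}) -> NH {nh_id} to PFEs")

table = McastNhTable()
add_mcast_route(table, "10.128.1.10", "232.0.7.1", ["ae91.0"])   # new OIL/NH
add_mcast_route(table, "10.128.1.10", "232.0.7.2", ["ae91.0"])   # reuses the same NH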

 

2/ In practice:

 

Let’s start playing with CLI and PFE commands to better understand these concepts. The setup is depicted below. The mcast stream is (10.128.1.10;232.0.7.1).


 

[Figure sample5: lab setup – Patrick (downstream, ae91) sends a PIM Join to Bob (source side, ae90)]

 

Patrick sends a PIM Join to Bob:

 

The PIM Join (10.128.1.10;232.0.7.1) is handled by Bob's RE (RPD), which first creates a new PIM join entry:

 


sponge@bob> show pim join 232.0.7.1 extensive

Instance: PIM.master Family: INET

R = Rendezvous Point Tree, S = Sparse, W = Wildcard

 

Group: 232.0.7.1

    Source: 10.128.1.10

    Flags: sparse

    Upstream interface: ae90.0

    Upstream neighbor: Direct

    Upstream state: Local Source

    Keepalive timeout:

    Uptime: 03:34:03

    Downstream neighbors:

        Interface: ae91.0

            10.128.1.2 State: Join Flags: S Timeout: 178

            Uptime: 03:34:03 Time since last Join: 00:00:01


 

Then the kernel checks whether the OIL for (10.128.1.10;232.0.7.1) already exists. You can check the known OIL/NH mapping with the following command:

 


sponge@bob> show multicast next-hops

Family: INET

ID          Refcount KRefcount Downstream interface

1048587            2         1 ae101.0


 

In our case, the stream's OIL, made of ae91.0, is unknown. So the kernel allocates a new multicast NH for this OIL and then creates a new mcast route in the inet.1 table:

 


sponge@bob> show route table inet.1 detail

232.0.7.1.10.128.1.10/64(1 entry, 1 announced)

        *PIM    Preference: 105

                Next hop type: Multicast (IPv4) Composite, Next hop index: 1048590

                Address: 0x11e49f3c

                Next-hop reference count: 2

                State: <Active Int Ext>

                Local AS: 65000

                Age: 2:53

                Task: PIM.master

                Announcement bits (1): 0-KRT

                AS path: I

                AS path: Recorded


 

And the new multicast NH now shows up:

 


sponge@bob> show multicast next-hops

Family: INET

ID          Refcount KRefcount Downstream interface

1048584            2         1 ae91.0

1048587            2         1 ae101.0


 

After that, the kernel creates a multicast forwarding cache entry. In our case the multicast sender is not sending the stream yet. By default the cache lifetime is 360 seconds (RFC 4601 recommends 210 seconds for the KAT). The timeout of this entry is reset each time the router receives a data packet matching the entry. In our case the cache lifetime keeps decreasing (the sender does not send traffic).

 


sponge@bob> show multicast route group 232.0.7.1 extensive

Instance: master Family: INET

 

Group: 232.0.7.1

    Source: 10.128.1.10/32

    Upstream interface: ae90.0

    Downstream interface list:

        ae91.0

    Session description: Source specific multicast

    Statistics: 0 kBps, 0 pps, 0 packets

    Next-hop ID: 1048590

    Upstream protocol: PIM

    Route state: Active

    Forwarding state: Forwarding

    Cache lifetime/timeout: 349 seconds

    Wrong incoming interface notifications: 0

    Uptime: 00:00:11
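
The cache-lifetime behaviour can be sketched as follows in Python. The 360-second default comes from the text above; everything else (class and method names) is a simplified model of my own, not RE code.

import time

CACHE_LIFETIME = 360  # default Junos multicast forwarding-cache timeout (seconds)

class McastCacheEntry:
    def __init__(self, source, group):
        self.source, self.group = source, group
        self.expires_at = time.time() + CACHE_LIFETIME

    def on_data_packet(self):
        # each data packet matching (S;G) rearms the timer
        self.expires_at = time.time() + CACHE_LIFETIME

    def expired(self):
        # when this returns True, the entry is removed at RE and PFE level
        return time.time() >= self.expires_at

entry = McastCacheEntry("10.128.1.10", "232.0.7.1")
print("seconds left:", int(entry.expires_at - time.time()))  # decreases while no traffic arrives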


 

In parallel, the kernel creates the mcast route, OIL and NH on the PFEs. Via an RE shell command (root uid required) you can monitor kernel updates. The command is rtsockmon:

 


# rtsockmon -t

[15:23:31] rpd      P    NH-comp    add     nh=comp idx=1173 refcnt=0, af=inet, tid=0 hw_token=0 fn=multicast comp=1048583, derv=1339,1341,

 

[15:23:31] rpd      P    nexthop    add     inet nh=indr flags=0x4 idx=1048590 ifidx=0 filteridx=0

 

[15:23:31] rpd      P    route      add     inet 232.0.7.1,10.128.1.10 tid=0 plen=64 type=user flags=0x800e nh=indr nhflags=0x4 nhidx=1048590 rt_nhiflist = 0 altfwdnhidx=0 filtidx=0


 

What do we see?

 

The kernel first creates a composite NH 1173 (CNH) that contains (as we will see later) the binary tree. Then it links this composite NH to an indirect NH (INH) 1048590, the entry point of the NH chain. Finally, it creates the multicast route: a /64 route (G.S/64) that points to the INH.

 

The NH chain can be summarized as follows:

 

232.0.7.1.10.128.1.10/64 → INH 1048590 → CNH 1173 → LIST OF NHs (each NH can be a unicast or an aggregate NH)

 

To view the complete NH chain, I recommend using the following PFE command:

 


NPC3(bob vty)# show route long_ip prefix 232.0.7.1.10.128.1.10 detail

IPv4 Route Table 0, default.0, 0x0:

Destination   NH IP Addr      Type     NH ID Interface

------------  --------------- -------- ----- ---------

232.0.7.1.10.128.1.10/64                          Indirect 1048590 RT-ifl 331

 

Nexthop details:

1048590(Indirect, IPv4, ifl:0:-, pfe-id:0, i-ifl:0:-)

    1173(Compst, IPv4, ifl:0:-, pfe-id:0, comp-fn:multicast)

        1048583(Aggreg., IPv4, ifl:332:ae91.0, pfe-id:0)

            1378(Unicast, IPv4, ifl:395:xe-3/1/0.0, pfe-id:13)

            1387(Unicast, IPv4, ifl:400:xe-3/3/1.0, pfe-id:15)

 

  RT flags: 0x800e, Ignore: 0x00000000, COS index: 0

  DCU id: 0, SCU id: 0,  RPF ifl list id: 0
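
The chain printed above can be seen as a small tree of next hops. The sketch below is only a conceptual model of that output (the IDs and interfaces are taken from it, the Python classes are mine, not PFE structures).

# Conceptual model of the NH chain shown above.
from dataclasses import dataclass, field
from typing import List

@dataclass
class NextHop:
    nh_id: int
    nh_type: str                       # "Indirect", "Compst", "Aggreg.", "Unicast"
    ifl: str = "-"
    children: List["NextHop"] = field(default_factory=list)

# 232.0.7.1.10.128.1.10/64 -> INH -> CNH -> aggregate -> member links
chain = NextHop(1048590, "Indirect", children=[
    NextHop(1173, "Compst", children=[
        NextHop(1048583, "Aggreg.", "ae91.0", [
            NextHop(1378, "Unicast", "xe-3/1/0.0"),
            NextHop(1387, "Unicast", "xe-3/3/1.0"),
        ]),
    ]),
])

def dump(nh, depth=0):
    print("    " * depth + f"{nh.nh_id}({nh.nh_type}, ifl:{nh.ifl})")
    for child in nh.children:
        dump(child, depth + 1)

dump(chain)   # prints an indented view similar to the "Nexthop details" output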

 


 

If the multicast stream starts before the keepalive timer expires, there is no kernel resolution and all packets are handled by the PFE only. In other words, the ingress PFE performs the RPF check (verifies that the source is reachable via the ingress interface), then the lookup is done and the ingress PFE finds the right INH and finally the composite NH.

 

The composite NH lists the different combinations of the dual binary tree (or unary tree if you use enhanced-ip mode at the chassis level). Indeed, when you have LAGs as outgoing interfaces, there are several possible binary trees. Let's take a simple example, shown below:

 

[Figure sample3: OIL with two LAGs – AE2 (one member: xe-1/3/0.0) and AE3 (two members: xe-3/2/0.0 and xe-3/3/0.0)]

 

AE2 is a LAG with a single member, so for every combination of S and G there is only one forwarding NH (xe-1/3/0.0). On the other hand, AE3 is made of 2 members, so load balancing is done based at least on S and G. As you can see, in this case there are 2 possible binary trees: one where member xe-3/3/0.0 is selected for a given (S;G) and another where xe-3/2/0.0 is selected.
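
A sublist is simply one member picked per aggregate interface of the OIL, so the number of sublists is the product of the member counts. A quick Python sketch of this enumeration (the interface names come from the figure above, the rest is illustrative):

from itertools import product

# One entry per outgoing aggregate interface: the list of its member links.
oil_members = {
    "ae2": ["xe-1/3/0.0"],                      # single member -> no choice
    "ae3": ["xe-3/2/0.0", "xe-3/3/0.0"],        # two members -> two choices
}

# Each combination (one member per LAG) is a possible binary tree ("sublist").
sublists = list(product(*oil_members.values()))
for i, sublist in enumerate(sublists):
    print(f"Sublist {i}: {sublist}")
# -> 1 x 2 = 2 sublists, matching the two possible binary trees described above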

 

Going back to our example, you can see the different binary tree combinations by resolving the composite NH (remember AE91 is made of two 10GE interfaces):

  


NPC3(bob vty)# show nhdb id 1173

   ID      Type      Interface    Next Hop Addr    Protocol       Encap     MTU       Flags  PFE internal Flags

-----  --------  -------------  ---------------  ----------  ------------  ----  ----------  --------------------

 1173    Compst  -              -                      IPv4             -     0  0x00000000 0x00000000

 

[…]

 

Kernel nh-list:

1048583(Aggreg., IPv4, ifl:332:ae91.0, pfe-id:0)

    1378(Unicast, IPv4, ifl:395:xe-3/1/0.0, pfe-id:13)

    1387(Unicast, IPv4, ifl:400:xe-3/3/1.0, pfe-id:15)

 

#sublists: 2

#fwd-nhs per sublist: 1

Expanded sublists:

Sublist 0:

1378(Unicast, IPv4, ifl:395:xe-3/1/0.0, pfe-id:13)

 

Sublist 1:

1387(Unicast, IPv4, ifl:400:xe-3/3/1.0, pfe-id:15)

 

--------------

Sublist 0:

mcast-tree:

nfes:1, hash:0

13,

        Root

    13

 

reverse-mcast-tree:

nfes:1, hash:0

13,

        Root

    13

 

 

[..]

 

Sublist 1:

mcast-tree:

nfes:1, hash:0

15, 

        Root

    15

 

reverse-mcast-tree:

nfes:1, hash:0

15, 

        Root

    15

[...]


 

N.B. The reverse tree is explained in a previous post: Multicast Replication

 

Binary tree combinations are called "sublists". In our case the OIL is made of ae91.0, which is itself made of 2 members: xe-3/1/0.0 (hosted on PFE 13) and xe-3/3/1.0 (hosted on PFE 15).

 

At the end of the multicast lookup a composite NH is found, and then the hashing algorithm selects the right sublist for the given (S;G). Indeed, since the hash keys are configurable (you can add more fields, such as layer-4 ports), the composite next-hop resolution (see above) can only provide the possible binary tree combinations, not the one that will ultimately be selected.
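
Conceptually, the per-(S;G) selection works like the sketch below. The real PFE hash keys and algorithm are configurable and not documented here, so CRC32 is just a stand-in and the function name is mine.

import zlib

def select_sublist(source, group, num_sublists, extra_keys=()):
    """Pick one precomputed sublist for a given flow.

    The real hash keys are configurable (S, G, possibly layer-4 fields);
    CRC32 only stands in for the actual PFE hashing function.
    """
    key = ":".join((source, group) + tuple(extra_keys)).encode()
    return zlib.crc32(key) % num_sublists

# With 2 sublists (as in the output above), each (S;G) maps to one of them:
print(select_sublist("10.128.1.10", "232.0.7.1", 2))   # 0 or 1
print(select_sublist("10.128.1.10", "232.0.7.2", 2))   # possibly the other one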

 

 3/ Kernel Resolution:

 

When a PIM prune is received, the multicast entry is removed at both PFE and RE level. But what happens when PIM joins are still sent periodically (every minute) while the multicast stream stops for at least the keepalive timeout (default 360 seconds)?

 

When no packet for a given (S;G) is received before the keepalive timeout expires, the multicast cache entry is removed at RE level but also at PFE level. In our example, if the sender stops sending (10.128.1.10;232.0.7.1), you can see that the cache entry is removed at RE level (after 360 seconds):

 


sponge@bob> show multicast route group 232.0.7.1

empty

sponge@bob> show route table inet.1 | match 232.0.7.1

empty


 

But also at PFE level, after these kernel updates (rtsockmon traces):

   


# rtsockmon -t

[15:54:48] rpd      P    nexthop    delete  inet nh=indr flags=0x6 idx=1048584 ifidx=0 filteridx=0


[15:54:48] rpd      P    NH-comp    delete  nh=comp idx=1174 refcnt=0, af=inet, tid=0 hw_token=0 fn=multicast comp=1048583, derv=1179,1182,1241,1341,


[15:54:48] rpd      P    route      delete  inet 232.0.7.1,10.128.1.10 tid=0 plen=64 type=user flags=0x818e nh=indr nhflags=0x4 nhidx=1048584 rt_nhiflist = 0 altfwdnhidx=0 filtidx=0


 

Note: PIM entry is still there:

   


sponge@bob > show pim join 232.0.7.1 extensive

Instance: PIM.master Family: INET

R = Rendezvous Point Tree, S = Sparse, W = Wildcard

 

Group: 232.0.7.1

    Source: 10.128.1.10

    Flags: sparse

    Upstream interface: ae90.0

    Upstream neighbor: Direct

    Upstream state: Local Source

    Keepalive timeout:

    Uptime: 2d 23:27:55

    Downstream neighbors:

        Interface: ae91.0

            10.128.1.2 State: Join Flags: S Timeout: 147

            Uptime: 01:45:34 Time since last Join: 00:00:33


 

Now, when the multicast stream starts again, a lookup of the first packet is performed; but since the mcast route has been removed from the PFE, the lookup result is a kernel Resolve NH:

   


NPC3(bob  vty)# show route long_ip prefix 232.0.7.1.10.128.1.10 detail

IPv4 Route Table 0, default.0, 0x0:

Destination   NH IP Addr      Type     NH ID Interface

------------  --------------- -------- ----- ---------

224/4                                              Resolve   983 RT-ifl 0

 

Nexthop details:

983(Resolve, IPv4, ifl:0:-, pfe-id:0)

 

  RT flags: 0x000a, Ignore: 0x00000000, COS index: 0

  DCU id: 0, SCU id: 0,  RPF ifl list id: 0


 

Then a notification carrying the S and G fields of the first multicast packet is punted to the RE (kernel). The resolution request is sent over the internal em0 interface.

 

FPC 3, which hosts ae90 (the ingress PFE that receives the stream), sends the resolve request to the master RE:

   


17:09:14.564414  In IP (tos 0x0, ttl 255, id 14405, offset 0, flags [none], proto: TCP (6), length: 88) 128.0.0.19.14340 > 128.0.0.1.6234: P 258841:258877(36) ack 82496 win 65535 <nop,nop,timestamp 428760235 430306943>

        -----IPC message payload packet-----

                Packet 1:

                        type: next-hop(6), subtype: resolve addr request(14), length: 20, opcode: 1, error: 0,

[|ipc]

                         0201 0000 0005 0200 0000 0013 0800 4500

                         0058 3845 0000 ff06 8346 8000 0013 8000

                         0001 3804 185a 0b14 e1e9 213a 0fe9 8018

                         ffff 7d78 0000 0101 080a 198e 5cab 19a5

                         f67f 0100 0000 1c00 0000 0600 0e00 1400

                         0100 0000 03d7 0000 014b 0040 02ef e800

                         0701 0a80 010a

N.B.:

e8000701 = 232.0.7.1

0a80010a = 10.128.1.10  

 

sponge@bob> show tnp addresses | match "fpc3|master" | match em0

master                   0x1 02:01:00:00:00:05 em0  1500 0 0 3

fpc3                    0x13 02:00:00:00:00:13 em0  1500 4 0 3

 

0x13= 19 = internal IP address of FPC3 = 128.0.0.19

0x1= 1 = internal IP address of Master RE = 128.0.0.1 
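
The group and source can be read straight out of the hex payload; here is a small Python check of the N.B. above (the two hex strings are simply copied by hand from the tcpdump dump):

import socket, struct

group_hex, source_hex = "e8000701", "0a80010a"

def hex_to_ip(h):
    # convert a 32-bit hex string to dotted-quad notation
    return socket.inet_ntoa(struct.pack("!I", int(h, 16)))

print(hex_to_ip(group_hex))    # 232.0.7.1
print(hex_to_ip(source_hex))   # 10.128.1.10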


 

The kernel first checks whether a PIM entry is available for this (S;G), then allocates a new INH / composite NH for this multicast route. The multicast cache entry is created at RE level and the kernel again creates the different NHs and the multicast route on the PFEs.

 


# rtsockmon -t

[16:44:16] rpd      P    NH-comp    add     nh=comp idx=1174 refcnt=0, af=inet, tid=0 hw_token=0 fn=multicast comp=1048583, derv=1179,1182,1241,1341,

 

[16:44:16] rpd      P    nexthop    add     inet nh=indr flags=0x4 idx=1048584 ifidx=0 filteridx=0

 

[16:44:16] rpd      P    route      add     inet 232.0.7.1,10.128.1.10 tid=0 plen=64 type=user flags=0x800e nh=indr nhflags=0x4 nhidx=1048584 rt_nhiflist = 0 altfwdnhidx=0 filtidx=0


 

You can check kernel resolution "hits" with the following CLI command:

 


sponge@bob> show multicast statistics

Instance: master Family: INET

Interface: ae90.0

    Routing protocol:          PIM   Mismatch error:               0

    Mismatch:                    0   Mismatch no route:            0

    Kernel resolve:              1   Routing notify:               0

    Resolve no route:            0   Resolve error:                0

    Resolve filtered:            0   Notify filtered:              0

    In kbytes:            11964471   In packets:             9156666

    Out kbytes:                  0   Out packets:                  0 


 

Subsequent packets are resolved at PFE level by the more specific multicast route previously added by the kernel.

 

Note: You can experience some packet drops during the kernel resolution process.

 

In scaled multicast networks, kernel resolution might consume a lot of kernel "ticks". To avoid that, Juniper throttles resolution requests at PFE level: each Trio-based card is limited to 66 resolutions per second. This PFE command gives you that information:

 


NPC3(bob vty)# show nhdb mcast resolve

Nexthop Info:

   ID      Type    Protocol    Resolve-Rate

-----  --------  ----------  ---------------

  983   Resolve        IPv4               66
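
The throttling itself can be thought of as a per-card token bucket refilled at the resolve rate. The rough sketch below uses the 66/s value from the output above; the bucket model and names are my own assumption, not the actual Trio implementation.

import time

class ResolveThrottle:
    """Crude token-bucket model of the per-card resolve-rate limit."""

    def __init__(self, rate=66, burst=66):
        self.rate, self.burst = rate, burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True          # punt the resolve request to the RE
        return False             # stay within the configured resolutions/second

throttle = ResolveThrottle()
allowed = sum(throttle.allow() for _ in range(1000))
print(f"{allowed} resolve requests punted out of 1000")   # roughly the burst size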


 

To override the default cache timeout (360 s), you can use the following command:

  


edit exclusive

set routing-options multicast forwarding-cache timeout ?

Possible completions:

  <timeout>            Forwarding cache entry timeout in minutes (1..720) 


 

And just for fun, in a lab, you can override the resolve-rate (default 66/sec) with the following hidden command:

   


edit exclusive

set forwarding-options multicast resolve-rate ?

Possible completions:

  <resolve-rate>       Multicast resolve rate (100..1000 per second)


 

 

4/ Can I DoS the kernel by playing with multicast kernel resolution?

 

NOOOOOO !!!!  

 

Indeed, you could imagine sending a lot of multicast packets with random S and G values. Each new stream would trigger a kernel resolution. Resolve requests are first rate-limited by the resolve-rate on the MPC (default 66/sec) and then handled by a multicast discard mechanism explained below.

 

As explained above, the first packet of a given (S;G) triggers a kernel resolution if no multicast route is found at PFE level. The kernel that receives the resolve request first checks whether the (S;G) matches a known PIM entry. If not, the RE sends a PIM prune to the upstream node to force the upstream router to stop forwarding toward itself. If the router is directly connected to the source, it can't send a PIM prune. In parallel, the kernel adds a specific route for the given (S;G) that discards this stream at PFE level (no further kernel resolution toward the RE will be requested for this (S;G)).
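
Put together, the kernel's reaction to a punted resolve request can be sketched like this. It is pseudocode-style Python only: the function and parameter names are made up, and only the NH ID 35 (mdiscard) comes from the traces shown in the example below.

MDISCARD_NH = 35   # multicast discard next hop, as seen in the rtsockmon trace below

def handle_resolve_request(source, group, pim_joins, directly_connected):
    """Illustrative decision logic for a punted (S;G) resolve request."""
    actions = []
    if (source, group) in pim_joins:
        # Known PIM state: build the INH/CNH chain and install the real route.
        actions.append(f"install route ({source};{group}) -> INH/CNH chain")
        return actions
    if not directly_connected:
        # Unwanted traffic from a remote source: prune it upstream.
        actions.append(f"send PIM prune for ({source};{group}) upstream")
    # In parallel, install a discard route so the PFE stops punting this flow.
    actions.append(f"install route ({source};{group}) -> NH {MDISCARD_NH} (mdiscard)")
    return actions

# Bob is directly connected to the source and has no PIM join for 232.0.8.1:
for action in handle_resolve_request("10.128.1.10", "232.0.8.1",
                                     pim_joins=set(), directly_connected=True):
    print(action)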

 

Example: the sender sends an unknown multicast stream (10.128.1.10;232.0.8.1).

 

On Bob, no PIM entry is available for this stream:

 


sponge@bob> show pim join 232.0.8.1

Instance: PIM.master Family: INET

R = Rendezvous Point Tree, S = Sparse, W = Wildcard 


 

A resolve request for (10.128.1.10;232.0.8.1) is sent to the kernel by the FPC hosting ae90.0.

 

As said previously, in this architecture Bob can't send a PIM prune for this (S;G) because it's directly connected to the source; instead, the kernel adds a route for this (S;G) with a specific NH (NH ID 35 = multicast discard):

 


# rtsockmon -t

[18:20:58] rpd      P    route      add     inet 232.0.8.1,10.128.1.10 tid=0 plen=64 type=user flags=0xe nh=mdsc nhflags=0x0 nhidx=35 rt_nhiflist = 0 altfwdnhidx=0 filtidx=0


 

At PFE level the multicast route is now:

 


NPC3(bob vty)# show route long_ip prefix 232.0.8.1.10.128.1.10 detail

IPv4 Route Table 0, default.0, 0x0:

Destination   NH IP Addr      Type     NH ID Interface

------------  --------------- -------- ----- ---------

232.0.8.1.10.128.1.10/64                          mdiscard    35 RT-ifl 331

 

Nexthop details:

35(mdiscard, IPv4, ifl:0:-, pfe-id:0)

 

  RT flags: 0x000e, Ignore: 0x00000000, COS index: 0

  DCU id: 0, SCU id: 0,  RPF ifl list id: 0 


 

A PIM join for this (S;G) can change the NH; otherwise, if the sender stops sending this stream, the entry is removed automatically after the keepalive timer expires (default 360 s).

 

 

David.
