16 January 2012, 21:00

I spent some time on Junos routers (T and MX series) to understand how multicast replication works. I ran some lab tests to better understand how replication is done at the PFE level. This study proved very useful for some multicast issues I experienced in my operational network. The aim of this post is to share what I think I have understood ;-)

 

Patent search engines are usually good tools for finding technical information. For this subject I found 2 patents written by Juniper guys that gave me a lot of relevant information:

 

- Patent US 7,710,963: Binary Trees for multicast traffic

- Patent US 7,263,099: Multicast Packet replication

 

Note: I use some PFE commands that are not supported by JTAC. Although I have never run into an issue with these commands (even on operational nodes), be careful with them ;-)

 

1/ Basic information

 

Multicast replication on Junos is done at the PFE (Packet Forwarding Engine) level. The PFE is the entity within the linecard (FPC, DPC, MPC...) that handles packet processing, route lookup, CoS, packet encapsulation, packet forwarding...

A given card hosts several PFEs. Each PFE supports a given amount of traffic and "connects" one or more physical ports. For instance:

- FPC 3 on T640: 2 PFEs (each PFE supports 20 Gbit/s and connects 2x10GE links or 20x1GE links)

- FPC 4 on T1600: 2 PFEs (each PFE supports 40 Gbit/s and connects, for example, a 4x10GE PIC)

- DPCE 4x10GE on MX960: 4 PFEs (each PFE supports 10 Gbit/s and connects 1x10GE link)

- MPC 3D 16x10GE on MX960: 4 PFEs (each PFE supports 30 Gbit/s and connects 4x10GE links)

...

 

2/ PFE numbering


Within a given slot, PFEs are numbered from 0 to X, from top to bottom (the DPCE 40x1GE is an exception). For example, on the DPCE 4x10GE, PFE #0 connects the link xe-x/0/0, PFE #1 xe-x/1/0, PFE #2 xe-x/2/0 and PFE #3 xe-x/3/0.

 

At the chassis level, PFEs are numbered from 0 to N, where N + 1 is the maximum number of PFEs in the chassis. A current MX960 chassis holds at most 12 linecards of 4 PFEs each, so N = 12 x 4 - 1 = 47. On an MX960, PFEs are thus numbered from the top left (linecard 0) to the bottom right (linecard 11). For example, a DPCE 4x10GE installed in slot 4 has its PFEs numbered, at the chassis level: PFE #16 for the link xe-4/0/0, PFE #17 for xe-4/1/0, PFE #18 for xe-4/2/0 and PFE #19 for xe-4/3/0 (4 PFEs per slot, so slot 4 starts at 4 x 4 = 16). See the diagram below:

 

 

[Figure "mymx": chassis-level PFE numbering on an MX960]
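The numbering rule can be written as a one-line formula. The sketch below is only an illustration: the function name `chassis_pfe` is made up, and it assumes that every slot reserves the same number of PFE IDs, whether or not the installed card uses them all.

```python
def chassis_pfe(slot: int, local_pfe: int, pfes_per_slot: int) -> int:
    """Chassis-level PFE number, assuming each slot reserves pfes_per_slot IDs."""
    if not 0 <= local_pfe < pfes_per_slot:
        raise ValueError("local PFE index out of range for this card")
    return slot * pfes_per_slot + local_pfe


# MX960 with 12 slots of 4 PFEs: the last PFE is N = 12 * 4 - 1.
print(chassis_pfe(11, 3, 4))  # 47
# T1600 with 2 PFEs per FPC (the setup used later in this post).
print(chassis_pfe(4, 0, 2))   # 8
```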

 

 

3/ Multicast Next-hop

 

On Junos, a multicast next-hop is an indirect next-hop that represents, for a given multicast stream (S,G) (S = Source, G = Group), the OIL: the Outgoing Interface List (the replication tree). Here is an example:

Note : multicast routes are also available in the inet.1 table. 

 


show multicast route group 232.0.1.1 

Family: INET

Group: 232.0.1.1

    Source: 10.1.1.1/32 

    Upstream interface: xe-4/0/1.0

    Downstream interface list: <------ This is the OIL

          xe-3/0/2.0 xe-3/1/1.0 xe-6/1/0.0 xe-6/0/3.0


 

For the stream (10.1.1.1, 232.0.1.1), traffic received on the RPF (Reverse Path Forwarding) interface xe-4/0/1.0 is replicated to the 4 interfaces listed in the OIL. Another (S,G) could have the same OIL (it depends on the multicast subscriptions received on the downstream interfaces).

 

To see the multicast Next-hop ID of a given stream, use the detail keyword : 

   


show multicast route group 232.0.1.1 detail

Family: INET

Group: 232.0.1.1

    Source: 10.1.1.1/32 

    Upstream interface: xe-4/0/1.0

    Downstream interface list:

            xe-3/0/2.0 xe-3/1/1.0 xe-6/1/0.0 xe-6/0/3.0

    Session description: Source specific multicast   

    Statistics: 419 kBps, 308 pps, 227637824 packets

    Next-hop ID: 1049784

    Upstream protocol: PIM  


 

As I said previously, an indirect multicast next-hop represents a given OIL. You can see how many multicast next-hops are in use and how many streams are associated with each next-hop:

 


show multicast next-hops            

Family: INET

ID          Refcount KRefcount Downstream interface

1048927            4         2 xe-2/1/0.0

1048956            2         1 xe-2/1/0.0

                               xe-2/1/3.0

1048631            2         1 xe-2/1/0.0

                               xe-2/1/3.0

                               xe-3/1/0.0

1049784            6         3 xe-3/0/2.0

                               xe-3/1/1.0

                               xe-6/1/0.0

                               xe-6/0/3.0

 
For each next-hop ID we get the detail of the OIL and the number of streams that use this OIL (KRefcount). In our previous example, 3 (S,G) streams share the same OIL and therefore the same replication tree. To resolve the indirect next-hop, I query the forwarding table:

show route forwarding-table multicast destination 232.0.1.1
Routing table: default.inet
Internet:
Destination        Type RtRef Next hop           Type Index NhRef Netif
232.0.1.1.10.1.1.1/64
                   user     0                    indr 1049784     4
                                                 mcrt  5481     1
 

At the PFE level, the "real" multicast next-hop is 5481. It points to a multicast replication list ID (we will see this later).
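To picture why several streams can point to one next-hop, here is a minimal sketch that groups streams by OIL. Only the first (S,G) comes from the outputs above. The three other streams are hypothetical, invented to show one OIL shared by 3 streams, and `group_by_oil` is an illustrative name, not a Junos function.

```python
from collections import defaultdict

OIL_A = ("xe-3/0/2.0", "xe-3/1/1.0", "xe-6/1/0.0", "xe-6/0/3.0")

# (source, group) -> OIL. Only the first entry comes from the outputs above;
# the other streams are hypothetical, made up for the illustration.
streams = {
    ("10.1.1.1", "232.0.1.1"): OIL_A,
    ("10.1.1.2", "232.0.1.2"): OIL_A,
    ("10.1.1.3", "232.0.1.3"): OIL_A,
    ("10.1.1.4", "232.0.1.4"): ("xe-2/1/0.0",),
}


def group_by_oil(streams):
    """Group streams by OIL: one shared next-hop per distinct interface set."""
    shared = defaultdict(list)
    for sg, oil in streams.items():
        shared[frozenset(oil)].append(sg)
    return dict(shared)


for oil, sgs in group_by_oil(streams).items():
    print(sorted(oil), "used by", len(sgs), "stream(s)")
```

The number of streams per group plays the role of the KRefcount column: all streams in a group can reuse the same replication tree.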

4/ Binary Tree 

 

The 2 patents gave me a lot of information about the replication algorithm. I learned that each PFE replicates a stream to at most 2 other PFEs. In other words, a PFE that receives a stream (S,G) can replicate it to all its directly connected interfaces (local replications) and to at most 2 other PFEs. Note: in this configuration the internal links (HSL) between the PFE and the fabric might be oversubscribed by multicast.

 

Below is an example of what we call a MULTICAST DUAL BINARY TREE. The binary tree is filled from top to bottom and from left to right.

 

[Figure "dual": example of a multicast dual binary tree]

 

The root PFE is the PFE that receives the stream: the PFE that hosts the RPF interface (it can also have local replications). Leaf PFEs are PFEs that do not have to replicate a stream to other PFEs (e.g. PFE #9 and #10). A non-leaf PFE is a PFE that has to replicate a stream to other PFEs (e.g. PFE #6 and #12).

 

A single tree like this one does not guarantee that a PFE replicates a stream at most twice. Indeed, if PFE #6 is the ingress PFE (the ROOT), it first has to send the stream twice through the fabric, toward PFE #6 (itself) and PFE #12, and then, as a tier 2 node, PFE #6 has to replicate to PFE #9 and PFE #10. PFE #6 therefore replicates the stream 4 times. To avoid this, the Juniper patent says that, to ensure at most 2 replications per PFE, the ingress PFE has to be a leaf PFE. A single instance of the binary tree cannot guarantee that. So Juniper proposes to keep, for each multicast next-hop, 2 instances of the binary tree: the first one like the diagram above, and the second one computed from the first one but in reverse sequential order. With these 2 instances, any PFE is a leaf PFE in at least one of the two trees.
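The claim that the reversed instance always offers a leaf position can be checked by brute force. The sketch below assumes the tree is stored as a flat list filled top to bottom and left to right, with the first two entries in tier 2 and the children of the entry at index i at indexes 2*i + 2 and 2*i + 3. This layout is an assumption for illustration, and the function names are made up.

```python
def is_leaf(order, pfe):
    """True if pfe has no child in the level-order list `order`.

    Assumption: order[0] and order[1] sit in tier 2 (just below the root)
    and the node at index i has its children at indexes 2*i + 2 and 2*i + 3.
    """
    i = order.index(pfe)
    return 2 * i + 2 >= len(order)


def leaf_in_one_instance(order):
    """Check that every PFE is a leaf in the list or in its reverse."""
    reverse = order[::-1]
    return all(is_leaf(order, p) or is_leaf(reverse, p) for p in order)


# Try every tree size up to 48 PFEs (a full MX960).
print(all(leaf_in_one_instance(list(range(n))) for n in range(1, 49)))  # True
```

The reason is simple arithmetic: with n entries, index i is a non-leaf only if 2i + 2 < n, and in the reversed list the same PFE sits at index n - 1 - i. Being a non-leaf in both lists would require n - 1 < n - 2, which is impossible.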

 

So when a PFE receives a multicast stream, and is therefore the ROOT PFE, the first step is to select the right instance of the binary tree, in other words the instance where the ROOT PFE is a leaf. If the ingress PFE appears in neither instance, which means the ROOT PFE has no local replication, it selects the first instance.

Here is the modified example:

 

[Figure "dual5": the modified example, with the 2 instances of the binary tree]

 

As I said, if PFE #14 is the ingress PFE, it selects instance 1 of the binary tree: it has no local replication, so it does not appear in the binary tree. If PFE #9 or PFE #10 is the ingress PFE, it selects instance 1, because only in this instance is it a leaf PFE. Finally, if PFE #6 or PFE #12 is the ingress PFE, it selects instance 2, to avoid more than 2 replications per PFE.

 

5/ How to compute the tree 

 

The other patent gives interesting information, and I can apply its algorithm to my setup.

 

As we saw previously, the binary tree is derived from the multicast replication list: the OIL.

Every PFE uses the same algorithm, so the dual binary tree for a given OIL (multicast next-hop) is the same on each PFE.

From the OIL, the PFE derives the list of PFEs. In our first example the OIL was: xe-3/0/2.0 xe-3/1/1.0 xe-6/1/0.0 xe-6/0/3.0

 

Note: the chassis in this example is a T1600 with Sauza cards. On a T1600 there are 2 PFEs per FPC. Each Sauza PIC, connected to one PFE, hosts 4 10GE links. We can have 2 Sauza cards per FPC.

 

From the OIL, we find the PFE list (chassis numbering): PFE #6, PFE #7, PFE #13, PFE #12. Here the PFE list has as many entries as the OIL has interfaces, but this is not always the case (two interfaces on the same PFE give a single entry). Then the PFE sorts the list: (6; 7; 12; 13). To avoid always starting the packet replication at the same entry (PFE #6 in our case), an algorithm randomizes the starting point. The algorithm given in one of the patents (based on the IFL ID of the RPF interface) does not seem to be fully implemented in Junos; some information is missing.
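The OIL-to-PFE-list step can be sketched as follows. This is an illustration for this T1600 setup only: it assumes that the PIC number in the interface name selects the local PFE (which matches the values above), and the names `pfe_of` and `pfe_list` are made up.

```python
import re

PFES_PER_FPC = 2  # T1600 in this example


def pfe_of(ifname: str) -> int:
    """Chassis-level PFE of an interface such as 'xe-3/0/2.0'.

    Assumption: on this setup the PIC number selects the local PFE.
    """
    m = re.fullmatch(r"[a-z]+-(\d+)/(\d+)/(\d+)(?:\.\d+)?", ifname)
    if m is None:
        raise ValueError(f"unexpected interface name: {ifname}")
    fpc, pic = int(m.group(1)), int(m.group(2))
    return fpc * PFES_PER_FPC + pic


def pfe_list(oil):
    """Sorted PFE list for an OIL; interfaces on one PFE collapse to one entry."""
    return sorted({pfe_of(name) for name in oil})


oil = ["xe-3/0/2.0", "xe-3/1/1.0", "xe-6/1/0.0", "xe-6/0/3.0"]
print(pfe_list(oil))         # [6, 7, 12, 13]
print(pfe_of("xe-4/0/1.0"))  # 8, the PFE hosting the RPF interface
```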

 

So the algorithm selects the starting PFE, which is placed on the left of tier 2. The remaining entries of the sorted PFE list then populate the binary tree horizontally, from tier 2 to tier X. Upon reaching the end of the PFE list, the walk wraps around to the beginning of the list and continues until it reaches the starting PFE again.

In our example, suppose the starting PFE selected by the algorithm is PFE #13. Each PFE computes the first instance of the binary tree and the reverse one (the second instance). The result in our case is:

 

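[Figure "dual4": the 2 instances of the binary tree for the PFE list (6; 7; 12; 13), starting at PFE #13]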
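The construction can be reproduced in a few lines. This is a sketch under the assumptions stated in the text (rotate the sorted list to the starting PFE, fill tier by tier, reverse for the second instance). The way the starting PFE is really chosen is not known, so it is simply passed in, and the function names are illustrative.

```python
def dual_tree(sorted_pfes, start):
    """Return the two instances as level-order lists (tier 2 first)."""
    i = sorted_pfes.index(start)
    first = sorted_pfes[i:] + sorted_pfes[:i]  # wrap around the sorted list
    return first, first[::-1]                  # second instance: reverse order


def tiers(order):
    """Split a level-order list into tiers: 2 nodes in tier 2, 4 in tier 3..."""
    out, width, pos = [], 2, 0
    while pos < len(order):
        out.append(order[pos:pos + width])
        pos += width
        width *= 2
    return out


first, second = dual_tree([6, 7, 12, 13], 13)
print(tiers(first))   # [[13, 6], [7, 12]]
print(tiers(second))  # [[12, 7], [6, 13]]
```

With this layout, PFE #13 (left of tier 2) feeds PFE #7 and PFE #12 in the first instance, and PFE #12 feeds PFE #6 and PFE #13 in the second one.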

 

In our case the first instance is used, because the ingress PFE is PFE #8, which has no local replication: the ingress interface is xe-4/0/1.0.
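The selection rule can be sketched like this, again assuming a level-order list where the children of index i sit at indexes 2*i + 2 and 2*i + 3. Preferring instance 1 when the ingress PFE is a leaf in both instances is an assumption of the sketch, and `select_instance` is a made-up name.

```python
def is_leaf(order, pfe):
    """True if pfe has no child in the level-order list (assumed layout)."""
    i = order.index(pfe)
    return 2 * i + 2 >= len(order)


def select_instance(first, second, ingress):
    """Return 1 or 2: the instance in which the ingress PFE is a leaf.

    A PFE absent from the tree (no local replication) takes instance 1.
    """
    if ingress not in first or is_leaf(first, ingress):
        return 1
    return 2


first, second = [13, 6, 7, 12], [12, 7, 6, 13]
print(select_instance(first, second, 8))   # 1: PFE #8 is not in the tree
print(select_instance(first, second, 13))  # 2: PFE #13 is a non-leaf in instance 1
```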

 

6/ Troubleshooting the tree for LMNR and Ichip architectures 

 

Let's take the same example again (the dual binary tree is shown above, in part 5):

 


show multicast route group 232.0.1.1 detail

Family: INET

Group: 232.0.1.1

    Source: 10.1.1.1/32 

    Upstream interface: xe-4/0/1.0

    Downstream interface list:

            xe-3/0/2.0 xe-3/1/1.0 xe-6/1/0.0 xe-6/0/3.0

    Session description: Source specific multicast   

    Statistics: 419 kBps, 308 pps, 227637824 packets

    Next-hop ID: 1049784 --> Resolved in 5481

    Upstream protocol: PIM  


 

The ingress PFE is PFE #8, hosted by FPC 4 (xe-4/0/1.0). So let's go to FPC 4:

 


start shell pfe network fpc4


 

Now let's get more information about multicast next-hop 5481, the Mcast NH associated with the indirect NH.

 


(vty)# show nhdb id 5481 detail

   ID      Type      Interface    Next Hop Addr    Protocol       Encap     MTU

-----  --------  -------------  ---------------  ----------  ------------  ----

 5481   MultiRT  -              -                      IPv4             -  4470

                 xe-3/0/2.0                            IPv4      Ethernet

                 xe-3/1/1.0                            IPv4      Ethernet

                 xe-6/1/0.0                            IPv4      Ethernet

                 xe-6/0/3.0                            IPv4      Ethernet

 

Nexthop Status:

  Index: 5481 (0x1569), Slot: Unspecified

  Destination (0x200): Multicast Chip9

  Topo link: 0x00000000:0x4a6df201 0x00000000:0x4a996201

  Refcount: 1, Interface: 0, Nexthop Flags: 0x0

 

RE IFL List index:  89


 

Keep the previous diagram (part 5) at hand for each step.

Now we can check the dual binary tree associated with the mcast replication list 89 (see below). As we can see, there are 2 instances, and the first one is used. The ingress PFE is PFE #8 (chassis level), but from the FPC point of view it is PFE #0.

The terms r1 and r2 are, I guess, Replication 1 and Replication 2 (the 2 downstream PFEs of tier 1, so used by the ROOT only); f1 and f2 are, I guess, the downstream PFEs for the other tiers. The nl flag, I guess again, means "leaf node" (see patent), and for me it is only meaningful from tier 2 onward (only when f1 and f2 are used).

So PFE #8 has to replicate to PFE #13 and PFE #6 (we use r1 and r2 because PFE #8 is the ROOT PFE, in tier 1). I guess nl is always equal to 1 for the ingress PFE.

 


(vty)# show nhdb mcast tables

FE Multicast trees for proto(IPv4) :

idx  total  FE address list

---  -----  ------------------------------------------------

 82      3  12 5, 7

            7 5, 12

 89      4  13 6, 7 12  <<< instance 1 (used)

            12 7, 6 13  <<< instance 2 (reverse one)

 90      3  6 9, 11

            11 9, 6

 

Multicast list for PFE 0 (8), proto(IPv4) :

idx r1 r2 f1 f2 nl total single strm idx(chip/strm id)

--- -- -- -- -- -- ----- -----------------------------

 82 12  5        1     0

 89 13  6        1     0

 90  6  9        1     0

--- -- -- -- -- -- ----- -----------------------------

 

Multicast list for PFE 1 (9), proto(IPv4) :

idx r1 r2 f1 f2 nl total single strm idx(chip/strm id)

--- -- -- -- -- -- ----- -----------------------------

 82 12  5        1     0

 89 13  6        1     0

 90  6  9        1     2 1/13 1/15

 

In the "Multicast list for PFE x (y)" headers: x is the PFE numbering at FPC level, y is the PFE numbering at chassis level.
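A possible reading of the r1/r2/f1/f2/nl columns, inferred only from the outputs in this post and not confirmed anywhere, is sketched below. It assumes each tree line is a level-order list (children of index i at 2*i + 2 and 2*i + 3), that r1/r2 are tier 2 of the instance the PFE would use as ROOT, that f1/f2 are its children in the instance where it is a non-leaf, and that nl = 1 when it is a leaf in (or absent from) instance 1. The function names are made up.

```python
def children(order, pfe):
    """Children of pfe in a level-order list (assumed layout)."""
    i = order.index(pfe)
    return order[2 * i + 2:2 * i + 4]


def mcast_row(first, second, pfe):
    """Guess at the (r, f, nl) values one PFE holds for one tree index."""
    leaf_in_first = pfe not in first or not children(first, pfe)
    root_tree = first if leaf_in_first else second
    # Transit role: children in the instance where this PFE is a non-leaf.
    f = []
    for tree in (first, second):
        if pfe in tree and children(tree, pfe):
            f = children(tree, pfe)
            break
    return {"r": root_tree[:2], "f": f, "nl": int(leaf_in_first)}


first, second = [13, 6, 7, 12], [12, 7, 6, 13]  # tree index 89
print(mcast_row(first, second, 8))   # {'r': [13, 6], 'f': [], 'nl': 1}
print(mcast_row(first, second, 13))  # {'r': [12, 7], 'f': [7, 12], 'nl': 0}
```

The first result matches the row for index 89 on PFE 0 (8). Under the same assumptions, the second one predicts what PFE #13 should hold for this index.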


 

Now, let's move on to the next PFE of the tree (tier 2): PFE #13 on FPC 6.

PFE #13 is not a leaf PFE (nl = 0), so it has to replicate the stream to PFE #7 (f1) and PFE #12 (f2). PFE #13 also has one local replication. To resolve the local replication (to get the interface name), use the Local Per Stream IFL list, which gives the IFL index of the interface. Then resolve the IFL: cool, this is xe-6/1/0.0 (one of the downstream interfaces of the OIL).

 


 start shell pfe network fpc6

 

(vty)# show nhdb mcast tables

FE Multicast trees for proto(IPv4) :

idx  total  FE address list

---  -----  ------------------------------------------------

 82      3  12 5, 7

            7 5, 12

 89      4  13 6, 7 12 <<< instance 1 used

            12 7, 6 13

 90      3  6 9, 11

 

Multicast list for PFE 0 (12), proto(IPv4) :

idx r1 r2 f1 f2 nl total single strm idx(chip/strm id)

--- -- -- -- -- -- ----- -----------------------------

 82  7  5  7     0     2 0/1 0/0

 89 13  6  6 13  1     1 0/0

 90  6  9        1     0

 

 

Multicast list for PFE 1 (13), proto(IPv4) :

idx r1 r2 f1 f2 nl total single strm idx(chip/strm id)

--- -- -- -- -- -- ----- -----------------------------

 82 12  5        1     0

 89 12  7  7 12  0     1 1/0 <<< see the "Local Per Stream IFL lists" below

 90  6  9        1     0

 

Local Per Stream IFL lists:

 chip stream index ref  address  List

 ---- ------ ----- ---  -------- ------

    0     48     0  35  0x0000a4 105

    0     32     1  34  0x0000a8 104

    1      0     0  29  0x00000c 103  <<< IFL of the local replication

 


 

(vty)# show ifl 103

Logical interface xe-6/1/0.0 (Index 103, Alias-Index 0 Peer-Index 0)
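The lookup chain used here (strm idx, then Local Per Stream IFL list, then show ifl) boils down to two table lookups. The sketch below copies the FPC 6 values from the outputs in this post. It assumes that the "chip/strm id" pair matches the chip and index columns of the Local Per Stream IFL list, which is what the examples suggest, and `resolve_local` is a made-up name.

```python
# "Local Per Stream IFL lists" on FPC 6: (chip, index) -> List (IFL index)
local_lists = {(0, 0): 105, (0, 1): 104, (1, 0): 103}

# IFL index -> interface name, as reported by "show ifl <index>" in this post
ifl_names = {103: "xe-6/1/0.0", 105: "xe-6/0/3.0"}


def resolve_local(strm, local_lists, ifl_names):
    """Resolve a 'chip/strm id' entry to (IFL index, interface name).

    Assumption: 'strm id' refers to the index column, not the stream column.
    The name is None when the IFL has not been looked up with "show ifl".
    """
    chip, index = (int(x) for x in strm.split("/"))
    ifl = local_lists[(chip, index)]
    return ifl, ifl_names.get(ifl)


print(resolve_local("1/0", local_lists, ifl_names))  # (103, 'xe-6/1/0.0')
```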


 

Now, let's move on to the next PFE of the tree (tier 2): PFE #6 on FPC 3.

PFE #6 is a leaf PFE (nl = 1), so f1 and f2 are not used and PFE replication stops. PFE #6 has one local replication. To resolve the local replication (to get the interface name), use the Local Per Stream IFL list, which gives the IFL index of the interface. Then resolve the IFL: this is xe-3/0/2.0 (another downstream interface of the OIL).

 


start shell pfe network fpc3

 

(vty)# show nhdb mcast tables

FE Multicast trees for proto(IPv4) :

idx  total  FE address list

---  -----  ------------------------------------------------

 82      3  12 5, 7

            7 5, 12

 89      4  13 6, 7 12 <<< instance 1 used

            12 7, 6 13

 90      3  6 9, 11

            11 9, 6

 

Multicast list for PFE 0 (6), proto(IPv4) :

idx r1 r2 f1 f2 nl total single strm idx(chip/strm id)

--- -- -- -- -- -- ----- -----------------------------

 82 12  5        1     0

 89 13  6        1     1 0/1 <<< see the "Local Per Stream IFL lists" below

 90 11  9 11     0     1 0/1

--- -- -- -- -- -- ----- -----------------------------

 

Multicast list for PFE 1 (7), proto(IPv4) :

idx r1 r2 f1 f2 nl total single strm idx(chip/strm id)

--- -- -- -- -- -- ----- -----------------------------

 82 12  5 12     1     1 1/3

 89 13  6        1     1 1/1

 90  6  9        1     0

--- -- -- -- -- -- ----- -----------------------------

 

Local Per Stream IFL lists:

 chip stream index ref  address  List

 ---- ------ ----- ---  -------- ------

    0     48     0  41  0x0001a0 78

    0     32     1  39  0x0001a4 77 <<< IFL of the local replication

    1      0     0  40  0x0000fc 79

    1     16     1  29  0x000100 81

    1     32     2  29  0x000104 80

    1     48     3  36  0x000108 82

 


 

(vty)# show ifl 77

Logical interface xe-3/0/2.0 (Index 77, Alias-Index 0 Peer-Index 0)


 

Now, let's move on to the next PFE of the tree (tier 3): PFE #7 on FPC 3.

PFE #7 is a leaf PFE (nl = 1), therefore f1 and f2 are not used and PFE replication stops. PFE #7 has one local replication. To resolve the local replication (to get the interface name), use the Local Per Stream IFL list, which gives the IFL index of the interface. Then resolve the IFL: this is xe-3/1/1.0 (another downstream interface of the OIL).

 


start shell pfe network fpc3

 

(vty)# show nhdb mcast tables

FE Multicast trees for proto(IPv4) :

idx  total  FE address list

---  -----  ------------------------------------------------

 82      3  12 5, 7

            7 5, 12

 89      4  13 6, 7 12 <<< instance 1 used

            12 7, 6 13

 90      3  6 9, 11

            11 9, 6

 

Multicast list for PFE 0 (6), proto(IPv4) :

idx r1 r2 f1 f2 nl total single strm idx(chip/strm id)

--- -- -- -- -- -- ----- -----------------------------

 82 12  5        1     0

 89 13  6        1     1 0/1

 90 11  9 11     0     1 0/1

--- -- -- -- -- -- ----- -----------------------------

 

Multicast list for PFE 1 (7), proto(IPv4) :

idx r1 r2 f1 f2 nl total single strm idx(chip/strm id)

--- -- -- -- -- -- ----- -----------------------------

 82 12  5 12     1     1 1/3

 89 13  6        1     1 1/1 <<< see the "Local Per Stream IFL lists" below

 90  6  9        1     0

--- -- -- -- -- -- ----- -----------------------------

 

Local Per Stream IFL lists:

 chip stream index ref  address  List

 ---- ------ ----- ---  -------- ------

    0     48     0  41  0x0001a0 78

    0     32     1  39  0x0001a4 77

    1      0     0  40  0x0000fc 79

    1     16     1  29  0x000100 81 <<< IFL of the local replication

    1     32     2  29  0x000104 80

    1     48     3  36  0x000108 82

 


 

(vty)# show ifl 81

Logical interface xe-3/1/1.0 (Index 81, Alias-Index 0 Peer-Index 0)


 

Finally, let's move on to the last PFE of the tree (tier 3): PFE #12 on FPC 6.

PFE #12 is a leaf PFE (nl = 1), therefore f1 and f2 are not used and PFE replication stops. PFE #12 has one local replication. To resolve the local replication (to get the interface name), use the Local Per Stream IFL list, which gives the IFL index of the interface. Then resolve the IFL: this is xe-6/0/3.0 (the last downstream interface of the OIL).

 


start shell pfe network fpc6

 

 

(vty)# show nhdb mcast tables

FE Multicast trees for proto(IPv4) :

idx  total  FE address list

---  -----  ------------------------------------------------

 82      3  12 5, 7

            7 5, 12

 89      4  13 6, 7 12 <<< instance 1 used

            12 7, 6 13

 90      3  6 9, 11

 

Multicast list for PFE 0 (12), proto(IPv4) :

idx r1 r2 f1 f2 nl total single strm idx(chip/strm id)

--- -- -- -- -- -- ----- -----------------------------

 82  7  5  7     0     2 0/1 0/0

 89 13  6  6 13  1     1 0/0 <<< see the "Local Per Stream IFL lists" below

 90  6  9        1     0

 

 

Multicast list for PFE 1 (13), proto(IPv4) :

idx r1 r2 f1 f2 nl total single strm idx(chip/strm id)

--- -- -- -- -- -- ----- -----------------------------

 82 12  5        1     0

 89 12  7  7 12  0     1 1/0

 90  6  9        1     0

 

Local Per Stream IFL lists:

 chip stream index ref  address  List

 ---- ------ ----- ---  -------- ------

    0     48     0  35  0x0000a4 105 <<< IFL of the local replication

    0     32     1  34  0x0000a8 104

    1      0     0  29  0x00000c 103 

 


 

(vty)# show ifl 105

Logical interface xe-6/0/3.0 (Index 105, Alias-Index 0 Peer-Index 0)


 

Done!

 

 

7/ Troubleshooting the tree for TRIO architectures 

 

By default, Trio cards use the dual binary tree like DPCs and FPCs, but an enhanced feature allows the use of more bandwidth between the PFE and the fabric. To use this enhanced feature, the chassis must contain only Trio cards and the feature must be activated via a chassis command.

Troubleshooting the binary tree on Trio is more complex (I have to do more lab tests to better understand the command outputs). Nevertheless, the following command seems to provide interesting information, including an "ASCII binary tree". I will update this part soon.

 


show nhdb id 1769 detail  --> 1769 is the resolved multicast nh
   ID      Type      Interface    Next Hop Addr    Protocol       Encap     MTU       Flags  PFE internal Flags
-----  --------  -------------  ---------------  ----------  ------------  ----  ----------  --------------------
 1769  Multi-RT  -              -                      IPv4             -     0  0x00000000 0x00000000
  Mcast List Index: 26 ()
PFE#0, child#0 = 1776
mcast-tree:
nfes:5, hash:1
38, 8, 11, 23, 35,
                Root
        38              08
    11      23      35 
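The ASCII tree above looks consistent with the same top-to-bottom, left-to-right filling seen on the other architectures. As a hedged illustration (the layout is an assumption and `child_map` is a made-up name), the parent/child relations can be derived from the flat list printed by the command:

```python
def child_map(order):
    """Map each PFE to its children, assuming a level-order list where the
    first two entries hang off the root and index i feeds 2*i + 2 and 2*i + 3."""
    return {pfe: order[2 * i + 2:2 * i + 4] for i, pfe in enumerate(order)}


tree = child_map([38, 8, 11, 23, 35])
print(tree[38])  # [11, 23]
print(tree[8])   # [35]
print(tree[11])  # []
```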


 

David.

Published by junosandme in Posts

Comments

raf 17/11/2015 14:02

btw they introduced QFX10000 recently and guess what, they keep that way of replication there. why?

raf 17/11/2015 14:00

binary replication is not very useful when it comes to fairness and equal latency for all receivers. this is the biggest problem with Juniper's replication. It is completely useless in the finance sector, where the fairness factor is the most important one. Not sure why they cannot implement replication in the switching fabric, spraying packets from PICs directly in round-robin fashion into the switching fabric and letting the SF replicate them wherever the receiver is sitting.